2007 Protein Structure Initiative "Bottlenecks" Workshop

March 19-20, 2007

Natcher Conference Center
NIH Campus, Bethesda, MD

One of the major goals of the Protein Structure Initiative (PSI) is to develop, apply and disseminate methods for faster, cheaper and more reliable determination of protein structure. During the five-year initial phase of this effort, the PSI pilot centers made significant progress in establishing high-throughput protein structure determination pipelines using X-ray crystallography and NMR. The analysis of these pipelines continues to highlight important bottlenecks. As progress is made in a particular area, new technical challenges emerge. To facilitate an exchange of developments and advancements among participants in the PSI, the National Institute of General Medical Sciences (NIGMS) has organized annual workshops on gene cloning, protein expression, purification, protein crystallization and rapid NMR structure determination. The first workshop was held in March 2002. The 2007 edition of this PSI "Bottlenecks" Workshop was held in the Natcher Conference Center on the NIH campus in Bethesda, Maryland, on March 19-20, 2007. More than one hundred and twenty scientists from the current second production phase of the PSI gathered for two days to focus on key topics that remain major challenges in structural genomics and structural biology. This number included participants from the four Large Scale Production Centers and the six Specialized Centers, as well as investigators supported by research project grants, i.e. R01, R21, SBIR and STTR projects. As in past successful workshops, the emphasis was on sharing both failures and successes. The main goal was to provide an effective platform for scientists to exchange ideas and data, to discuss progress and problems, and to encourage contacts and collaborations among participating groups.

Dr. Jeremy Berg, the director of NIGMS, opened the workshop and welcomed the participants. Dr. John Norvell, director of the PSI, introduced the workshop and discussed the current progress and the future of the PSI program, particularly in the context of the present production phase of the effort. In outline, the program on the first day began with an invited presentation from Dr. Shibu Yooseph of the J. Craig Venter Institute, Rockville, Maryland, and also included invited accounts of progress in Structural Genomics projects in Europe by Drs. David Stuart and Aled Edwards and in Japan by Dr. Takashi Yabuki. The remainder of the first day of the workshop consisted of presentations of current progress by the PSI centers. A total of eighty poster abstracts were submitted by meeting attendees. Sixteen of these abstracts were chosen by the organizing committee to constitute the oral program for the second day of the workshop. The remaining abstracts were presented as posters in a session on the first day.

The notes which follow provide an account of the oral program for the 2007 PSI "Bottlenecks" as prepared by members of the organizing committee, Brian G. Fox, University of Wisconsin, Celia Goulding, University of California Irvine, Michael G. Malkowski, Hauptman-Woodward Medical Research Institute, Lance Stewart, deCODE biostructures and Ashley Deacon, Stanford Synchrotron Radiation Laboratory.

Monday March 19, 2007 8:00 AM to 12:00 PM

Session 1 Bottlenecks and Technology for Structural Genomics

Shibu Yooseph
Exploring Microbial Diversity Using Metagenomics (Keynote Presentation)

Dr. Shibu Yooseph of the J. Craig Venter Institute presented metagenomic DNA sequence data gathered from marine microbial samples taken by the Sorcerer II Global Ocean Sampling (GOS) Expedition, which sampled 41 different ocean sites around the world, starting in Nova Scotia and ending in the Galapagos Islands. Dr. Yooseph noted that microbes are found in widely varied environments, from hot springs to the gastrointestinal tract, and have important roles in various natural processes. However, most of our knowledge of microbes is based on those few (estimated to be <1%) that we have been able to cultivate in the laboratory. Thus, we are far from a full understanding of microbial diversity. The rapidly emerging field of metagenomics seeks to understand the roles and interactions of organisms in an ecosystem by examining their genomic content. Shotgun sequencing of microbial communities offers a cultivation-independent approach to studying the genomes of the various organisms within such communities.

The GOS sequence data on microbial samples were derived from 7.7 million shotgun sequence reads, resulting in 6.3 Gb of sequence information, which was then assembled with a modified Celera gene sequence assembler. Due to the complexity of the data, very few genomes could be fully assembled. Nevertheless, the metagenomic sequence data revealed the incredible diversity and heterogeneity in naturally occurring marine microbial populations. An extensive 2-year analysis of the GOS sequence data revealed that over 6 million proteins could be predicted from the 6.3 Gb of marine microbial DNA sequence (91% bacterial / prokaryotic, ~3% eukaryotic, ~3% viruses), nearly doubling the number of known protein sequences. Protein-encoding sequences were predicted as open reading frames of 60 amino acids or longer, bounded by translation start and stop signals and showing clear selective pressure at the codon level, measured as a ratio of non-synonymous to synonymous substitutions (Ka/Ks) of less than 1. Importantly, the sequence data show no sign of reaching a saturation point in the rate of protein discovery. As more sequence data are generated, more new proteins are identified at a constant rate without diminishing return.
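The length criterion described above can be illustrated with a minimal sketch. It is not the GOS prediction pipeline: it assumes ATG as the only start codon, the standard stop codons TAA/TAG/TGA, and a simple six-frame scan, and it does not attempt the Ka/Ks selective-pressure test. The function names are hypothetical.

```python
# Minimal ORF scan (illustrative only, not the GOS pipeline).
# Assumptions: ATG is the only start codon; TAA/TAG/TGA are the stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_AA = 60  # minimum ORF length quoted in the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def find_orfs(dna, min_aa=MIN_AA):
    """Return (strand, frame, start, end) for each ATG..stop ORF of at
    least min_aa codons. start/end are 0-based offsets on the scanned
    strand; end is the position of the stop codon (exclusive bound)."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_aa:
                        orfs.append((strand, frame, start, i))
                    start = None  # look for the next ORF in this frame
    return orfs
```

A sequence of one ATG followed by 59 further sense codons and a stop codon is the shortest input this sketch accepts; anything shorter is discarded.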

A comparison of the GOS microbial proteins to the NCBI non-redundant protein database revealed that the marine microbial proteins cover nearly all known prokaryotic protein families, and appear to add over 1,700 new protein families (published in Rusch et al. PLoS Biol. 2007 Mar 13;5(3):e77). Interestingly, no immunoglobulins were found in the marine microbial DNA sequences, and flagellae and pili appear to have been lost from marine bacteria. Marine bacteria appear to have a very large number of UV damage repair enzymes such as photolyases, which may be an evolved adaptation to constant open exposure to sunlight. Some protein families, such as the indoleamine 2,3-dioxygenase proteins, appear in the marine microbial DNA, yet were previously thought to be restricted to eukaryotic genomes. Similarly, Ribosomal LX proteins, which were previously thought to be encoded only by Archaebacterial genomes, appear in the marine bacterial DNA. Based on sequence comparisons to Pfam families of known structure, approximately 46% of GOS microbial proteins can be assigned a fold and have their 3D structures modeled. Over 16,000 GOS sequences can be classified into the 20 different protein kinase-like families, doubling the number of kinases in these families. In addition, over 6,000 previously defined ORFans, proteins of unknown function, have GOS microbial homologues. Dr. Yooseph noted that the GOS microbial sequence data reaffirm both the growing wealth and complexity of protein sequence data and the paucity of our understanding of the biological systems of the oceans. Metagenomic sequencing projects are expected to discover protein sequences at a pace that will dwarf any experimental structural coverage anticipated from global structural genomics efforts in aggregate.

Gaetano Montelione
Northeast Center for Structural Genomics Technologies For Structural Proteomics

Dr. Gaetano Montelione of Rutgers University and the Robert Wood Johnson Medical School, Principal Investigator of the Northeast Structural Genomics Consortium (NESG), described how the NESG project combines strong parallel efforts in preparing samples for both X-ray crystallography and solution-state NMR spectroscopy with a rigorous evaluation of the complementary values of these methods for genomic-scale structure analysis. NESG focuses on protein domain families targeted from the proteomes of eukaryotic model organisms and human. The targets are selected as representatives from eukaryotic protein domain sequence families so as to provide broad "coverage", and also include human proteins that are involved in biological networks implicated in cancers and developmental diseases. Approximately 2/3 of the NESG structures are produced by X-ray crystallography and the other 1/3 are solved by NMR methods, which are more successful for structure determination of basic proteins that are smaller than 20 kDa.

The NESG is making use of synthetic genes to ensure that full length constructs can be obtained for selected targets, especially those targets where available genomic or cDNA clones do not represent full length open reading frames, or where sequence database annotations of initiator methionines are inaccurate (a common occurrence for eukaryotic targets of unknown function). The use of expression-optimized synthetic genes has been applied to the Human BioNet Target project at NESG with a 64% success rate for expression of synthetic gene constructs in E. coli, and a 36% success rate for soluble protein expression. Careful examination of eukaryotic gene sequences to accurately select full length open reading frames for expression, together with the use of codon-optimized synthetic genes, has significantly improved the success rate for eukaryotic protein production. In combination with synthetic gene design, the NESG makes extensive use of protein construct engineering to arrive at optimized constructs for both crystallography and NMR. A meta-server software system is used to obtain and integrate predictions of order and disorder in protein structure from amino acid sequence information, combining the output of several different web-available structure prediction algorithms. The NESG is also using H/DXMS to identify regions of protein that are disordered. This information is used to direct construct engineering through a series of N- and C-terminal extensions or truncations. H/DXMS is especially useful for resolving questions of protein disorder when NMR data show disorder yet structure prediction algorithms predict order.

Dr. Montelione explained that NESG has also established a "BioNet" systems biology approach for targeting networks of interacting proteins involved in human diseases, including networks of proteins involved in cancer biology. The project has developed an integrated and standardized "high-throughput" process for structure and function analysis of small proteins, generating some 12-15 purified proteins per week, in tens-of-milligram quantities, for crystallization and NMR experiments. Recent improvements in NMR probe technology have enabled NMR structure determination on as little as ~70 µg of protein in a 5-30 µl sample using the Bruker 1.7 mm microprobe. Data collection with the microprobe takes twice as long as more conventional NMR experiments, but requires ~20-fold less protein than is typically required for NMR structure determination with a 5 mm CryoProbe. The combination of the NESG robotic expression system and truncation screening strategy with new NMR microprobe technology and single protein labeling production from ACA-less synthetic genes (see Inouye below) should help to continue the improvement of success rates for gene to structure by NMR methods. In addition, NESG researchers are collaborating with the protein NMR community in the measurement of residual dipolar couplings (RDC) for proteins in partially oriented systems (polyethylene glycols, gels, micelles, membranes, virus particles, etc.). The NESG is collecting RDC data (a measure of the orientation of bond vectors relative to the orientation of the molecule) on all of its protein samples, and has made the data publicly available as a community resource to help validate NMR structures.

The NESG has set up a sophisticated protein structure validation meta-server (www-nmr.cabm.rutgers.edu/psvs) that runs a variety of widely used structure quality assessment algorithms and produces an output report that is suitable for publication (Proteins 2007 66:p778). In order to make the comparison of X-ray crystal structures and NMR structures more robust, the NESG has developed back-calculation methods for comparing protein models to NMR data to derive an R-factor measuring how well the model fits the NMR data, similar to a crystallographic R-factor.

Michael Sauder
NYSGXRC: Progress, Technology, And Bottlenecks

Dr. Michael Sauder of SGX Pharmaceuticals and the New York Structural Genomix Research Center (NYSGXRC) presented the key technologies being applied within the NYSGXRC to overcome bottlenecks and improve gene to structure productivity. The NYSGXRC targets include PSI designated Pfam targets, a variety of phosphatases as part of a biomedical theme project, community nominated enolases/amidohydrolases, and also metagenomic targets from microbial communities collected from the Sargasso Sea. The NYSGXRC is now deploying synthetic genes as the first step in the gene to structure path. So far, 30 new structures have been solved from proteins encoded by synthetic gene constructs. The use of synthetic gene constructs simplifies the cloning strategies used to obtain base constructs and is essential for all Sargasso Sea metagenomic targets, which exist only as sequence information and are not available as genomic or cDNA clones. The synthetic gene approach also simplifies the extensive engineering of mutagenized constructs for difficult targets whose domain architecture is complicated and poorly predicted. For example, a novel haloacid dehalogenase structure recently solved by the NYSGXRC required extensive construct engineering. For the phosphatases, synthetic gene engineering of a Cysteine to Serine point mutation at an active site residue has proven very beneficial for crystallization, as has the use of internal deletions of unstructured insertion sequences predicted from sequence homology studies of the phosphatase superfamily.

Synthetic gene engineering is also being deployed to increase the methionine content of targets whose structures would otherwise be difficult to solve without additional methionines for SeMet labeling in conjunction with anomalous diffraction data collection. The NYSGXRC has recently shifted to a mode of operation where all proteins are produced for crystallization trials with SeMet label. Given the high cost of SeMet substrate, the economics of this new approach required that the NYSGXRC conduct studies on the efficiency of SeMet incorporation using the Medicilon HY ("Pink") medium pioneered by the MCSG. The preferred concentration of the SeMet label was found to be 90 mg/l for most proteins, except for those with more than 10 methionines, which required 120 mg/l SeMet to ensure optimal labeling.
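The SeMet dosing rule reported above can be written as a small helper. This is only a sketch of the stated rule: the function name is hypothetical, and counting every "M" in the one-letter sequence (including the initiator methionine) is an assumption not specified in the presentation.

```python
def semet_concentration_mg_per_l(protein_seq):
    """Suggest a SeMet concentration (mg/l) from the rule reported in the text.

    Assumption: every 'M' in the one-letter sequence is counted,
    including the initiator methionine.
    """
    met_count = protein_seq.upper().count("M")
    # More than 10 methionines: 120 mg/l; otherwise the default 90 mg/l.
    return 120 if met_count > 10 else 90
```

For example, a construct with exactly 10 methionines would stay at the default concentration, while an eleventh methionine would move it to the higher one.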

Reductive methylation is also used by the NYSGXRC to chemically modify proteins for improved crystallization, with a ~10% success rate for generating crystals from proteins that were otherwise not producing crystals. In addition, Dr. Sauder described the utility of robotic seeding to improve crystallization outcomes. For this seeding method, the most promising first-pass crystallization hits (microcrystals and/or quasi-crystalline precipitate) are crushed in solution and then used as an additive in follow-on crystallization experiments set up with robotic liquid handlers. Early successes with this approach suggest that it may be of general utility for improving crystallization outcomes.

The NYSGXRC first uses E. coli for expression of protein targets with N-terminal His6 tags. If this results in poor expression, then construct engineering with terminal deletions or point mutations is applied together with the production of the engineered constructs as N-terminal fusions to the His6-Smt tag, a ubiquitin-like protein that can be efficiently removed with the Ulp protease to produce the protein of interest without any extra tag-encoded amino acids. The NYSGXRC has begun to use recombinant baculovirus-insect cell expression systems to produce difficult high value eukaryotic proteins which do not express well in E. coli, including several human phosphatases. The LIMS system used by the NYSGXRC is capable of tracking all samples from gene to structure, and generates web viewable reports for ease of data sharing.

James Love
New York Consortium on Membrane Protein Structure

Dr. James Love of the New York Structural Biology Center and the New York Consortium on Membrane Protein Structure (NYCOMPS) presented the approach adopted by NYCOMPS for gene to structure research on integral membrane proteins. The NYCOMPS pipeline starts with hundreds of selected bacterial integral membrane protein targets with non-redundant predicted domain structure from a variety of species (~20-30 orthologs from different genomes). The genes encoding the target open reading frames are cloned into an IPTG inducible Kanamycin resistant pET-24a E. coli expression vector (Ampicillin resistance is not preferred for membrane protein production in E. coli). The membrane protein open reading frames are cloned into this vector with their native N-terminus unaltered but with a monomeric Enhanced Green Fluorescent Protein (EGFP)-His10 fusion at their C-terminus. An engineered Tev protease cleavage site at the C-terminus of the target protein allows the EGFP-His10 tag to be proteolytically removed prior to crystallization trials. All cloning and expression testing is conducted within a NYCOMPS centralized core lab that has installed the Sesame database system for gene to structure tracking developed by the CESG. The cloning core lab performs parallel small scale bacterial culture expression testing by IPTG induction at 37 °C and 18 °C. Following protein production, the cultures are robotically lysed by sonication, and cell extracts are monitored for recombinant protein production by SDS-PAGE and anti-EGFP immunoblotting. High expressing constructs are then subjected to multi-well block plate culture and semi-automated small scale membrane preparation by ultracentrifugation. Purified membranes are then subjected to a panel of detergents to determine which one is optimal for the successful solubilization of the protein, as monitored by small scale HPLC analysis with fluorescence monitoring of the EGFP tag.
Many expressed membrane proteins appear to be sensitive to proteolytic breakdown, visualized as sub-full length protein products in immunoblot analysis of HPLC eluates, underscoring the difficulties of working with integral membrane proteins. Nevertheless, progress is being made to introduce purified detergent solubilized membrane proteins into lipidic cubic phase material for examination by cryo-electron microscopy. The identification of 2D crystals will allow the use of solid state NMR to pursue the structure determination of the protein targets.

Andrzej Joachimiak
Midwest Center For Structural Genomics

Dr. Andrzej Joachimiak from Argonne National Laboratory, Principal Investigator for the Midwest Center for Structural Genomics (MCSG), presented a variety of technologies being applied within the MCSG pipeline to address key bottlenecks and improve gene to structure productivity. Dr. Joachimiak noted that gene to structure pipelines show attrition at all steps. In order to effectively pursue the structure determination of chosen targets within designated protein families, the MCSG selects several orthologues of a protein target from a variety of different "reagent" genomes. By choosing related protein targets from different organisms, numerous protein variants are effectively pursued. In addition, the MCSG has found that a limited number (<5) of carefully designed constructs of each chosen target can improve soluble protein expression from 24%, when only single Pfam domain constructs are examined, to 45%, when an additional 4 different "domain expanded" constructs (lengthened N-terminally or C-terminally to include any contiguous predicted secondary structure) are examined. The combination of orthologues and domain expanded construct engineering for selected targets requires significant pipeline capacity in construct engineering and expression testing. The MCSG has simplified construct engineering and transfer between different expression vectors by making use of ligation independent cloning (LIC) to clone desired target genes into a Gateway® transfer vector, which can be used to efficiently move target genes to a variety of bacterial, eukaryotic, and baculoviral expression vectors.

The MCSG has developed a special pMCSG19c vector system to produce protein targets fused at their N-termini to Maltose Binding Protein (MBP), wherein the linker region contains a site specific TVMV protease cleavage site, six histidines, and a site specific Tev protease cleavage site. The plasmid co-expresses the TVMV protease at the same time as the MBP-Target is expressed. Hence the in vivo proteolytic removal of the MBP during induced protein expression results in the production of His6-Tev-Target protein, which can be purified by immobilized metal chelate chromatography (IMAC). Removal of the His6-tag is achieved by on-column cleavage: automated chromatography protocols are used to load cell extracts containing the target protein, followed by delivery of a GST-Tev protease fusion to the column. The protease sticks to the column but has a sufficient on-off rate that it is effectively free to cleave the target protein. After on-column incubation with the GST-Tev protease, the released Target protein can be simply eluted in a wash buffer, leaving any uncleaved fusion protein and the GST-Tev protease bound to the column; these can be eluted later in a column recharging step.

Other methods used by MCSG to pursue difficult targets include reductive methylation to improve crystal growth with a 6-8% success rate, limited proteolysis and LC-MS to define stable protein domains, oxidative refolding to recover properly disulfide bonded and folded protein from insoluble inclusion bodies, and the production of glycosylated secreted proteins from the baculovirus/insect cell system. The MCSG has also begun to pursue eukaryotic proteins from nematode and yeast using recombinant baculovirus expression systems. Co-expression vectors have been set up for possible co-expression of up to 5 different proteins, and co-expression of various proteins in these vectors is now being tested. The MCSG is also now mixing proteins based on biological information to try to solve the crystal structures of protein-protein complexes.

The MCSG solves 95% of its structures by SAD or MAD methods using crystals produced from SeMet labeled proteins in the majority of cases, and to a lesser extent bromine bound crystals produced by soaking with NaBr containing buffers (used to solve structures of domains with no encoded methionines). The MCSG has pioneered the use of modified minimal "Pink" medium (now commercialized as Medicilon HY medium) for efficient SeMet labeling and high-level production of recombinant proteins expressed in E. coli. The MCSG has analyzed the crystallization success of targets in comparison to their calculated grand average of hydropathicity (GRAVY) and pI, based on over 1,600 proteins successfully crystallized. The data suggest a strong tendency for crystallizable proteins to fall into specific regimes, with a general trend towards low pI (4.5 to 7.5) and a slightly hydrophilic nature with a GRAVY score of between -0.8 and 0.1 (a positive number is hydrophobic). This information is being used to help pre-select targets and design constructs for successful crystal structure determination.
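As an illustration of how such a pre-selection filter might be computed, the sketch below calculates GRAVY as the mean Kyte-Doolittle hydropathy of a sequence and checks a target against the pI and GRAVY ranges quoted above. It is not MCSG software: the function names are hypothetical, the pI is assumed to be supplied by the caller from a separate tool, and treating the quoted ranges as hard inclusive cut-offs is an assumption.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein_seq):
    """Grand average of hydropathicity: mean Kyte-Doolittle value per residue.

    Raises KeyError for non-standard residue codes and ValueError for an
    empty sequence.
    """
    seq = protein_seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def in_reported_regime(gravy_score, pi):
    """True if a target falls inside the ranges quoted in the text
    (GRAVY -0.8 to 0.1, pI 4.5 to 7.5), treated here as inclusive bounds."""
    return -0.8 <= gravy_score <= 0.1 and 4.5 <= pi <= 7.5
```

A caller would compute `gravy(seq)` for each candidate construct, obtain the pI from whichever tool is already in use, and pass both values to `in_reported_regime` as one of several selection criteria.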

The efficiency of the MCSG pipeline in producing X-ray diffraction data sets from prepared crystals has generated a bottleneck between X-ray diffraction data collection and structure determination. To address this bottleneck, MCSG researchers have developed the HKL3000 software package, which allows fully automated structure determination from raw diffraction data. The software includes special algorithms for processing diffraction data from very badly behaved crystals where as few as 1 out of 4 diffraction spots are useful. In addition, the HKL3000 software is now integrated with SHELX and SOLVE/RESOLVE to allow automated heavy atom structure solution, calculation of electron density maps, model building, map fitting and refinement. Sometimes ligands of unknown chemical structure are found bound to proteins as unexplained electron density. MCSG has developed a Global Protein Surface Survey server software package to help identify the nature of such ligands found in undefined electron density of protein crystal structures. In the last 2 years of PSI-2, the MCSG has solved more structures than in all 5 previous years in PSI-1.

Scott Lesley
Protein production and crystallization at the Joint Center for Structural Genomics

Dr. Scott Lesley from The Scripps Research Institute and the Joint Center for Structural Genomics (JCSG) presented a number of technologies being applied as salvage pathways to address key bottlenecks for difficult targets within the JCSG gene to structure pipeline. Target selection for the JCSG has focused on solving structures for novel protein families, including protein targets derived from the Global Ocean Survey metagenomic project. The JCSG is deploying smart target selection based on data mining of success factors from their PSI-1 campaign. Calculable physical factors that seem to be positively correlated with gene-to-structure success are: low molecular weight, low pI, high UV absorbance (which is related to the content of the aromatic amino acids Tyr, Trp, and Phe), high SeMet phasing power (which is related to Met content), low Cys content or disulfide bond complexity, low predicted disorder, the availability of genomic DNA, and a low number of predicted transmembrane helices. Targets with these features are more likely than other targets to result in a successful crystal structure within a high-throughput environment with iterative processing. Within this selected group, however, there are targets that still resist structure determination, and the JCSG is deploying a variety of so-called "salvage" pathways to address such difficult targets.

One of the most important salvage approaches at JCSG is to deploy the use of Polymerase Incomplete Primer Extension (PIPE) to rapidly generate numerous constructs with N- and C-Terminal truncations as well as surface residue mutations. The mutagenized constructs are subject to a parallel microscreening procedure where 1 ml scale cultures in 96 well blocks are processed to determine soluble protein levels by small scale batch immobilized metal chelate chromatography. Eluates are subject to quantification, dynamic light scattering, LC-MS, and analytical size-exclusion chromatography (AnSEC) which is used to identify truncations that improve the aggregation profile of the protein. H/DXMS studies are now routinely used by JCSG to identify regions of protein disorder to inform construct design for microscreening. Reductive methylation is also used to chemically modify proteins for improved crystallization with reasonable success rates that justify its use as a standard salvage pathway at JCSG.

In addition to these proven salvage pathway methods, the JCSG is continuously testing new, potentially useful salvage methods, including high pressure refolding to recover insoluble protein. However, an analysis of 35 insoluble proteins revealed that only 6 of the samples could be refolded into a mono-disperse soluble form, none of which produced crystals. Since high pressure refolding was no more effective at recovering insoluble protein than the more traditional use of chaotropic agents followed by buffer exchange or dilution, the JCSG has abandoned high pressure refolding as a method of salvage. In contrast, the use of NMR to look at the spectra of panels of ligands in the absence and presence of target protein is proving to be a powerful way to identify ligands that can help stabilize protein samples and promote their crystallization as a ligand complex. The JCSG is also using mechanism-based covalent chemical modifiers to generate chemical adducts of proteins, which may aid in their crystallization and also help to identify enzymatic function for targets of unknown function. Dr. Lesley noted that 30% of the JCSG pipeline is currently dedicated to the salvage pathways, wherein 70% of difficult target structures can be resolved with one or more salvage approaches.

Chang-Yub Kim
Development of a High-Throughput Ligand Identification Technique and Application to Structural Genomics at the Integrated Center for Structure and Function Innovation

Dr. Chang-Yub Kim of the Los Alamos National Laboratory and the Integrated Center for Structure and Function Innovation (ISFI) presented a number of synergistic technologies being applied within ISFI to overcome recognized bottlenecks in structure determination at the key steps of production of soluble protein and protein crystallization. To improve the success rate in crystallization, ISFI researchers have established an affinity based screen to identify ligands that bind to target proteins, and then use the ligands as additives to promote improved crystallization of the proteins. In this method, proteins (pure or in crude cytosolic extracts) are bound to an Affi-gel blue dye affinity resin, and then are eluted with a series of defined ligands. About 50% of the proteins, including many nucleotide-binding proteins, bind to this resin. Protein-ligand interactions are then identified by the ability of various ligands (natural and xenobiotic) to promote the elution of proteins from the Affi-gel. The eluted proteins are identified by 2D gel electrophoresis and mass spectrometry-based fingerprinting of tryptic peptides and matching to genomic databases. The ISFI has used this procedure to identify thirteen proteins from Mycobacterium tuberculosis (Mtb) that were selected for cloning and crystallization based on elution of total cell extracts bound to the affinity resin. The ISFI has solved the structure of one protein that was identified based on the ligand identification approach (Rv0223c, a putative aldehyde dehydrogenase) with a bound nucleotide, and has generated a series of crystals of this protein with nucleotides having related structures. Ligands that promote elution of defined proteins are being included as additives in crystallization trials with these target proteins so as to promote the crystallization of protein-ligand complexes whose structure determination would be highly relevant as a starting point for structure based drug design.
The ISFI now has several examples of native proteins that did not crystallize on their own but yielded crystals with added ligand once the ligand specificity was known. In this regard, the affinity gel elution procedure is also being used with drug candidates for treatment of tuberculosis to elute proteins of potential therapeutic relevance. Once such proteins are identified, they are then produced and subjected to co-crystallization trials with the drug candidates used to elute the proteins from the Affi-gel. The same procedures are also being applied to aid in the purification of the proteins of interest.

Takashi Yabuki
RIKEN Structural Genomics/Proteomics Initiative

Dr. Takashi Yabuki of the RIKEN Institute in Yokohama, Japan, presented an overview of the technologies applied to structural genomics research conducted within the RIKEN Structural Genomics Initiative (RSGI) under the nationally funded Protein 3000 Project that was conducted from April 2002 to March 2007. During the complete term of the project, over 2,500 protein structures were solved, half by NMR using the world's largest NMR farm and half by X-ray crystallography making use of the SPring-8 synchrotron radiation facility. Almost all of the NMR structures were of human and mouse protein domains, produced in cell free translation systems. A two-step PCR method has been developed for rapid preparation of template DNA used for in vitro transcription reactions to produce mRNA for the cell free translation systems. Designed primers are used to amplify open reading frames from cDNA encoding target protein domains. In a second step of PCR, pre-assembled 5' and 3' sequences are amplified together with the target ORF such that the final PCR product contains a 5' Promoter, an initiator ATG followed by an encoded N-terminal affinity tag, the ORF, a C-terminal tag and GFP reporter with Stop codon, and finally a transcription terminator. The amplified templates are stored in 2D barcoded vials in a robotic Automated Cryogenic Clone Storage System (ACCESS), where they can be rapidly accessed for use in in vitro transcription reactions to produce mRNA template for automated cell free protein production. RIKEN has made significant investments in automation of almost all steps of gene to protein and crystal preparation. In order to assess the folded nature of proteins produced in the cell free protein production system, RIKEN makes use of the C-terminal GFP reporter system developed by Geoffrey Waldo and colleagues at Los Alamos National Labs and ISFI (Nature Biotechnology, 1999, Vol. 7:691-695).
When the protein of interest fails to fold properly, the folding of the GFP reporter is also blocked. Hence, fluorescence measurements in cell free protein production provide a good estimate of the amount of soluble protein produced. RIKEN has made a large investment in the development of dialysis systems at various scales for protein sample preparation after cell free production and metal chelate affinity chromatography purification. Sample dialysis efficiency is important for preparing samples for NMR studies in the high throughput RIKEN pipeline. In addition to heavy automation, RIKEN has also developed an extensive database system for tracking and storing all gene to structure research data and workflow. The database system allows for the definition of both fixed and flexible workflow tracking. RIKEN researchers have also used synthetic biological approaches to develop modified tRNA synthetases that charge stop codon suppressor tRNAs with iodo-tyrosine, allowing site-specific incorporation of iodo-tyrosine into proteins for heavy atom phasing of protein crystal structures. Protein crystallization has been automated at submicroliter drop volumes using under-oil crystallization, and at the nanovolume scale using the Fluidigm microfluidic crystallization devices as well as the TTP Labtech Mosquito robot for sitting drop crystallization trials. The cell free protein production system is also being adapted to the production of Zn++ binding enzymes, where Zn++ is included in the translation mix, and to integral membrane proteins, where detergents or membranes are included in the translation mix. In 2007, RIKEN will launch a new structural proteomics initiative with a focus on protein-protein complexes and protein targets of biomedical importance.
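The layout of the two step PCR product described above can be sketched in a few lines of Python. This is only an illustration of the element order (promoter, ATG, N-terminal tag, ORF, C-terminal tag, GFP reporter, stop, terminator); every sequence below is a hypothetical placeholder and not one of RIKEN's actual parts, and the function name assemble_template is an assumption made for this sketch.

```python
# Sketch of the element order in the final PCR template.
# All sequences are hypothetical placeholders, NOT RIKEN's actual parts.
PROMOTER = "TAATACGACTCACTATAGGG"   # assumed promoter segment
START = "ATG"                       # initiator codon
N_TAG = "CATCACCATCACCATCAC"        # assumed N-terminal affinity tag (6 His codons)
C_TAG = "GGTTCTGGT"                 # placeholder C-terminal tag/linker
REPORTER = "GTGAGCAAGGGC"           # short stand-in for a GFP reporter ORF
STOP = "TAA"
TERMINATOR = "CTAGCATAACCCC"        # placeholder terminator segment

STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_codons(seq):
    """Split an in-frame DNA string into a list of codons."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def assemble_template(orf):
    """Return promoter + ATG + N-tag + ORF + C-tag + reporter + stop + terminator."""
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    body = split_codons(orf)
    # Drop a native stop codon so the C-terminal tag and reporter are read through.
    if body and body[-1] in STOP_CODONS:
        body = body[:-1]
    if any(codon in STOP_CODONS for codon in body):
        raise ValueError("ORF contains an internal stop codon")
    coding = START + N_TAG + "".join(body) + C_TAG + REPORTER + STOP
    return PROMOTER + coding + TERMINATOR


if __name__ == "__main__":
    print(assemble_template("GCTGAAAAATAA"))
```

Keeping the reporter downstream of the ORF in the same reading frame is what lets reporter fluorescence report on the folding of the upstream domain; the frame and stop codon checks in the sketch guard that property.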

Monday March 19, 2007 1:00 PM to 4:00 PM

Session 2 Bottlenecks and Technology for Structural Genomics

Robert Stroud
The Specialized Center for Structure of Membrane Proteins

Dr. Robert Stroud of the University of California, San Francisco, PI for the PSI2 Specialized Center for Structure of Membrane Proteins (CSMP), presented a series of technologies being used within the CSMP to produce membrane proteins for structural elucidation. Members of the CSMP have selected ~170 integral membrane protein targets (with more than 3 predicted transmembrane spanning regions) from E. coli, P. aeruginosa, thermophilic archaea, human cellular, and human mitochondrial genomes. Cloning of the target ORFs into T7 inducible E. coli expression vectors is mediated by ligase independent cloning (LIC), wherein the protein is encoded with a TEV protease removable His6x tag or a thrombin removable His6x-MISTIC tag. Several LIC vectors for expression of proteins in yeast species have also been developed. The CSMP has also used gene synthesis to redesign genes from the malarial parasite Plasmodium falciparum so that their codon usage is better matched to the intended E. coli expression system. This synthetic gene approach has been used to successfully express, purify and crystallize a Plasmodium aquaporin. In addition to in vivo expression of membrane proteins in E. coli, the CSMP has also used cell-free in vitro protein production from E. coli cell extract that is prepared in-house. A comparison of in vivo and in vitro cell-free protein production revealed that the two methods are largely complementary with respect to successful protein production. Membrane protein complexes are also being pursued at CSMP, which recently resulted in the structure determination of the Amt-GlnK complex. Co-expression of proteins in E. coli is being facilitated by a re-engineered DUET vector system that encodes affinity tags on both expressed proteins.
Of the ~170 integral membrane proteins being pursued by the CSMP, ~60 are considered high priority targets; 9 of these have been purified to a homogeneous state, and an additional 8 have been crystallized.

Michael Malkowski
The Center for High-Throughput Structural Biology

Dr. Michael Malkowski of the Hauptman-Woodward Institute discussed the technologies under development within the PSI2 Specialized Center for High-Throughput Structural Biology (Dr. George DeTitta, PI). The crystallization laboratories of the Hauptman-Woodward Institute have an efficient service crystallization core that will screen proteins in 1534 conditions in microbatch under oil. Images of the resulting crystallization drops are sent electronically, with e-mail notification to customers. Collaborator Dr. Igor Jurisica at the Ontario Cancer Institute is developing image analysis algorithms to automatically classify images of crystallization drops, and is in the process of porting the image-analysis system to a distributed system to be run on the World Community Grid. Dr. Joseph Luft at the Hauptman-Woodward Institute has established a simple protein:crystallant drop volume ratio and temperature (DVR/T) screening method that provides an efficient way to screen crystallization phase space. The Hauptman-Woodward team is also engaged in the development of a simple pipette tip microcapillary crystallization system in collaboration with Dr. Mike Soltis at the Stanford Synchrotron Radiation Lab, with the goal of routine in situ X-ray data collection from cryopreserved capillary grown crystals. Collaborators at Cornell University, Drs. Yi-Fan Chen and Sol Gruner, have established methods for high pressure noble gas cryopreservation of crystals through the formation of high density amorphous ice at 200 MPa, which overcomes the ice formation and crystal damage often observed in standard cryopreservation methods at atmospheric pressure. A collaborator at the University of California, Irvine, Dr. Alexander McPherson, has identified a number of small molecules (both natural and synthetic) with general utility for promotion of protein crystal growth.
Such crystal promoting small molecules share the architectural features of a small carbon ring or branched carbon core with various appendages that tend to be strong hydrogen bond donors or acceptors, such as carboxylate, sulfonate, phosphate, or nitrate, that can effectively compete with water molecules for binding to protein surfaces. Collaborators at the University of Rochester, Elizabeth Grayhack, Mark Sullivan, Mark Dumont and Kathy Clark, are engineering the yeast S. cerevisiae for the efficient production of selenomethionine labeled proteins, transmembrane proteins, and single chain antibody Fv domains as protein affinity reagents, with the overall goal of using the growth potential of yeast to produce significant quantities of stable recombinant transmembrane protein for structural studies.

John Markley
Technology Development at the Center for Eukaryotic Structural Genomics

Dr. John Markley of the University of Wisconsin, PI for the PSI2 Specialized Center for Eukaryotic Structural Genomics (CESG), presented the active pipeline research on the structure determination of eukaryotic proteins by both NMR and X-ray crystallography. The CESG is currently producing ~30 new protein structures per year, with a concurrent focus on technology developments to improve the economics of structure determination, including the extensive use of cell-free protein production using robotic wheat germ and E. coli cell free extracts for protein synthesis. Automation systems in use within the CESG include the Mosquito robot and microfluidic Fluidigm systems for crystallization, the Crystal Farm for crystal imaging, the Caliper LC-90 for capillary gel electrophoresis to analyze protein samples, the CFS Protemist DTII for automated cell free protein production in wheat germ extracts, and the Maxwell system for automated magnetic-bead affinity purification of proteins. The CESG has continued to develop its freely available SESAME laboratory information management system (LIMS) to track all aspects of the CESG gene-to-structure pipeline activity, including target listing in the PSI TargetDB and the deposition of extensive reporting information into the PSI Protein Expression, Purification and Crystallization Database (PepcDB). The CESG makes extensive use of the Flexi®Vector plasmids to enable modular vector design with facile transfer of open reading frame gene segments into various different vectors for expression of recombinant fusion proteins as maltose binding protein (MBP) fusions. The SESAME LIMS system informs the CESG team of the various stages reached in the gene-to-structure process. CESG researchers have used Graphviz software to help visualize data connectivity and dependency within the SESAME LIMS, which has helped to speed the development of data reporting tools for the PepcDB. 
Factorial design of experiment studies have been used to improve bacterial culture media with variable lactose and glycerol additives, giving greatly improved protein production following IPTG induction of lactose promoter mediated protein expression in E. coli. CESG researchers in the laboratory of Dr. George Phillips have developed a new method for amino acid assignments in protein structures from low-resolution (3-4 Angstrom) X-ray diffraction data. This approach, ACMI ("Automatic Crystallography Map Interpretation"), uses probabilistic assembly of overlapping pentapeptide models from the PDB to aid in electron density map interpretation. In NMR, the CESG has developed probabilistic tools for fast data collection and for automated processing and analysis of NMR data (HIFI-PINE). The CESG 2007 Wheat Germ Cell-Free Protein Production Workshop will be held on July 22-25, 2007 at the CESG labs in Madison, Wisconsin.

Lance Stewart
ATCG3D: The Accelerated Technologies Center for Gene to 3D Structure

Dr. Lance Stewart, PI of the Accelerated Technologies Center for Gene to 3D Structure, introduced three key technologies being developed within the collaborative framework of the ATCG3D Specialized PSI2 Center. (1) Lyncean Technologies, Inc. (Palo Alto, CA), under the direction of Dr. Ronald Ruth, is constructing a tunable Compact X-ray Light Source that fits in a small laboratory, and will eventually be installed at The Scripps Research Institute (TSRI) in the ATCG3D laboratories of Drs. Peter Kuhn and Raymond Stevens. (2) Nanovolume plug-based microcapillary crystallization technologies are being developed for both soluble and membrane-bound proteins at the University of Chicago under the direction of Dr. Rustem Ismagilov. deCODE biostructures (Dr. Lance Stewart PI, Bainbridge Is., WA) and Micronics, Inc. (Dr. Fred Battrell PI, Redmond, WA) are collaborating with the Ismagilov and TSRI labs to develop inexpensive plastic labcard implementations of microfluidic circuitry for plug-based nanovolume crystallization and in situ X-ray diffraction data collection. (3) deCODE biostructures is developing Gene Composer software tools for computer-aided protein construct engineering (Alignment Viewer and Construct Design Module) and design of "expression optimized" synthetic gene sequences (Protein-to-DNA Module) that can be assembled by PCR methods from overlapping oligonucleotides that are also designed by the Gene Composer software (Gene-to-Oligo Module). deCODE and TSRI have conducted a synthetic gene design study on human GPCRs, comparing the total protein production achieved in E. coli from synthetic gene variants (designed by Gene Composer and DNA 2.0) versus cDNA when expressed as MISTIC-GPCR fusion proteins. The overall MISTIC-GPCR protein production achieved from synthetic gene variants was similar to that achieved with cDNA, suggesting that MISTIC-GPCR production is relatively unaffected by codon usage or any other nucleic acid sequence feature.
Depending on the GPCR used, production of MISTIC-GPCR fusion protein ranged from ~0.5 to 5 mg/liter, with most of the protein being insoluble. Synthetic gene production facilitates gene isolation and has a high chance of being as good as cDNA with respect to protein production. Synthetic gene design with Gene Composer includes codon optimization through the use of codon usage tables specific for multiple different expression hosts (baculovirus, E. coli, etc.), silent restriction enzyme site insertions or deletions, incorporation of "ambush" stop codons, removal of cryptic Shine-Dalgarno sequences, deletion of "slippery" repeat sequences, and others. The end product is an expression optimized gene with a unique sequence that should make cloning and protein expression more efficient. The synthetic genes can be purchased from a number of different vendors (DNA 2.0, Codon Devices and others), or can be produced by PCR methods from overlapping oligonucleotides; these methods deploy mismatch surveillance endonucleases to eliminate any PCR strands with mistakes by cleaving heteroduplexed strands at sites of mismatch. Free demonstration versions of Gene Composer can be downloaded at www.genecomposer.net. In addition, the ATCG3D is hosting on-line workshops for synthetic gene design, with registration and video recordings of the workshops located at genecomposer.net/workshop.
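To make the codon optimization step concrete, the following Python sketch back-translates a protein sequence using one preferred codon per amino acid and then flags unwanted motifs such as a cryptic Shine-Dalgarno-like sequence. It is not the Gene Composer algorithm, whose internals are not described here; the PREFERRED table is an illustrative assumption rather than a measured codon usage table, and the function names are hypothetical.

```python
# Minimal sketch of table-driven back-translation plus motif screening.
# PREFERRED is an illustrative one-codon-per-residue table (an assumption),
# not a measured codon usage table and not Gene Composer's method.
PREFERRED = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}
CODON_TO_AA = {codon: aa for aa, codon in PREFERRED.items()}
STOP = "TAA"


def back_translate(protein):
    """Encode a protein as DNA using the preferred codon for each residue."""
    return "".join(PREFERRED[aa] for aa in protein.upper()) + STOP


def translate(dna):
    """Translate DNA written with the preferred codons; stop at the stop codon."""
    residues = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon == STOP:
            break
        residues.append(CODON_TO_AA[codon])
    return "".join(residues)


def find_motifs(dna, motifs):
    """Return (motif, position) pairs for every occurrence of each motif."""
    hits = []
    for motif in motifs:
        start = dna.find(motif)
        while start != -1:
            hits.append((motif, start))
            start = dna.find(motif, start + 1)
    return hits


if __name__ == "__main__":
    gene = back_translate("MKTAYIAKQR")
    assert translate(gene) == "MKTAYIAKQR"   # the protein sequence is preserved
    # Screen for a Shine-Dalgarno-like motif and an EcoRI site.
    print(gene, find_motifs(gene, ["AGGAGG", "GAATTC"]))
```

A real design tool would weigh codon frequencies per host and resolve flagged motifs by swapping in synonymous codons; the sketch only shows the two checks that make such swaps safe: the translation must be unchanged, and the unwanted motif must be gone.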

Stephanie Mohr
The PSI Materials Repository: Plans and Early Implementation

Stephanie Mohr, Director of the PSI Materials Repository (PSI-MR) at the Harvard Institute of Proteomics (Harvard Medical School, Cambridge, MA, Dr. Joshua LaBaer, PI), presented the mission and mode of operation of the newly formed PSI-MR, which will serve as a centralized storage, maintenance and distribution center for all plasmid clones produced by PSI researchers. The Plasmid Information Database (PlasmID), developed at Harvard, facilitates the on-line search and request of plasmid clones within the PSI-MR. Clone samples are processed in a highly automated, standardized way and stored in a state-of-the-art automated freezer storage system with 2D barcode tracking of all samples. The PSI-MR will sequence validate every clone that it receives from PSI researchers and has established mechanisms for handling any discrepancies of information before any given clone is allowed into the PSI-MR. The plasmids are prepared as frozen DNA samples and also transformed into bacteriophage resistant E. coli hosts for long term frozen glycerol stock storage. Sequence validation of clones is managed by the ACE software for Automated Clone Evaluation, which can design primers for sequencing and can identify any differences between expected and determined sequences. The PSI-MR has established numerous standard operating procedures (SOPs) for the entire informatics system and physical storage operations, including data archival (on-site and off-site). The PSI-MR will allow researchers to request plasmids through an on-line store interface that handles all clone requests. Presently the PSI-MR is working to establish blanket Material Transfer Agreements that will serve as the uniform legal documents allowing transfer of materials from PSI related institutes (both academic and commercial) to the Harvard Institute of Proteomics.
The Material Transfer Agreements are also designed to cover the legal aspects of transferring materials out of the PSI-MR to third party institutions that request samples via the on-line request system. The informatics system can handle various constraints or limitations that may be set on the intended use of the materials (e.g. restrictions on sample transfer may differ for non-profit vs. for-profit institutions). The PSI-MR currently anticipates the need to store more than 100,000 PSI clones in ~120 different vector backbones. The informatics system of the PSI-MR can handle the open reading frame gene sequence separately from the vector sequence in a database architecture that defines parent-child relationships between clones. The PSI-MR hardware includes two robotic -80 °C freezers located next to a colony picker, with a current capacity of 160,000 samples. An associated robot arm can handle plates, tubes, and various other sample formats, all of which are stored on a -20 °C platform. Automated protocols handle all aspects of sample storage entry and retrieval.
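The database ideas mentioned above (insert sequence stored separately from the vector, parent-child links between clones, and use restrictions checked at request time) can be sketched as a small in-memory model. This is a hypothetical illustration only; the class and field names are assumptions and do not reflect the actual PSI-MR or PlasmID schema.

```python
# Hypothetical in-memory model; names do NOT reflect the real PSI-MR schema.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Vector:
    name: str
    backbone_sequence: str


@dataclass(frozen=True)
class Clone:
    clone_id: str
    insert_sequence: str             # ORF stored separately from the vector
    vector: Vector
    parent_id: Optional[str] = None  # clone this one was derived from, if any
    nonprofit_only: bool = False     # example of a use restriction


class CloneRegistry:
    def __init__(self) -> None:
        self._clones: Dict[str, Clone] = {}

    def add(self, clone: Clone) -> None:
        if clone.clone_id in self._clones:
            raise ValueError(f"duplicate clone id: {clone.clone_id}")
        if clone.parent_id is not None and clone.parent_id not in self._clones:
            raise KeyError(f"unknown parent: {clone.parent_id}")
        self._clones[clone.clone_id] = clone

    def children(self, clone_id: str) -> List[str]:
        """Ids of clones derived directly from the given clone."""
        return sorted(c.clone_id for c in self._clones.values()
                      if c.parent_id == clone_id)

    def lineage(self, clone_id: str) -> List[str]:
        """Ids from the given clone back to its original ancestor."""
        chain = []
        current = self._clones[clone_id]
        while True:
            chain.append(current.clone_id)
            if current.parent_id is None:
                return chain
            current = self._clones[current.parent_id]

    def may_ship(self, clone_id: str, requester_is_nonprofit: bool) -> bool:
        """Apply the use restriction recorded for the clone."""
        clone = self._clones[clone_id]
        return requester_is_nonprofit or not clone.nonprofit_only
```

Storing the insert apart from the vector means that moving one ORF into a new backbone creates a child record that reuses the insert sequence, so a sequence correction or a restriction can be traced through the whole lineage.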

David Stuart
Structural Proteomics in Europe

Dr. David Stuart of the Wellcome Trust Centre for Human Genetics (Oxford, UK) presented an overview of structural proteomics in Europe. There are several collaborative protein structure determination efforts in Europe. The Pan-European Framework 6 Programme (FP6) funded efforts in structural proteomics include: SPINE-2 (Structural Proteomics in Europe 2, several PIs focusing on protein-protein complexes, coordinated by Dr. David Stuart at the University of Oxford, UK); VIZIER (several PIs led by Dr. Bruno Canard, focused on viral replication machinery, www.vizier-europe.org); BioXHIT (Biocrystallography on a Highly Integrated Technology Platform for European Structural Genomics, coordinated by Dr. Victor Lamzin, focused on improvements in crystallization, automated data collection and structure determination at European synchrotrons, www.bioxhit.org); eMEP (European Membrane Protein Consortium, coordinated by Dr. Roslyn Bill and focused on improvements in membrane protein structure determination); and 3D repertoire (several PIs coordinated by Dr. Luis Serrano, focused on protein complexes). The Structural Genomics Consortium (SGC) is a Euro-Canadian public-private (industrial) operation managed by Dr. Aled Edwards at the University of Toronto. The SGC is focused on the structure determination of medically relevant families of human proteins and of malarial parasite targets. In the UK there are two SPoRT projects (Structural Proteomics of Rational Targets). The Membrane Protein Structure Initiative (MPSI, www.mpsi.ac.uk) is led by Dr. Neil Isaacs at the University of Glasgow. The multidisciplinary Scottish Structural Proteomics Facility, established at the Universities of Dundee, St. Andrews, Glasgow, and Warwick, is focused on bacterial proteins involved in the biosynthesis of natural products and proteins involved in viral replication. The Swiss NCCR effort on ABC transporter membrane proteins is led by Dr. Markus Grütter.

PIMS and eHTPX are two important integrated data management activities in Europe. The Protein Information Management System (PIMS), managed by Dr. Robert Esnouf (robert.esnouf@strubi.ox.ac.uk), is a commercial-quality, freely available software package capable of storing data on numerous types of experiments (PCR, cloning, expression testing, crystallization, etc.) with defined protocols and plate definitions. A PIMS collaboration with BioXHIT is enabling the sharing of crystallization trial information and associated images. The e-Science Resource for High Throughput Protein Xtallography (eHTPX), managed by Dr. Ian Berry (ian.berry@strubi.ox.ac.uk), aims to create XML standards for communicating protein crystallography information through standard interfaces, allowing uniform transfer of information between structural biology laboratories and synchrotrons.

The gene-to-structure technology improvements from SPINE have been published in the October 2006 issue of Acta Crystallographica Section D (Vol. 62, Part 10). Technologies that have proven to be effective and popular in Europe include: ligase independent cloning (LIC), synthetic genes (e.g. for viral proteins), identification of domains by selection of well expressing clones from gene fragment libraries, small scale expression testing, nanovolume crystallization (Fluidigm and Cartesian systems as well as microcapillary methods), protein refolding from inclusion bodies, reductive methylation of proteins, microseeding of crushed crystals in nanovolume crystallizations, thermofluor screens for identification of small molecule ligand-protein interactions that help establish protein function and improve crystallization, and the development of algorithms to design targeted crystallization screens for classes of proteins.

Other notable technology developments in Europe include the following. (1) As part of the SPINE-2 effort to identify a model thermophilic vertebrate organism as a source of thermostable vertebrate protein targets, SPINE-2 collaborators have selected the thermotolerant annelid worm Alvinella pompejana, and have conducted an expedition to hot springs in Australia to collect fish that can withstand 55 °C. (2) The Oxford Protein Production Facility makes use of an in-house modified In-Fusion® vector suite to easily move ORFs between vector plasmids for E. coli, baculovirus-insect cell, and mammalian cell expression. Baculovirus-insect cell expression is routinely used in structural proteomics in Europe, and systematic co-infections are being used to identify combinations of proteins that form stable soluble complexes. Transient mammalian cell expression in HEK293T cells has been automated and can be high yielding for extracellular glycosylated proteins, which are de-glycosylated with EndoH before being set up in nanovolume crystallization experiments. (3) Two algorithms for disorder prediction, RONN and FoldIndex, are being used to guide construct design. (4) In some cases, large scale libraries of tens of thousands of N- and C-terminal truncations (generated by exonuclease digestion of DNA ends) are screened for soluble expression by incorporation of an N-terminal biotinylation tag which is biotinylated in E. coli by biotin ligase. Soluble protein expression is detected by high density colony spotting on agar plates, followed by in situ detection of the biotinylated peptide tag from soluble protein by immunoblotting.
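The scale of the truncation libraries in item (4) follows from simple combinatorics, which the Python sketch below illustrates by enumerating every N- and C-terminal truncation of a protein as a pair of residue boundaries. The function name and the minimum length cut-off are assumptions made for illustration; the actual libraries are generated enzymatically, not enumerated in software.

```python
# Illustrative enumeration of N-/C-terminal truncation constructs.
# The minimum-length cut-off is a hypothetical parameter.
def truncation_constructs(length, min_length, step=1):
    """Return (start, end) pairs, as half-open residue ranges [start, end),
    for all truncations of a protein of the given length that keep at
    least min_length residues."""
    constructs = []
    for start in range(0, length - min_length + 1, step):
        for end in range(length, start + min_length - 1, -step):
            constructs.append((start, end))
    return constructs


if __name__ == "__main__":
    # A 300-residue protein with fragments of at least 50 residues.
    print(len(truncation_constructs(300, 50)))  # 31626
```

For a 300-residue protein with a 50-residue minimum, single-residue steps already give 31,626 distinct constructs, which is consistent with libraries of tens of thousands of members and explains why a colony-based screen is needed rather than construct-by-construct testing.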

Aled Edwards
EuroCanadian Structural Genomics Consortium

Dr. Aled Edwards, CEO of The Structural Genomics Consortium (SGC) and Professor at the University of Toronto, presented the goals of the SGC to make protein structural information available to the public through a pre-competitive public-private funding mechanism where targets of biomedical interest are selected for structure determination by the SGC scientific advisory board. The SGC is focused on solving the structures of multiple human targets belonging to protein families of medical importance (90% of targets) including some membrane proteins, as well as targets from the malarial parasite Plasmodium falciparum (10% of targets). The SGC has research activities in Canada at the University of Toronto, in the UK at Oxford, and in Sweden at the Karolinska Institute. The SGC has ~170 staff globally, and is currently ahead of schedule for production of protein structures, having deposited 440 structures into the PDB since July 2004, and is currently running at an average of ~190 structures per year.

The SGC efficiency in structure output is the result of the application of both protein crystallography and NMR as methods for structure determination, as well as the use of multiple techniques proven to be useful for structure determination of difficult eukaryotic targets. In addition, the SGC has set quantifiable target goals for structure production at rates that can only be achieved through the development of new technologies. Such technologies include the heavy use of ligands within human enzyme classes (e.g. kinases) where one or more targets is a known drug target. For example, the SGC has assembled a library of known kinase inhibitors for use in differential scanning fluorimetry (FluoDia instrument from PTI) and differential light scattering (StarGazer instrument from Harbinger Biotech) to "fingerprint" ligands that stabilize kinase targets. Most kinase structures (13/17) were crystallized and solved with a ligand bound. Similarly, the SGC has applied small molecule libraries of nucleotide derivatives, metals, co-factors, amino acids, buffers, and other selected small molecules to help identify protein-ligand interactions that can be exploited for improved protein stability, crystallization, and concentration for NMR studies. Application of medicinal chemistry to provide libraries of ligands for co-crystallization with enzymes belonging to the same target family allows economies of knowledge and scale to improve structure output. Approximately 21% of all SGC structures are the result of functional knowledge of the target families, including kinases, phosphatases, histone deacetylases, cyclophilins, oxidoreductases, methyltransferases, sulfotransferases, ubiquitin ligases, oxygenases, and others. Protein-protein co-complexes have also been successfully pursued by the SGC using co-expression to identify protein partners (e.g. SOCS2-ElonginB/C).
The SGC is organized into teams of 8-10 people that work on a basket of ~300 targets each, and have an overall success rate of ~20%.

The SGC usually examines 6-9 different terminally variable constructs per target and also uses rational surface mutagenesis in a limited fashion, with a maximum of 3 sets of surface mutations per target. The SGC has used synthetic genes to improve expression of malarial proteins in E. coli, and reductive methylation to improve crystallization of some targets. A generalized protocol for gene-to-protein work at the SGC involves ligation independent cloning of ORFs into vectors with His6x tags that are removable with TEV protease, for expression in E. coli, P. pastoris, and the baculovirus-insect cell system. Small scale soluble expression testing is used to select clones for further protein production, in which purification is carried out by immobilized metal affinity chromatography (IMAC), gel filtration, and sometimes ion exchange chromatography.

Dr. Edwards is writing a review article for Nature Methods intended to summarize consensus best practices in protein cloning, expression, and purification that have been established by global structural proteomics efforts. He has requested input from the PSI, RIKEN and SPINE operations to aid in the preparation of this review article.

Tuesday March 20, 2007 8:00 AM to 9:30 AM

Session 1 Promoted Abstracts, Technologies for Bottleneck Solutions

Wayne Boucher
The CCPN Data Model and Software for High-Throughput NMR

Dr. Wayne Boucher of the University of Cambridge discussed the CCPN (Collaborative Computing Project for NMR), an open source software project for NMR similar to CCP4 for X-ray crystallography. Currently, NMR software consists of many different programs with no data unity and no basis for task or code sharing; for example, there are various PDB parsers. The aim of CCPN is to provide a data standard that ensures no data is lost when transferring between programs and harvesting data, and that ensures completeness and integrity of data. The data model will be described in UML (Unified Modeling Language), with APIs that implement the model; because the model is format independent, it is easy to maintain as the model changes. Several languages and storage formats will be supported, including XML, Python and SQL. There will be automatic code generation from data object models, and data consistency will be enforced (e.g. if a bond is present there must be two atoms). CCPN will be a user friendly program that will give consistency to NMR data and validation, and it will be easy to maintain and modify. Finally, CCPN will be collaborating with the X-ray crystallographers.
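The kind of consistency rule mentioned above (a bond must connect two atoms) can be sketched as a small data model whose API refuses invalid states and whose storage format is decoupled from the model. This is a hypothetical Python illustration, not the CCPN data model or its generated API; all class and method names are assumptions.

```python
# Hypothetical sketch of an API-enforced data model; NOT the CCPN API.
import json
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Atom:
    name: str
    element: str


@dataclass
class Molecule:
    name: str
    atoms: List[Atom] = field(default_factory=list)
    bonds: List[Tuple[int, int]] = field(default_factory=list)  # atom index pairs

    def add_atom(self, atom: Atom) -> int:
        """Add an atom and return its index."""
        self.atoms.append(atom)
        return len(self.atoms) - 1

    def add_bond(self, i: int, j: int) -> None:
        """A bond is only accepted if it joins two distinct, existing atoms."""
        if i == j:
            raise ValueError("a bond needs two different atoms")
        for index in (i, j):
            if not 0 <= index < len(self.atoms):
                raise ValueError(f"no atom with index {index}")
        self.bonds.append((min(i, j), max(i, j)))

    def to_dict(self) -> dict:
        """Format-neutral representation that any serializer can store."""
        return {
            "name": self.name,
            "atoms": [[a.name, a.element] for a in self.atoms],
            "bonds": [list(b) for b in self.bonds],
        }

    @classmethod
    def from_dict(cls, data: dict) -> "Molecule":
        """Rebuild through the API so the rules are re-checked on load."""
        molecule = cls(data["name"])
        for name, element in data["atoms"]:
            molecule.add_atom(Atom(name, element))
        for i, j in data["bonds"]:
            molecule.add_bond(i, j)
        return molecule


if __name__ == "__main__":
    water = Molecule("water")
    o = water.add_atom(Atom("O", "O"))
    h1 = water.add_atom(Atom("H1", "H"))
    water.add_bond(o, h1)
    restored = Molecule.from_dict(json.loads(json.dumps(water.to_dict())))
    assert restored.to_dict() == water.to_dict()
```

Because loading goes back through the same API as interactive use, a file written by another program cannot introduce a bond without its atoms, which is the property a shared data standard is meant to guarantee.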

Shohei Koide
Protein Engineering Pipeline for Chaperone-Assisted Crystallography.

Dr. Shohei Koide discussed chaperone-assisted crystallography. The chaperones in question are not heat shock proteins but in vitro evolved and selected molecules that bind to the protein targets one would like to crystallize. These chaperones bind to the protein target and may also carry heavy atoms for use in phasing. One such example came from MacKinnon's group, who bound a Fab (antibody fragment) to a membrane protein ion channel. Hybridomas are slow to produce, so the innovation of this project is the use of phage display with in vitro selection to select the best binder to the target. Fab fragments, camelid antibodies and fibronectin are utilized as scaffolds. Each protein scaffold has loop regions that are mutagenized with a two amino acid code (Tyr and Ser), giving a library of ~10^10 members for each static scaffold. One structure of a Fab fragment bound to a protein of interest shows that the protein-protein interface is predominantly Tyr-rich. The process from protein of interest to production of a high affinity antibody (Kd < 1 nM) and testing of the protein-protein interaction takes approximately four weeks. The targets are those from the large structural genomics centers that will not crystallize, and also include membrane proteins and structured RNA. Dr. Koide gave a "high hanging fruit" example as proof of concept: the group crystallized a human growth hormone and receptor complex with a Fab that bound and stabilized the complex. In addition, a membrane protein collaboration has resulted in a crystal structure of full length KcsA, including the C-terminal regulatory domain, bound to a Fab. The structure of full length KcsA was previously unknown. A structure of a functional RNA complexed to a Fab has also been solved, in which one observes beta strand bridging that promotes crystal packing. This is the first structure of an RNA bound to a Fab. Finally, the group produced a fibronectin fragment that binds with high affinity to MBP. These two proteins were fused to promote self-assembly.
Once again the crystal structure of the complex revealed a Tyr-rich protein-protein interface. The bottleneck for chaperone-assisted crystallography is Fab production, which requires trained personnel; otherwise the process is relatively high-throughput.
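The size of a two amino acid code library is easy to work out: with two choices at each randomized loop position, n positions give 2^n sequences. The short Python sketch below shows the arithmetic; the number of randomized positions was not reported in the talk, so the figures here are only an illustration of scale, and the function names are hypothetical.

```python
# Library-size arithmetic for a restricted amino acid code.
# The number of randomized positions is NOT from the talk; illustration only.
def library_size(alphabet_size, positions):
    """Number of distinct sequences with the given choices per position."""
    return alphabet_size ** positions


def positions_for(target_size, alphabet_size=2):
    """Smallest number of positions whose diversity reaches target_size."""
    positions = 0
    size = 1
    while size < target_size:
        size *= alphabet_size
        positions += 1
    return positions


if __name__ == "__main__":
    print(library_size(2, 33))           # 8589934592, just under 10^10
    print(positions_for(10**10))         # 34 binary (Tyr/Ser) positions
    print(positions_for(10**10, 20))     # 8 positions with all 20 amino acids
```

A library of ~10^10 members can therefore exhaustively cover roughly 33 binary positions, whereas the same library would saturate only about 8 positions if all 20 amino acids were allowed, which is one reason a restricted code can sample long loops so thoroughly.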

Yi-Fan Chen
Automated High-Pressure Cryoprotection and Noble Gas Phasing for High-Throughput Protein Structure Determination.

A major bottleneck in high-throughput protein structure determination involves cryoprotection of protein crystals. Cryopreservation is used to reduce radiation damage, but different crystals require different cryoprotectants, and crystals of some proteins will not withstand cryoprotection at all. Dr. Yi-Fan Chen, from Cornell University, described a technique called high-pressure cryoprotection (HPC), which does not require the standard cryoprotectants. To carry out HPC, the crystals are first pressurized to 200 MPa with He and then cooled to 77 K; the pressure is then released and the crystals may be stored in liquid nitrogen. The HPC system that Dr. Chen has currently set up is the size of a refrigerator. Two of the main advantages of this procedure are 1) improvement of diffraction without the use of cryoprotectant, and 2) incorporation of noble gases during HPC for heavy atom or anomalous scattering phasing. Dr. Chen showed an example of improved diffraction with HPC using glucose isomerase. With standard cooling without addition of cryoprotectant, the crystals diffracted poorly to 5 Å, and were mosaic with water rings. After using HPC, the diffraction of the crystals improved to 1.1 Å with low mosaicity. Dr. Chen also described a modified procedure to combine noble gases with HPC. If one is using Xe gas, one must first pressurize to 1 MPa for 15 minutes before releasing the pressure and repressurizing to 2 MPa with He gas; the crystal is then cooled and the pressure released. Dr. Chen speculated that HPC either decreases the cooling rate required to vitrify water or promotes the formation of high density amorphous (HDA) ice. HDA ice diminishes damage due to ice expansion upon cooling. Dr. Chen showed evidence that HPC works through HDA ice by observing diffraction limits and ice rings at varying temperatures after HPC. The group is now working to handle 96 samples at once for HPC.
One solution mentioned was crystallization in capillaries so that one may have all 96 samples together without dehydration and this method would require minimal manipulation.

Mark Dumont
Production of Yeast Transmembrane Proteins for Structural Genomics.

Dr. Mark Dumont of the University of Rochester pointed out the severe lack of three-dimensional structural information for eukaryotic transmembrane proteins in the PDB. Moreover, the integral membrane proteins whose X-ray crystal structures have been solved were mostly derived from natural tissue or bacterial sources, with few successful examples of heterologously expressed membrane proteins leading to structure determination. To address this challenge, Dr. Dumont has applied the power of yeast genetics and biochemistry to develop a homologous yeast expression system for yeast transmembrane proteins, which is intended to ensure proper protein-protein interactions, protein localization, function and expression levels. Target proteins had to have at least two predicted transmembrane segments and could not be part of a hetero-oligomeric complex. Targets were also chosen from a list of 263 predicted membrane proteins that were highly expressed (J. Weissman, Nature 2006, 441:840, and von Heijne, PNAS 2006, 103:11148). Most of these genes were already in a library of 6,000 yeast clones in the Gateway vector system. The genes of interest were moved by LIC into vectors that included LIC sites, a C-terminal ZZ-domain, His10 tags, and a 3C protease cleavage site. The proteins are overexpressed by autoinduction: the glucose level is monitored and, when glucose is depleted, galactose is added. The cells are harvested and lysed with an Avestin Emulsiflex C3 electrically driven high-pressure homogenizer, and cell membranes are harvested by centrifugation. Following a high-salt wash (1.2 M KCl) of the membranes, the transmembrane protein is solubilized in detergent before it is run over an affinity column, where the affinity tag is removed on-column by 3C protease. Dr. Dumont's team purifies approximately 1.7 mg of transmembrane protein from 10 L of culture, so they are trying to improve protein yield by improving various steps.
For example, they are lowering the temperature of protein expression to prevent proteins from going into inclusion bodies. Growth in rich medium sometimes causes loss of plasmids, so they are testing various media, and they use the Avestin Emulsiflex homogenizer to obtain completely lysed cells. They also strip membranes as stringently as possible before solubilization, and test different detergents, in order to obtain one conformation. They find that stripping yields more protein and less aggregation. They also suspect that membrane-associated proteases may co-purify in IMAC. They use a strain carrying a Pep4 protease mutation, which removes the protease that activates other proteases. PMSF is needed to inactivate serine proteases but is not 100% effective, so they use double Pep4 and Prb1 mutants together with PMSF when eluting protein from the Talon column. Dr. Dumont also described tag cleavage and its problems. It takes relatively large amounts of 3C protease to cleave the tag off: 3C protease can cleave the peptide tag from detergent-solubilized protein, but when the protein is still in the membrane, much more 3C protease is needed to achieve complete tag removal. In addition, affinity tags on membrane-bound protein do not bind to the affinity column as well as those on detergent-solubilized protein. The concentration of protein at the end of a preparation was also mentioned as an issue: with a detergent of large micelle size, protein can be lost. Dr. Dumont presented a myriad of methods aimed at producing larger quantities of yeast membrane proteins in a single conformation.

Tuesday March 20, 2007 10:00 AM to 11:30 AM

Session 2 Promoted Abstracts, Technologies for Bottleneck Solutions

Heath Klock
Microscreening Proteins to Improve the JCSG Pipeline Efficiency

Heath Klock of The Scripps Research Institute and the Joint Center for Structural Genomics presented two protein production techniques that are employed in the JCSG pipeline to expedite construct optimization.

Polymerase Incomplete Primer Extension (PIPE) is a fast, simple and flexible method for cloning and mutagenesis. In the cloning case, vector and insert specific primers are designed to facilitate annealing of the PCR products. In the mutagenesis case, primers are designed to create a single annealing site. Substitutions, deletions and insertions are introduced by slight changes in primer design. To date, the method has been used to generate ~9,800 constructs (5,700 full length and 4,100 truncations/mutants).

Microscreening is used in the JCSG pipeline to evaluate protein characteristics and thus provide important bio-analytical data up-front, which can then help drive pipeline efficiency downstream. A large number of constructs can be processed in parallel, and the impact of single mutations and small truncations can be assessed. JCSG microscreening can process 384 constructs per day, with proteins expressed at a 1 ml scale in 96-deep-well blocks. This is 30-fold cheaper and 3-fold faster than production-scale operation. Clarified lysates are purified through a Ni-NTA resin in a 96-well format and protein microeluates (~100 µl total) are evaluated by various bio-analytical methods. Solubility, identity, cleavability and polydispersity are four key factors in determining whether a construct can lead to a structure. A protein assay (Coomassie Plus assay reagent) is used to evaluate protein yield from the microscreen, which is highly correlated with obtaining enough protein for crystal trials during scale-up. Low-yielding constructs can thus be eliminated at an early stage. Liquid chromatography mass spectrometry (LC-MS), with a 1-3 Dalton mass accuracy, is used to verify protein identity and tag cleavage. Both the intact mass of the sample and the cleavage efficiency of the tag are determined. Polydispersity of the sample is evaluated via analytical size-exclusion chromatography (AnSEC). The resulting chromatograms are manually scored on a 1-4 scale. There is a good correlation between the AnSEC score and whether or not a structure was solved, with 83% of targets with an AnSEC score of 4 resulting in structures. Microscreening can therefore be used to identify optimal constructs and to help prioritize which constructs progress through the pipeline.

Craig Bingham
Implementation of PepcDB Reporting at CESG: More Trials, Fewer Tribulations

Dr. Craig Bingham of the University of Wisconsin at Madison and the Center for Eukaryotic Structural Genomics described the latest advances in reporting to the Protein Expression and Purification Database (PepcDB) at CESG. He began by highlighting the importance of PepcDB reporting, in particular because it is a contractual obligation for all PSI-2 centers. PepcDB is an essential vehicle for communication between centers and with the outside world. PepcDB tracks both positive and negative results for all experimental stages. It provides a complete history, including experimental details, text protocols, stop conditions and contact information. The contents of PepcDB will be one of the enduring legacies of the PSI. PepcDB reporting is inherently more complex than TargetDB reporting because it is multi-threaded: targets can be processed along multiple paths and can move backwards as well as forwards in the pipeline. The first implementation of PepcDB reporting at CESG lacked an overall design and basically grew as new code and features were added. In addition, some invalid assumptions were made about how targets could progress through the pipeline. CESG has now redesigned its PepcDB reporting based on the concept that the target workflow can be visualized as a finite, directed, acyclic graph. A Perl variant of Dot, a language for describing graphs, has been used for the implementation, along with Graphviz, a graph visualization tool. The new reporting tools have been successful in coupling the CESG Sesame database with PepcDB via weekly reports and currently capture 68 protocols for 7553 targets, with 14044 trials and 57195 protocol instances.
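To make the graph idea concrete, the following is a minimal sketch in Python (not the Perl variant of Dot used at CESG) of a target workflow held as a directed graph, checked for cycles, and written out as Dot text that Graphviz can render. The stage names and edges are hypothetical and are not taken from the CESG pipeline; how CESG represents targets that move backwards is not described in this summary.

```python
from collections import deque

# Hypothetical pipeline stages and transitions (illustration only,
# not CESG's actual workflow definition).
EDGES = [
    ("cloning", "expression"),
    ("expression", "purification"),
    ("purification", "crystallization"),
    ("purification", "nmr_screening"),
    ("crystallization", "structure"),
    ("nmr_screening", "structure"),
]


def topological_order(edges):
    """Return the stages in dependency order using Kahn's algorithm.

    Raises ValueError if the graph contains a cycle, i.e. is not a DAG.
    """
    nodes = sorted({node for edge in edges for node in edge})
    indegree = {node: 0 for node in nodes}
    children = {node: [] for node in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    queue = deque(node for node in nodes if indegree[node] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(nodes):
        raise ValueError("workflow contains a cycle")
    return order


def to_dot(edges, name="workflow"):
    """Serialize the edges as a Dot 'digraph' that Graphviz can draw."""
    lines = [f"digraph {name} {{"]
    lines += [f"    {src} -> {dst};" for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(topological_order(EDGES))
    print(to_dot(EDGES))
```

The cycle check is the point of the sketch: an invalid assumption about how targets progress shows up as a graph that cannot be ordered, rather than as a silent reporting error.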

Peter Nollert
Advanced Crystal Imaging with the DETECT-X Microscope

Dr. Peter Nollert of deCODE biostructures and the Accelerated Technologies Center for Gene to 3D Structure described the DETECT-X microscope, a fully automated polarization and UV fluorescence microscope for inspecting crystallization trials. Through the analysis of four polarization images (at 0°, 45°, 90°, 135°) the system calculates quantitative values for the transmission, the birefringence, and the extinction angle at each pixel. Each of these values can then be displayed as a false-color image and together these images provide better contrast for identifying crystalline material. They allow the detection of crystals that are masked by precipitate. Spherulites and microcrystalline material can be readily distinguished from amorphous precipitate. As a result, there are fewer false negatives and more potential leads are identified for crystal optimization. Twinned crystals can also be readily identified, so that the best crystals for diffraction experiments can be selected. The use of in situ UV fluorescence microscopy also helps to distinguish between salt crystals and protein crystals.
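How four polarization images can yield three per-pixel quantities may be illustrated with a generic model. The sketch below assumes that the intensity at analyzer angle θ follows I(θ) = T[1 + m cos 2(θ - φ)], so that T stands in for the transmission, m for the strength of the birefringence signal and φ for the extinction angle. This model and the function name are assumptions made for illustration; the actual optics and calculation of the DETECT-X instrument are not described in this summary.

```python
import math


def analyze_pixel(i0, i45, i90, i135):
    """Combine intensities recorded at 0, 45, 90 and 135 degrees.

    Assumes I(theta) = T * (1 + m * cos(2 * (theta - phi))).
    Returns (T, m, phi_in_degrees) with phi in the range [0, 180).
    """
    transmission = (i0 + i45 + i90 + i135) / 4.0
    if transmission == 0:
        return 0.0, 0.0, 0.0
    s1 = i0 - i90      # equals 2 * T * m * cos(2 * phi) under the model
    s2 = i45 - i135    # equals 2 * T * m * sin(2 * phi) under the model
    modulation = math.hypot(s1, s2) / (2.0 * transmission)
    angle = (0.5 * math.degrees(math.atan2(s2, s1))) % 180.0
    return transmission, modulation, angle


if __name__ == "__main__":
    # Synthetic pixel generated from the same model, used only as a check.
    t, m, phi = 100.0, 0.3, 30.0
    values = [t * (1 + m * math.cos(2 * math.radians(a - phi)))
              for a in (0, 45, 90, 135)]
    print(analyze_pixel(*values))
```

Applied to every pixel, each of the three returned values gives one false-color image; amorphous precipitate would be expected to show little modulation, whereas crystalline material would not.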

Wuxian Xi
High Throughput Metal Analysis of Proteins

Dr. Wuxian Xi of the Case Center for Synchrotron Biosciences and the New York Structural Genomics Research Consortium described a high-throughput method for identifying and characterizing transition metal content based on X-ray absorption spectroscopy measurements. The experimental setup at beamline X3B at Brookhaven was described, which included a motorized sample translation stage. In total, 1084 samples for NYSGRC were analyzed using the system and 10% were found to contain metals, with the most prevalent being Zn and Fe. Based on 50 structures that have been determined, the current instrument is 94% accurate. An example was shown where an RNA methyltransferase was found to contain 4.8 Fe per protein molecule, which was validated against a MODBASE model whose template contained an Fe4S4 cluster. In the case of a DNA-3-methyladenine glycosylase, the metal analysis was used to propose a Zn binding site and therefore helped in selecting the better template from 2 homologous structures. The early identification of metals can also be exploited for crystallographic phasing. In the cases processed, 20% of the metalloproteins identified had a theoretical anomalous signal of >3% and could therefore be readily exploited for phasing.
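The summary does not say how the theoretical anomalous signal was calculated. A widely used estimate is the Hendrickson-Teeter expression for the expected Bijvoet ratio, sketched below in Python. The function name is hypothetical, f'' must be looked up for the element and wavelength in question, and the numbers in the usage lines are purely illustrative.

```python
import math

# Effective scattering factor per non-hydrogen protein atom; 6.7 electrons
# is the value commonly used with this estimate.
Z_EFF = 6.7


def bijvoet_ratio(n_anomalous, f_double_prime, n_protein_atoms, z_eff=Z_EFF):
    """Estimate <|dF|>/<|F|> = sqrt(2) * sqrt(N_A) * f'' / (sqrt(N_P) * Z_eff).

    n_anomalous      number of anomalous scatterers (e.g. metal ions)
    f_double_prime   f'' in electrons at the chosen wavelength (from tables)
    n_protein_atoms  number of non-hydrogen protein atoms
    """
    if n_anomalous < 0 or n_protein_atoms <= 0:
        raise ValueError("atom counts must be positive")
    return (math.sqrt(2.0) * math.sqrt(n_anomalous) * f_double_prime
            / (math.sqrt(n_protein_atoms) * z_eff))


if __name__ == "__main__":
    # Illustrative only: one scatterer with f'' = 4.0 e in 1000 protein atoms.
    ratio = bijvoet_ratio(1, 4.0, 1000)
    print(f"expected anomalous signal: {100 * ratio:.1f}%")
```

Because the ratio grows with the square root of the number of scatterers and falls with the square root of protein size, a metal count per molecule obtained early in the pipeline is enough to judge whether a target clears a threshold such as 3%.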

Tuesday March 20, 2007 12:30 PM to 2:00 PM

Session 3 Promoted Abstracts, Technologies for Bottleneck Solutions

Frank Collart
Domain Boundary Approaches to Improve Protein Solubility

Dr. Frank Collart of Argonne National Laboratory and the Midwest Center for Structural Genomics presented a systematic, fine-grained domain boundary localization approach applied to several hundred MCSG targets, with the goal of improving the chances of identifying well-behaved protein constructs for structural studies by X-ray crystallography. In order to identify candidate protein domains within target proteins, the MCSG uses a library of Hidden Markov Model profiles (HMMs) built for representative sequences from each Pfam domain (Pfam HMM). To improve the expression/solubility outcome, and to expand general knowledge of proteins, the MCSG research team has evaluated various domain boundary extension approaches applied to targets in the MCSG high-throughput cloning and expression pipeline. They established a uniform algorithmic method to define probable domain boundaries and overlay them with secondary structure predictions. This allowed the identification of domain boundary regions where predicted secondary structure elements extend across the boundary. In these cases, the MCSG generated four different engineered constructs for comparative soluble protein expression studies: (i) the full-length open reading frame protein; (ii) the Pfam HMM domain; (iii) the Pfam HMM domain extended N-terminally by 1-30 residues to include any contiguous predicted secondary structure; and (iv) the Pfam HMM domain extended C-terminally to the end of the open reading frame. Analysis of the solubility outcome for hundreds of constructs demonstrated that soluble protein production was achieved in approximately equal proportion for the full-length proteins and the domain group. Analysis of soluble outcomes of the three different domain groups revealed that the domain extension strategy can improve the identification of soluble protein constructs by ~25-40% as compared to the predicted domain on its own.
These preliminary results indicate that generating constructs based on systematic variation in the Pfam HMM domain boundaries provides an effective strategy for producing soluble domain products and for efficient coverage of the represented Pfams.
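The four construct types described in this talk can be sketched as a small Python function. The input format (a per-residue secondary structure string using H, E and C) and all names are assumptions made for illustration; the MCSG's own boundary algorithm is not given in this summary.

```python
def design_constructs(ss_prediction, domain_start, domain_end, max_extension=30):
    """Return 1-based inclusive residue ranges for the four construct types.

    ss_prediction  per-residue predicted secondary structure for the whole
                   ORF: 'H' (helix), 'E' (strand) or 'C' (coil)
    domain_start, domain_end  Pfam HMM domain boundaries (1-based, inclusive)
    max_extension  maximum number of residues added at the N-terminus
    """
    orf_length = len(ss_prediction)
    if not 1 <= domain_start <= domain_end <= orf_length:
        raise ValueError("domain boundaries must lie within the ORF")

    # Walk upstream from the domain start while the preceding residue is
    # still part of a contiguous predicted helix or strand.
    n_start = domain_start
    while (n_start > 1
           and domain_start - (n_start - 1) <= max_extension
           and ss_prediction[n_start - 2] in "HE"):
        n_start -= 1

    return {
        "full_length": (1, orf_length),
        "domain": (domain_start, domain_end),
        "domain_n_extended": (n_start, domain_end),
        "domain_c_extended": (domain_start, orf_length),
    }


if __name__ == "__main__":
    # Hypothetical 35-residue ORF whose domain (residues 11-30) is preceded
    # by a predicted helix spanning residues 6-10.
    prediction = "CCCCC" + "HHHHH" + "E" * 20 + "CCCCC"
    for name, span in design_constructs(prediction, 11, 30).items():
        print(name, span)
```

If no predicted helix or strand adjoins the domain start, the N-terminally extended range in this sketch is identical to the plain domain range, so such a target would not yield a distinct third construct.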

Seema Sharma
Optimizing Protein Constructs by H/D Exchange Mass Spectrometry

Dr. Seema Sharma of the Robert Wood Johnson Medical School and the Northeast Structural Genomics Consortium presented the use of hydrogen/deuterium exchange mass spectrometry (H/DXMS), in collaboration with ExSAR Corporation, on proteins selected for structure determination by the NESG. The H/DXMS data have provided information on the domain structure of NESG targets so that better-behaved protein constructs can be produced for Nuclear Magnetic Resonance (NMR) by removing highly disordered regions in the proteins. Deuterium exchange of backbone amide protons was conducted based on the protocol outlined by Spraggon et al. (Protein Sci. 2004, 13: 3187-3199). An immobilized pepsin column was used for online protein digestion and was placed in line with the C18 column used for peptide separation. Peptides were identified by running the ion-trap mass spectrometer in MS/MS mode, while H/D exchange mass shifts were analyzed by collecting the data in full-scan mode. The deuteration levels detected at set time points were normalized to a completely exchanged sample for which the deuterium exchange reaction was allowed to proceed for 24 hours under denaturing conditions (0.5% TFA in D2O). Using a series of test proteins whose NMR structures had been previously determined, it was shown that H/DXMS-derived data on the locations of flexible protein regions are in excellent agreement with heteronuclear Nuclear Overhauser Effect (het-NOE) experimental data. Importantly, obtaining data on disordered protein regions by H/DXMS requires only microgram quantities, as compared to the milligram quantities required for het-NOE experiments. It was also shown that construct engineering to remove disordered regions identified by H/DXMS does not disturb NMR resonance frequencies in the ordered regions of the test proteins, and provides samples more suitable for rapid analysis of NMR assignments and 3D structures.
Over time, data from H/DXMS experiments will allow the construction of a database of experimentally determined flexible regions in proteins which should help guide more efficient protein construct design for NMR studies by allowing the prediction and elimination of disordered regions in protein constructs.
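The normalization step described in this talk, in which each peptide's mass shift is expressed relative to a completely exchanged control, can be written as a short calculation. The Python sketch below is illustrative only: the 80% cut-off, the peptide labels and the masses are hypothetical and do not come from the NESG data.

```python
def percent_deuteration(mass_t, mass_undeuterated, mass_fully_exchanged):
    """Deuteration level at one time point, normalized to the fully
    exchanged control (0% = no exchange, 100% = complete exchange)."""
    span = mass_fully_exchanged - mass_undeuterated
    if span <= 0:
        raise ValueError("fully exchanged mass must exceed undeuterated mass")
    return 100.0 * (mass_t - mass_undeuterated) / span


def flexible_peptides(peptides, threshold=80.0):
    """Return names of peptides whose normalized deuteration at an early
    time point reaches the (hypothetical) threshold, suggesting disorder.

    peptides: iterable of (name, mass_t, mass_undeuterated, mass_full)
    """
    return [name for name, m_t, m_0, m_full in peptides
            if percent_deuteration(m_t, m_0, m_full) >= threshold]


if __name__ == "__main__":
    # Hypothetical centroid masses in Da for two peptic peptides.
    data = [
        ("residues 1-12", 1009.0, 1000.0, 1010.0),   # 90% exchanged
        ("residues 13-25", 1502.0, 1500.0, 1512.0),  # about 17% exchanged
    ]
    print(flexible_peptides(data))
```

Peptides that exchange almost fully at the earliest time point are candidates for removal when redesigning a construct, whereas slowly exchanging peptides are likely to be part of the ordered core.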

Masayori Inouye
Single Protein Production in Living Cells

Dr. Masayori Inouye of the Department of Biochemistry at the Robert Wood Johnson Medical School presented an innovative Single Protein Production (SPP) system designed to produce only a single protein of interest in living cells without producing any other cellular proteins. In order to improve the efficiency of labeling proteins for NMR studies, Inouye and colleagues have engineered E. coli and yeast to carry out the inducible expression of "ACA-less" synthetic genes encoding the protein of interest, together with MazF, an mRNA interferase that functions as an ACA-specific endoribonuclease and cleaves unstructured single-stranded RNA molecules containing the "ACA" sequence. MazF induction results in complete cell growth arrest. However, if the mRNA for a protein of interest is engineered to be devoid of ACA and is induced in MazF-expressing cells, only the protein from this mRNA is produced, at a yield of 20 to 30% of total cellular protein, in the almost complete absence of any other cellular protein synthesis. Since most mRNA molecules naturally contain at least one ACA sequence, the production of MazF in cells causes the destruction of all mRNAs except those that lack the "ACA" sequence. With this approach, MazF is used to convert the E. coli cells into a "quasi-dormant" state, lacking all mRNA except for the mRNA encoded by the synthetic "ACA-less" gene. The cells in this growth-arrested state can be concentrated and added to labeled medium, which results in the specific labeling of only the protein of interest encoded by the synthetic "ACA-less" gene. Protein yields are unaffected even if the culture is condensed up to 40-fold, which reduces the cost of expensive materials used for protein production (e.g. 15N- and 13C-labeled compounds) by 97.5%.
This MazF procedure for producing only one protein of interest in vivo requires that the gene of interest be engineered to silently eliminate all "ACA" nucleotide combinations. Conveniently, the degeneracy of the genetic code allows all possible ACA coding sequences to be changed to alternative synonymous coding sequences. As such, Inouye and colleagues suggest that all synthetic genes be made to eliminate ACA so that they may be compatible with the SPP system. The SPP technology has been applied to the labeling of integral membrane proteins such as the KcsA ion channel, allowing whole cell NMR studies. In addition the Inouye lab is constructing vector systems that allow induction of MazF gene expression in E. coli, mammalian, and yeast systems such that it is in concert with the induction of the expression of the gene of interest.
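The silent removal of ACA triplets can be illustrated with a short greedy routine. The Python sketch below is not the Inouye lab's design procedure: it uses the standard genetic code, ignores codon usage preferences, and simply swaps in the first synonymous codon that lowers the number of ACA occurrences, whether the triplet sits in frame or spans two codons.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a coding sequence whose length is a multiple of three."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def count_aca(seq):
    """Count ACA occurrences in any frame, including overlapping ones."""
    return sum(seq.startswith("ACA", i) for i in range(len(seq) - 2))


def remove_aca(seq):
    """Greedily replace codons with synonymous ones until no ACA remains."""
    seq = seq.upper()
    if len(seq) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    while True:
        pos = seq.find("ACA")
        if pos == -1:
            return seq
        current = count_aca(seq)
        replaced = False
        # The triplet overlaps at most two codons; try each in turn.
        for idx in sorted({pos // 3, (pos + 2) // 3}):
            codon = seq[3 * idx:3 * idx + 3]
            for alt, aa in CODON_TABLE.items():
                if alt == codon or aa != CODON_TABLE[codon]:
                    continue
                candidate = seq[:3 * idx] + alt + seq[3 * idx + 3:]
                if count_aca(candidate) < current:
                    seq, replaced = candidate, True
                    break
            if replaced:
                break
        if not replaced:
            raise ValueError(f"could not remove ACA at position {pos}")


if __name__ == "__main__":
    original = "ATGACACACATTTAA"  # Met-Thr-His-Ile-stop, three overlapping ACAs
    cleaned = remove_aca(original)
    print(cleaned, translate(cleaned) == translate(original))
```

Every accepted substitution strictly lowers the ACA count, so the loop terminates. A production design tool would also weigh codon usage and mRNA structure, which this sketch does not attempt.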

Michael Sauder
Optimization of SeMet-Labeling in E. coli for High-Throughput Protein Production and Structure Determination

Dr. Michael Sauder of SGX Pharmaceuticals and the New York Structural Genomix Research Center (NYSGXRC) presented studies on the efficiency of SeMet protein labeling, used to identify improved parameters for cost-efficient and complete SeMet incorporation into proteins for crystallographic structure determination. The NYSGXRC has recently shifted to a mode of operation where all proteins are produced for crystallization trials with SeMet label so that novel protein structures can be solved from single MAD X-ray diffraction data sets. Previously the pipeline had been configured to make protein first without SeMet; once diffraction-quality crystals had been produced, new SeMet-labeled protein and crystals would be produced for SAD or MAD structure determination. Given the high cost of SeMet substrate, the economics of this new approach required that the NYSGXRC conduct studies on the efficiency of SeMet incorporation. Using the Medicilon HY ("Pink") medium pioneered by the MCSG, the NYSGXRC has established a general protocol in which E. coli strains are grown at 37 °C to an OD595 of 1.2, followed by lowering of the temperature to 22 °C and adjustment to 60 mg/l of SeMet (added as a buffered stock). After 20 min the IPTG inducer is added to 1 mM and cultures are run at 22 °C for 21 hours before harvesting for protein production. Quantification of SeMet label incorporation by electrospray mass spectrometry of purified protein samples revealed that proteins with large numbers of methionine residues are often not fully labeled with SeMet using the standard protocol with 60 mg/l SeMet. Increasing the SeMet concentration to 90 mg/l gave essentially complete labeling of most proteins, except those with more than 10 methionine residues. Increasing it further to 120 mg/l greatly improved the labeling of proteins with more than 10 methionines, to nearly complete.
The NYSGXRC is tracking all of its protein expression information in an internal database that can report the information directly to PepcDB. The system can produce work-lists at the beginning of each week, including the printing of labels for application to various sample tubes.
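The mass spectrometric quantification mentioned above rests on simple arithmetic: each methionine replaced by selenomethionine raises the protein mass by the difference between the average atomic masses of selenium and sulfur, about 46.9 Da. The Python sketch below shows that calculation; the function name and the example masses are hypothetical, and the NYSGXRC's own analysis procedure is not described in this summary.

```python
# Average atomic masses in Da: Se 78.971, S 32.06.
SE_MINUS_S = 78.971 - 32.06  # about 46.9 Da per Met -> SeMet substitution


def semet_incorporation(observed_mass, unlabeled_mass, n_met):
    """Fraction of methionine sites carrying selenium, estimated from the
    intact-mass shift between labeled and unlabeled protein.

    n_met is the number of methionines in the protein as purified; the
    initiator methionine should be excluded if it has been removed.
    """
    if n_met <= 0:
        raise ValueError("protein must contain at least one methionine")
    return (observed_mass - unlabeled_mass) / (n_met * SE_MINUS_S)


if __name__ == "__main__":
    # Hypothetical 25 kDa protein with 8 methionines, 6 of them labeled.
    unlabeled = 25000.0
    observed = unlabeled + 6 * SE_MINUS_S
    print(f"{100 * semet_incorporation(observed, unlabeled, 8):.0f}% labeled")
```

With a mass accuracy of a few Dalton, a single unlabeled site is readily resolved, which is how incomplete labeling of methionine-rich proteins would show up in this kind of analysis.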

Tuesday March 20, 2007 2:30 PM to 4:00 PM

Session 4 Promoted Abstracts, Technologies for Bottleneck Solutions

Ronnie Frederick
Three-Part Small-Scale Screening Platform for the Masses

Dr. Ronnie Frederick of the Center for Eukaryotic Structural Genomics presented three technologies that have been incorporated into their screening pipeline over the past year to improve efficiency and increase protein yields. These include: 1) factorial evolution of auto induction medium to improve growth and expression; 2) expression vector engineering to better match performance in small and large scale trials; and 3) application of automated methods to screen and prepare highly purified proteins. Factorial evolution methods were used to guide the manipulation of carbon sources in the induction medium and the alteration of the promoter level for the Lac repressor leading to a close correlation of small scale versus large scale growths and ultimately to an increase in total protein expression. Dr. Frederick also presented modifications to Gateway and FlexiVector expression vectors to include a linker module containing the TVMV protease prior to the tagged target protein. The engineered vectors allow for constitutive expression of TVMV, with subsequent in vivo proteolysis of the tagged target protein. In addition, these vectors reduce the number of handling steps required to determine whether a target released by proteolysis will retain sufficient solubility to be successfully purified. Finally, data was presented that combined the improved auto induction medium and vectors with the use of the Maxwell 16 Personal Automation System (Promega) to investigate the automated purification utilizing the His8 affinity tag within the vector. Sixteen 1mL bacterial cultures that were produced utilizing the improved auto induction medium and expression vectors were purified in 1 hour on the automated system. These samples yielded enough material to be used for functional analysis, small scale crystallization trials, or 1H, 15N-HSQC NMR.

Liang Li
Microfluidic Hybrid Method For Membrane Protein Crystallization

Dr. Liang Li of the University of Chicago and the Accelerated Technologies Center for Gene to 3D Structure (ATCG3D) presented a novel application of plug-based nanovolume microfluidic crystallization to generate membrane protein crystals. The application combines sparse-matrix screening and optimization gradient screening into one simple experiment utilizing nanoliter-sized plugs to minimize sample consumption. Data was presented showing how large plugs of crystallization cocktail are combined with a protein sample and a dilution buffer to generate multiple 10 nL plugs in a microfluidic device for crystallization trials. The concentrations of each reagent were controlled by a computer subroutine and indexed with plug size. This screening method was applied to the crystallization of membrane proteins, which present the added challenges of handling the protein-detergent complex and viscous solutions within the microfluidic device. These challenges were overcome through the use of perfluoroamines as carrier fluids and the use of Teflon capillaries for the formation, transport, and storage of plugs. High-quality crystals of porin from R. capsulatus and the photosynthetic reaction center from Rhodopseudomonas viridis were obtained using this application.

Alexander McPherson
Searching For Silver Bullets: A Strategy For Crystallizing Proteins

The hypothesis that "various small molecules might establish stabilizing, intermolecular, non-covalent cross links in protein crystals and thereby promote lattice formation", was presented by Dr. Alex McPherson from the University of California, Irvine and the Center for High-Throughput Structural Biology (CHTSB). Three separate studies were undertaken to test the effectiveness of 200 chemical additives on their ability to promote the crystallization of 81 different proteins and viruses. The experimental design was novel in its simplicity. To reduce variables and truly test the ability of the small molecules to promote lattice formation, the experiments were set up using only two precipitating agents: 1) 30% PEG 3350 and 2) 50% Tacsimate, both buffered at pH 7.0. The 200 chemicals were combined to form reagent mixes containing from 3 to 20 different chemicals. Overall, 65 of the 81 samples used in these experiments were crystallized. It is important to note that 35 of the 65 samples required one or more of the chemical additives to crystallize, which demonstrates the importance of these chemicals to promote crystallization.

The results indicate that polyvalent, charged groups (di- and tri-carboxylic acids, diamino compounds, molecules with sulfonyl or phosphate groups) and common biochemicals (coenzymes, biological effectors and ligands) were the two most promising types of reagent mixes. Less promising results were observed with osmolytes, polyamines, detergents and sugars. X-ray diffraction validation of the specific structural interactions of these small molecules and their role in crystallization is described for nine biological macromolecules, lending support to the original hypothesis. A group of "silver bullet" cocktails will be integrated into the 1536 screening solutions used by the high-throughput crystallization laboratory at the Hauptman-Woodward Institute within the CHTSB.

Joseph Luft
Crystallization Experiments Designed to Generate Phase Data

Joseph Luft of the Hauptman-Woodward Medical Research Institute and the Center for High-Throughput Structural Biology (CHTSB) presented a novel method to rapidly define crystallization phase diagram data using a combination of varied crystallization Drop protein/crystallant Volume Ratio and incubation Temperature (DVR/T). The DVR/T method has been effectively applied to optimize crystallization conditions for a number of samples, and its power is fully realized when used in conjunction with automated liquid handling systems.

Data was presented for a number of protein sample optimization trials showing that, while conceptually simple, DVR/T experiments are chemically complex. It was stressed that a detailed understanding of the chemistry is not required to make empirical use of the data to devise rational second-tier optimization experiments. Even when an optimization experiment did not lead to a clearly defined crystalline outcome, DVR/T provided a means to identify chemical types and concentrations, protein concentrations, and temperatures that specifically defined regions of the protein/cocktail phase diagram. It was shown that, while the method is not quantitative, a rough sketch of the protein/cocktail phase behavior can be generated with a minimal amount of sample (~25 µL), enabling the design of rational second-tier optimization experiments. The results presented indicated that, when successful, this approach increases throughput not by increasing the number of experiments going through the pipeline, but by deriving knowledge from experiments that failed to produce crystals. Hence, downstream optimization experiments can be rationally designed by making use of data that is often discarded.
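The way a DVR/T experiment samples the phase diagram can be sketched as a small grid calculation: each drop volume ratio sets a different starting protein concentration and cocktail dilution, and each of these is incubated at several temperatures. In the Python sketch below, the ratios, temperatures and stock concentration are hypothetical values chosen for illustration and are not taken from the talk.

```python
from itertools import product


def dvr_t_grid(protein_stock, ratios, temperatures):
    """Enumerate drop conditions for a DVR/T experiment.

    protein_stock  protein stock concentration (e.g. mg/ml)
    ratios         (protein_volume, cocktail_volume) pairs, any volume unit
    temperatures   incubation temperatures in degrees Celsius

    Returns one dict per drop with the initial protein concentration and the
    initial cocktail strength as a fraction of the undiluted cocktail.
    """
    grid = []
    for (v_protein, v_cocktail), temperature in product(ratios, temperatures):
        total = v_protein + v_cocktail
        grid.append({
            "ratio": f"{v_protein}:{v_cocktail}",
            "temperature_c": temperature,
            "protein_conc": protein_stock * v_protein / total,
            "cocktail_fraction": v_cocktail / total,
        })
    return grid


if __name__ == "__main__":
    # Hypothetical design: three volume ratios at two temperatures.
    for drop in dvr_t_grid(10.0, [(1, 2), (1, 1), (2, 1)], [4, 23]):
        print(drop)
```

Scoring each drop as clear, precipitated or crystalline and plotting the outcome against these two concentrations, one panel per temperature, gives the kind of rough phase sketch described above.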