Protein Structure Initiative Workshop on Target Selection

November 13-14, 2003

Session I: Genome Coverage Based Target Selection

Session I was concerned with the genomic coverage (i.e., comprehensive uniform coverage) of target selection. The speakers provided background information encompassing the structural, functional, and evolutionary space of the protein universe, and addressed the key underlying questions of "How many structures will be needed to cover a high fraction of prokaryotic and eukaryotic protein families?" and "Would Pfam5000 for protein family prioritization be an effective target strategy?"

In his target selection overview, Chris Sander highlighted the goals of the PSI with respect to both coverage of the protein universe and biomedical relevance. Achieving these goals will require structure determination at different levels of granularity (i.e., more structures for large, diverse, and functionally important protein families such as protein kinases) and improvement in the quality of homology modeling to cover broader sequence space. Based on the SCOP structural domains and superfamily classification, Cyrus Chothia showed that (i) structural assignment is possible for about 60% of sequences (or 50% of residues), (ii) the distribution of family sizes obeys a power law in which a small number of superfamilies cover a large part of protein space, and (iii) a large fraction (about 75%) of superfamilies have a broad evolutionary distribution, spanning both prokaryotes and eukaryotes. Andrej Sali presented the MODBASE database of comparative structure models (which provides fold assignments for about 56% of all proteins), described a strategy to exploit structurally unknown (un-modeled) families for target selection, and estimated that the number of new structures needed to cover all 6190 PfamA families ranges from 4000 (allowing for any level of similarity) to 25,000 (with a threshold of at least 80 aa at 30% sequence identity).

Probing fold and function space, Christine Orengo mapped Gene3D protein families (of full-length proteins) to CATH and Pfam domains and observed 70% proteome coverage by the largest 2500 superfamilies. Targeting such universal, recurrent, functionally diverse, and biologically significant superfamilies may facilitate our understanding of fold-function relationships and improve function prediction methods. Burkhard Rost presented an integrated target selection strategy that involves automatic identification of domain regions with no homology models and manual refinement to select targets with poor models or predicted novel function, followed by functional annotation of new structures based on structural features. With the coverage of the 8000 largest superfamilies, it was estimated that 50% of residues would remain un-modeled. John Moult discussed several issues for protein family choices, including the consideration of different thresholds, the quality and validity of models, projected family size in future years, and the value of determining structures for small protein families and singletons. To achieve comprehensive and uniform coverage, Steven Brenner proposed using Pfam5000, indexed to the highly curated and well recognized Pfam database, to track the most important protein families as targets. Starting with the largest families and supplemented with other criteria (such as the human proteome), the index can be updated to reflect various requirements and changing priorities, and can complement the alternative single-genome strategy.

Several common themes emerged from Session I. Targeting large superfamilies is generally accepted as a sound strategy based on the observations that protein family distribution follows a power law and that large families are often evolutionarily conserved and functionally important. It is also generally agreed that the number of structures needed depends on the quality of model building and threshold conditions, and that a finer granularity is required for large and diverse superfamilies to explore structure-function relationships. The different estimates, all based on the current fold space of non-randomly sampled protein domains, converge on a rough approximation: between 10,000 and 20,000 structures seem to be required to cover 50% of the amino acid residues in the entire universe of proteins.
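The power-law argument can be made concrete with a small numerical sketch. The Python below is an illustration only, not an analysis presented at the workshop: the function names, the Zipf-like size model, the exponent, and the number of families are all hypothetical, and each sequence is assumed to belong to exactly one family.

```python
def coverage_by_largest(family_sizes, top_n):
    """Fraction of all sequences that fall in the top_n largest families."""
    ranked = sorted(family_sizes, reverse=True)
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


def zipf_family_sizes(n_families, largest=10_000, exponent=1.0):
    """Hypothetical power-law sizes: size(rank) ~ largest / rank**exponent.

    Every family is given at least one member.
    """
    return [
        max(1, round(largest / rank ** exponent))
        for rank in range(1, n_families + 1)
    ]


if __name__ == "__main__":
    # Assumed toy universe of 50,000 families; not real Pfam or SCOP data.
    sizes = zipf_family_sizes(n_families=50_000)
    for n in (500, 2_500, 5_000):
        frac = coverage_by_largest(sizes, n)
        print(f"largest {n:>5} families cover {frac:.0%} of sequences")
```

Under a distribution of this shape, coverage rises steeply for the first few thousand families and then flattens, which is the qualitative behavior behind the preference for large families. The actual percentages depend entirely on the assumed exponent and family count.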

Session II: Biomedical Interest in Target Selection

Opening speaker Wayne Hendrickson offered a definition of structural genomics as "structure determination on a pan-genome scale aiming to provide representative structures for all sequence families in all living organisms." He also proposed genome-informed structural biology as a way to conceptualize target selection.  In this paradigm rapid structure determination is used to inspire the definition of function, and structure leads, but does not ignore, function.  He reviewed the application of these concepts to a number of specific systems, as well as to general areas such as drug discovery.

David Eisenberg discussed the limitations of relying entirely on homology-based methods to select targets, especially in the case of protein complexes.  He pointed out that many and perhaps most proteins function in complexes made up of non-homologous members.  In addition, proteins are often crystallizable only with their functional partners.  He reviewed his group's work on identifying subunits of protein complexes by analyzing the co-evolution of non-homologous proteins.  He described four methods to infer non-homologous protein pairs that have co-evolved and hence are functionally linked, and showed how the union of these methods was a powerful predictor for complexes composed of non-identical subunits.

Wim Hol discussed how to optimize target selection for structure-based drug design and related research.  He emphasized that there is no universal target selection strategy, but that choices depend on the degree of interest.  In addition, there is tremendous variety in how drugs act, and in how they cause side effects.  It has been tentatively estimated that there may be 2 million or so binding sites in the human proteome and transcriptome.  Dr. Hol also discussed the use of existing databases to narrow choices, for example by searching patent databases to identify proteins for which inhibitors that could serve as lead compounds are already in hand.  In conclusion, he emphasized that in his view the NIH should focus on Structural Genomics of Human Proteins coupled with Structural Genomics of key proteins and nucleic acids from infectious organisms.

Susan Taylor presented the ensemble of protein kinases – the kinome – as a case study of the issues in selecting targets in an extremely diverse superfamily of great biomedical relevance.

Sung-Hou Kim discussed the experience of the BSGC in using a single organism approach to high throughput structural genomics. A major objective was to provide a global view of one proteome.  This would involve assembling a near-complete structural complement of a minimal organism that might comprise a "core" minimal set of protein architecture.  They also sought to understand the "mapping" of a proteome in protein fold space.

Adam Godzik discussed the mouse genome as a target, especially as a model for the human genome.  Some of the most severe difficulties arise from the domain structures of many mouse (and human) proteins.  Whole proteins are very tough structural targets, but the domain boundaries cannot be predicted accurately, and some domains may not be recognized at all.  He suggested the "bacterialization of the mouse proteome".  Bacterial homologues can be found for ~65% of the mouse proteins, and the efficiency of this process is likely to improve.  Bacterialization may help, but many mouse protein targets will have to be studied directly.

John Markley presented considerations about a plant genome, Arabidopsis, where intrinsic biomedical interest is relatively low.  However, the genome is relatively uncomplicated and is likely to contain new folds among its 29,000 genes.  Furthermore, complete expression profiles (all ORFs) were recently published, providing a large field of opportunity for elucidating new fold-function relationships.  The genome codes for proteins that carry out unusual biological and biochemical processes, such as tissue remodeling, novel defense mechanisms, regulation of the cycle of yearly cellular processes, and biosynthesis of natural products.  Finally, there is an opportunity to collaborate with the large, highly organized Arabidopsis community, including the NSF 2010 Program, which plans to determine the functions of all Arabidopsis proteins.

Session III: Problem areas in Target Selection; Metrics and Milestones

Jack Greenblatt addressed the experimental approaches needed to systematically purify and identify the protein complexes of yeast and E. coli. An example presented was RNA polymerase II and the associated proteins that form the generalized transcription apparatus.  Results indicate that ~90% of the polypeptides involved in transcription and RNA processing are organized into stable protein complexes, forming a huge network involved in functional pathways.  These results clearly indicate long range future directions for the PSI.

B.C. Wang anticipated that the SECSG Center will have the technologies needed to eventually cover about half of the big families based on Pfam analysis, using P. furiosus, C. elegans and human systems as the basis of integration.  He recommended that R/D (research and development), along with production, remain an integral part of the large structural genomics centers in the future. In addition, the large Centers should incorporate both prokaryotic and eukaryotic systems together with R/D as integral parts of their operations.  This will be the most effective means for reaching the dual goals of large family coverage and biologically significant structures.

Doug Rees discussed recent developments in the analysis of membrane proteins.  Based on sequence analyses of complete genomes, approximately 20% of all proteins are predicted to be integral membrane proteins, with somewhat more than half containing two or more membrane spanning helices.  Despite their functional significance, only ~40 unique structures have been determined to date, which reflects the experimental challenges of mimicking the heterogeneous environment of the membrane with purified components.  A number of promising strategies have been developed, however, that are leading to an increasing number of membrane protein structure determinations.  A rough estimate of the success rate for solving the structure of a particular membrane protein is ~10%, which is comparable to the experience reported for water-soluble proteins from higher eukaryotes.  The greatest challenges to the structure determination of membrane proteins include developing effective approaches for the overexpression of eukaryotic membrane proteins, establishing more systematic approaches for the effective solubilization of membrane proteins from the bilayer, and stabilizing specific conformational states of membrane proteins when extracted from the membrane.

Stephen Burley's presentation focused on the need to standardize reporting of productivity metrics by the PSI centers.  Specifically, he argued that an agreed set of both cumulative and quarterly productivity statistics should be reported on each PSI center web page in a standard format, effective December 1, 2003 (providing statistics for Q1 of grant year 4, 9/1/03-11/30/03). A proposed standard set of productivity statistics was presented and comments were invited from the other PSI center PIs. In closing, arguments were presented in favor of establishing and monitoring milestones and timelines so that both the PIs and the NIGMS can effectively track progress and ensure that each center remains on track during the last two years of Phase I and Phase II.


1. Defining the Problem

It is clear that it would be an impossibly large task to try to determine all structures coded by all known genomes.  Likewise, it would be impossible to determine structures of one or more representatives of each known family of proteins.  The estimated size of the task differs dramatically depending on the sequence clustering algorithms used and the granularity of the structural coverage of protein families.  At one end, Christine Orengo calculated that 170,000 experimental structures would be needed to enable reliable homology modeling of all the sequenced genes that do not have significant homology (<35% identity) to any known structures in the PDB.  This approach represents fine granularity of sequence family coverage and is impossible to achieve with the current technology, resources, and timeline.  At the other end, fewer than 2,500 experimental structures would be needed to cover 70% of the proteomes with coarse granularity (one representative from each super-family).  Burkhard Rost estimated that targeting the biggest 8,000 families using his clustering algorithm will provide 67% coverage of all the sequenced genes.  Steven Brenner offered a detailed evaluation of targeting the 5,000 largest Pfam families (Pfam5000).  The Pfam database collects the largest protein families and is well known by the biological community at large.  It is annotated and well maintained, and could provide a tractable and well defined scope for the next phase of the PSI.  Pfam5000 provides 75% coverage of the sequences in the SwissProt and TrEMBL databases.  It was also noted that the rate of genome sequencing is continually ramping up.  Many more sequence families will be identified in the coming years as more and more genomes are completed.  With that in mind, John Moult modeled the growth of the number of protein families versus the growth of sequenced genomes.  Based on his estimates, targeting the largest families to provide 70% coverage of the existing genomes will give a similar percentage of coverage for all the genomes that will be completed in the next five to ten years.
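The coverage figures quoted in this section all answer the same question: how many families must be targeted to reach a given fraction of sequences? The sketch below shows one simple way to pose it, a greedy selection that repeatedly picks the family adding the most not-yet-covered sequences. It is a hypothetical illustration, not the method used by any of the speakers; the function name, the family names, and the toy memberships are invented, and multi-domain sequences are allowed to belong to several families.

```python
def greedy_family_selection(families, target_coverage):
    """Pick families until the target fraction of sequences is covered.

    families: dict mapping a family name to the set of sequence IDs it contains.
    Returns (list of chosen family names, achieved coverage fraction).
    """
    all_seqs = set().union(*families.values())
    if not all_seqs:
        return [], 0.0
    covered, chosen = set(), []
    remaining = dict(families)
    while remaining and len(covered) / len(all_seqs) < target_coverage:
        # Largest gain first; ties are broken alphabetically for determinism.
        best = min(remaining, key=lambda name: (-len(remaining[name] - covered), name))
        if not remaining[best] - covered:
            break  # no remaining family adds new sequences
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen, len(covered) / len(all_seqs)


if __name__ == "__main__":
    # Invented toy data: seven sequences spread over four families.
    toy = {
        "PF_A": {"s1", "s2", "s3", "s4"},
        "PF_B": {"s3", "s4", "s5"},
        "PF_C": {"s6"},
        "PF_D": {"s5", "s7"},
    }
    picked, achieved = greedy_family_selection(toy, target_coverage=0.70)
    print(picked, f"{achieved:.0%}")
```

In the toy data the second pick is PF_D rather than the larger PF_B, because most of PF_B is already covered by PF_A. Ranking by raw family size alone would overstate the gain from overlapping families.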

2. How Should Targets Be Selected?

Everyone at the meeting seemed to agree that there is unquestioned merit in trying to determine representative structures from the large protein families.  Determining even a single structure can provide information for a large number of family relatives.  Also, there are reasons to expect that families with many members are likely to have important biological functions.  However, the participants did not agree on a single scheme for protein family classification that should dominate the target selection strategy for PSI-2.

There was also discussion as to the merits of determining representative structures from a single genome, as opposed to a family-based approach.  While there may be health-related reasons for focusing on a single organism, analysis was presented suggesting that structures derived from extensive coverage of a single genome do not provide a particularly broad coverage of protein structures in general.  Therefore, global protein family coverage is recommended to be the primary focus for PSI-2, although there may be specific health-related reasons for including other components. 

3. Implementation of Target Selection

It is likely that the community will not agree on a single target list.  Even if it did, the relatively low probability of success on a per-protein basis would make it difficult to adhere to a strictly prescribed set of targets.  A pre-ordained list would also preclude creative approaches and would likely be counterproductive.  Thus a self-regulating and dynamically adaptive approach to target selection is probably preferable to a top-down prescription from the NIH.
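The effect of a low per-protein success rate on a fixed target list can be shown with a short calculation. The sketch below assumes that attempts on different members of a family succeed independently with the same probability, which is a simplification; the function name is hypothetical, and the 10% figure is used only as an illustration, borrowed from the rough success rate quoted earlier for membrane proteins.

```python
import math


def attempts_needed(p_success, confidence):
    """Smallest n with 1 - (1 - p_success)**n >= confidence.

    Assumes independent attempts with identical success probability.
    """
    if not (0 < p_success < 1 and 0 < confidence < 1):
        raise ValueError("p_success and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_success))


if __name__ == "__main__":
    # Illustrative values only: 10% success per target, 90% desired confidence.
    print(attempts_needed(0.10, 0.90))  # 22
```

Under these assumptions, roughly 22 family members would have to be attempted to be 90% sure of obtaining at least one structure. A list that names a single protein per family therefore cannot be followed strictly.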

There is general agreement to avoid overlap at 30-35% sequence identity, on average.  However, non-uniform coverage at higher and lower granularity will be typical.
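One way to picture the overlap rule is as a greedy filter over a candidate list. The sketch below is hypothetical: it assumes pairwise sequence identities have already been computed by some alignment tool and stored in a dictionary, treats missing pairs as having no detectable similarity, and uses invented target names. It does not describe any center's actual procedure.

```python
def select_nonoverlapping(candidates, identity, threshold=0.30):
    """Keep candidates in priority order, skipping any that overlap a kept target.

    identity: dict mapping frozenset({a, b}) to fractional sequence identity.
    Pairs absent from the dict are assumed to share no detectable similarity.
    """
    kept = []
    for cand in candidates:
        if all(identity.get(frozenset((cand, k)), 0.0) < threshold for k in kept):
            kept.append(cand)
    return kept


if __name__ == "__main__":
    # Invented identities between four hypothetical targets.
    pairwise = {
        frozenset(("t1", "t2")): 0.45,
        frozenset(("t1", "t3")): 0.10,
        frozenset(("t2", "t3")): 0.20,
        frozenset(("t3", "t4")): 0.33,
    }
    order = ["t1", "t2", "t3", "t4"]
    print(select_nonoverlapping(order, pairwise, threshold=0.30))
    print(select_nonoverlapping(order, pairwise, threshold=0.35))
```

The two calls differ only in the threshold. A pair at 33% identity counts as overlap under a 30% cutoff but not under 35%, so the choice within the 30-35% range changes which targets survive.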

One suggestion is to partition the target lists of large centers into three parts:  (1) coarse global genome coverage as determined by a central target selection committee; (2) fine granular coverage of families of particular biomedical interest as determined by the centers; (3) fine granular coverage of families of particular biomedical interest as determined by a central committee taking recommendations from the scientific community.

4. Communication and Coordination Among Centers and with the Community at Large

A clearing house for registering targets and negotiating specific overlap questions would be desirable.  Coordination in Phase 2 will need to be higher than in Phase 1, as the program reaches a more mature, production-oriented phase.  The NIH cooperative agreement mechanism may serve this purpose.

It has also been appreciated for some time that, in addition to structures, the PSI is accumulating a wealth of information related to biological function as well as protein expression, solubility, purification and crystallization.  This needs to be made available to the community at large.  As the initiative moves toward the second phase this information transfer will become more critical.  Several participants reiterated the importance of the PEPCdb that is being developed at the PDB to serve this need.  A similar strategy of more closely coordinated target selection may better serve the initiative.  It may be that some of the individual centers have systems in development that can address this need.  This general question needs to be addressed as the initiative moves forward.

5. The Challenge of Working on Proteins from Higher Eukaryotes

Several of the groups have been exploring the feasibility of determining structures of proteins from higher eukaryotes including human, mouse and C. elegans.  Experience to date suggests that expression of such proteins in E. coli is unsatisfactory in many cases.  Alternative approaches such as the use of baculovirus expression systems are still time-consuming, more expensive, and not suitable for large-scale production.  The comment was made that determination of the structure of a human protein requires much more work than a protein from a prokaryote.  Participants argued for different goals for more challenging proteins.

6. Biological and Biomedical Relevance

There was general recognition that PSI-2 needs to include a component that allows for focus on specific health-related areas.  The balance between the genome coverage goal and the biomedical impact goal was discussed, but not clearly established.

7. Membrane Proteins

During the past two or three years the number of structures of intrinsic membrane proteins has jumped dramatically.  Individuals knowledgeable in this area feel that there is considerable reason for optimism and that the number of membrane protein structures being determined may continue to increase rapidly.  If so, this is extremely encouraging news for the PSI as a whole, since membrane-bound proteins make up 25% or so of a typical proteome.  It was noted that while many of the first membrane proteins whose structures were determined were abundant and could be purified from their native environment, this situation has been changing with the successful structure determinations of over-expressed prokaryotic channels and transporters.  The development of robust systems to overexpress eukaryotic membrane proteins that are of intrinsically low abundance remains a critical challenge.

8. Protein Complexes

Both theoretically-based and experimentally-based methods were described for identifying proteins that form complexes.  These approaches are becoming more powerful but have not been developed to the point that they can routinely facilitate the purification, crystallization and structure determination of such complexes.  If anything, recent results tend to emphasize the complexity that may underlie protein-protein associations.  Further work is clearly required in this area.

Submitted by the Protein Structure Initiative Advisory Committee

David Davies

Chris Dobson

Eaton Lattman

Brian Matthews, Chair

Rowena Matthews

Franklyn Prendergast

Chris Sander

Gerhard Wagner

Cathy Wu

December 11, 2003