NIGMS Structural Genomics Targets Workshop

Location

NIH campus, Bethesda, MD.

Start Date

Thu, 02/11/1999 - 8:00 AM

End Date

Fri, 02/12/1999 - 5:30 PM

Introduction
Summary
Session 1: Applications of Structural Genomics
Session 2: Extracting Knowledge from Structures
Session 3: Sequence Family Organization
Session 4: Current and Planned Structural Genomics Initiatives
Session 5: Protein Structure Initiative website
Session 6: Coordination and Cooperation
Participants, Observers, and Federal Agency Staff in Attendance

Introduction

This report describes the proceedings and outcome of a workshop on structural genomics that was sponsored by the National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH). The workshop was held on the NIH campus in Bethesda, MD, on February 11-12, 1999. It focused on issues concerned with the selection of targets for structural genomics. There were 34 participants from the United States, the United Kingdom, France, Germany, Israel, and Japan, as well as Federal staff and scientific observers. A full list of attendees is available. The workshop was organized by Dr. Marvin Cassman, director of NIGMS, and Dr. John Norvell, a program director at NIGMS. Two earlier NIGMS workshops on structural genomics have been held, and reports from these meetings are available on the NIGMS website at http://www.nigms.nih.gov/News/Meetings/.

Summary

Inspired by the scientific impact of the genome sequencing projects, there has been considerable agreement on the value of a large-scale structure determination project designed to cover all protein structures. This initiative is often referred to as "structural genomics." Two important near-term objectives are: (1) selecting targets for structure determination so as to obtain maximum information return on the total effort and (2) developing a mechanism that will facilitate cooperation and prevent work duplication between the experimental laboratories involved.

It is tempting to declare the goal of structural genomics as the determination of the 3-D structure of all human proteins or of the complete sets of proteins in particular functional classes, such as all enzymes or all cell-surface receptors. However, in spite of tremendous technical advances and falling costs per protein structure determined in the laboratory, solving on the order of 100,000 structures represents a huge effort. There are, fortunately, approaches to reducing the scale of such an effort and to achieving early returns.

The first is to organize protein sequence databases in terms of families and to target only one representative in each family for experimental structure determination. Model building by homology may then be used to produce reasonable structures of the other family members based on sequence similarity. The scientific basis for this is strong conservation of protein 3-D shape across large evolutionary distances, within single species, and between species--in spite of sequence variation. This also provides the option to choose any one of the proteins in a family as the structure target rather than to struggle to achieve over-expression and crystallization of one particular protein.

Another approach to economizing on effort is to prioritize protein families in terms of scientific interest and to solve high-priority structures first. This is similar in approach to early emphasis on cDNA sequencing before completing a genomic sequence.

The workshop brought together structural and computational biologists, decision makers, and representatives of funding agencies, as well as scientific observers, to discuss these issues. As presented in this report, a number of projects are in place in different countries, and rapid progress is being made toward forming larger initiatives that are likely to establish structural genomics as a significant field over the next 12 months.

Recommendations were also presented--and well received--to work toward open global exchange of scientific information and a productive level of international coordination aimed at reducing duplication of effort and based on active exchange of information on the targets selected for structure determination.

Given the current global production of more than 1,000 structures per year, a reasonable challenge for the global structural genomics effort can be encapsulated in the words "10,000 structures within 5 years."

Session 1: Applications of Structural Genomics

The workshop opened with presentations from Stephen Burley (Rockefeller University/HHMI), Gregory Petsko (Brandeis University), and Leigh English (Monsanto Company), covering some of the medical, pharmaceutical, and agricultural implications of a systematic program of protein structure determination. There was agreement among all three speakers that a structural genomics program would have considerable collateral benefit for biomedical research.

Specific scenarios were presented by Burley, suggesting the ways in which a very large-scale effort (i.e., going beyond simply doing 10,000 structures of individual domains) could provide valuable information regarding the structure and function of human protein and RNA gene products involved in the development and/or treatment of disease. A similar strategy was advocated for bacterial and viral virulence factors. In addition, the availability of large amounts of highly purified proteins (and RNAs) could enable a large-scale effort aimed at functional characterization of these same gene products. Burley closed by asserting that the medical benefits of structural genomics are likely to result in a substantive improvement in our ability to understand a large number of interrelated normal and pathogenic biological processes.

Petsko sounded a note of caution against the danger of overselling the immediate promise of structural genomics for drug design, a point also made by Burley during his introduction. Citing various statistics concerning the difficulty of developing a drug, Petsko went on to explain how a structural genomics effort could contribute to the drug discovery process. He highlighted the possibility of identifying new targets and understanding the binding of lead compounds to these target proteins or RNAs. In addition, Petsko presented some sobering examples of protein structures that did not provide any significant insight into biochemical function. He also emphasized the need for functional characterization.

The presentation by English on the agricultural impact of structural genomics demonstrated that there is already considerable effort being devoted to structure-based characterization of proteins with favorable properties, such as herbicide or insect pest resistance. It seemed clear to the audience that many of the caveats mentioned by Petsko would apply to the role of structural biology in the agricultural industry. English was more enthusiastic about the financial and social benefits of structural genomics accruing to agriculture than Petsko was for the pharmaceutical industry.

Session 2: Extracting Knowledge from Structures

This session focused on ways of using structural knowledge, specifically: (1) using solved structures as the starting point for solving related structures in crystallography; (2) building models of protein structures based on evolutionary homology to proteins determined experimentally; and (3) deducing function from both experimental structures and models. The ability to build useful models of proteins in a family by homology to a solved representative is a key cost-saving factor in structural genomics.

Paula Fitzgerald (Merck Research Laboratories) discussed the prospects of combining a basis set of structures from the structural genomics project with molecular replacement methods to obtain X-ray structures of other proteins. Molecular replacement techniques use a model of the structure of interest and systematically search all possible orientations and translations of the model within the asymmetric unit of a crystal, assessing the fit in terms of the agreement between observed and calculated diffraction amplitudes. When successful, the technique avoids the necessity of experimentally determining the phases of the diffraction data, and therefore, greatly simplifies the task of obtaining the structure. In particular, only "native" data are needed, and the tunable wavelength provided by a synchrotron is not necessary, so structures can be solved on apparatus found in most labs. On the negative side, model bias in the initial structure is hard to eliminate, and hence, refinement can take much longer than with a de novo structure determination.
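The scoring step of such a search can be sketched as follows. This is a minimal illustration and not the method used by any particular molecular replacement program: it assumes the fit is measured with the conventional crystallographic R-factor (the report only says "agreement between observed and calculated diffraction amplitudes"), and the function names and the dictionary of trial placements are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def r_factor(f_obs: Sequence[float], f_calc: Sequence[float]) -> float:
    """Residual between observed and calculated amplitudes (lower is better).

    Assumption: R = sum(| |Fobs| - k*|Fcalc| |) / sum(|Fobs|), with a single
    overall scale factor k that puts both sets on the same scale.
    """
    if len(f_obs) != len(f_calc) or len(f_obs) == 0:
        raise ValueError("amplitude lists must be non-empty and of equal length")
    total_obs = sum(abs(f) for f in f_obs)
    total_calc = sum(abs(f) for f in f_calc)
    if total_obs == 0 or total_calc == 0:
        raise ValueError("amplitudes must not all be zero")
    scale = total_obs / total_calc
    residual = sum(abs(abs(o) - scale * abs(c)) for o, c in zip(f_obs, f_calc))
    return residual / total_obs


def best_placement(
    f_obs: Sequence[float], trials: Dict[str, Sequence[float]]
) -> Tuple[str, float]:
    """Return the trial placement whose calculated amplitudes fit best.

    `trials` maps a label for each orientation/translation of the search
    model (hypothetical labels) to the amplitudes calculated for it.
    """
    scores = {name: r_factor(f_obs, f_calc) for name, f_calc in trials.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    observed = [10.0, 20.0, 30.0]
    trial_amplitudes = {
        "orientation_a": [30.0, 20.0, 10.0],
        "orientation_b": [5.0, 10.0, 15.0],
    }
    print(best_placement(observed, trial_amplitudes))
```

A real search would compute the calculated amplitudes from the rotated and translated model by Fourier transform and would sample orientations and translations on a fine grid; only the comparison step is shown here.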

Successful use of molecular replacement primarily depends on two factors: (1) the quality of the model of the protein available and (2) the nature of the crystal symmetry, packing, and diffraction quality. Generally, sequence identity of at least 35% between the model structure and the target protein is desirable, but what counts is the structural similarity between the two proteins, and that does not map directly to overall sequence identity. Two proteins with very low sequence identity may have very similar structures, while a hinge motion conformational change may make even a structure with 100% sequence identity to the target an unsuitable model. The number of insertions and deletions in one sequence relative to another is also a significant factor.

The best search model is usually a compromise between completeness and accuracy. A better signal is obtained by omitting regions of doubtful reliability, such as side chains, high temperature factor segments, and sites of possible insertions and deletions. Desirable properties of the crystal are a low symmetry space group, a high solvent content, and a low number of molecules in the asymmetric unit, along with accurately measured strong diffraction data.

Because of the experimental advantages, molecular replacement will be used to solve many structures of proteins related to the structural genomics set. The finer the granularity of the sampling of structural space, the more useful the technique will be. However, the downsides--problems of model bias and suboptimal crystal forms--will mean it is not a universal solution.

Model building by homology was covered by Andrej Sali (Rockefeller University), who discussed the reliability of comparative modeling in the context of structural and functional genomics. In general, mistakes in comparative modeling include side chain packing errors, small distortions and rigid body shifts in correctly aligned regions, errors in inserted regions (loops), incorrect alignments, and incorrect templates.

Fortunately, a 3-D model does not have to be perfect to be helpful in biology. The type of question that can be addressed with a particular model depends on its accuracy. A convenient and simple predictor of model accuracy is the percentage sequence identity to the template on which the model was based. This relationship was described for an automated comparative modeling procedure implemented in the computer program "Modeller." There are three accuracy classes for comparative models.

At the low end of the accuracy spectrum are models that are based on less than 30% sequence identity and sometimes have fewer than 50% of the C-alpha atoms within 3.5 Å of their correct positions. However, such models still have the correct fold, and even knowing only the fold of a protein is frequently sufficient to predict its approximate biochemical function.

In the middle of the accuracy spectrum are the models based on approximately 35% sequence identity, corresponding to 85% of the C-alpha atoms modeled within 3.5 Å of their correct positions. In such cases, it is frequently possible to predict correctly important features of the target protein that do not occur in the template structure. For example, the location of a binding site can be predicted from clusters of charged residues, and the size of a ligand can be predicted from the volume of the binding site cleft.

3-D models are also useful because some binding and active sites that cannot be found by searching for local sequence patterns should frequently be detectable by searching for small 3-D motifs that are known to bind or act on specific ligands. Medium-resolution models frequently allow a refinement of the functional prediction based on sequence alone because ligand binding is most directly determined by the structure of the binding site rather than its sequence.

Even when the conserved binding sites are present in the templates, comparative models can still add value to the sequence-based analysis. For example, they can be used to construct site-directed mutants with altered or destroyed binding capacity, which in turn could test hypotheses about the sequence-structure-function relationships.

Other problems that can be addressed with medium resolution comparative models include using molecular replacement in X-ray crystallography; designing proteins that have compact structures without long tails, loops, and exposed hydrophobic residues for better crystallization; or designing proteins with added disulfide bonds for extra stability.

The high end of the accuracy spectrum corresponds to models based on 50% sequence identity or more. The average accuracy of these models approaches that of low resolution X-ray structures (3 Å resolution) or medium resolution nuclear magnetic resonance (NMR) structures (ten distance restraints per residue). In addition to the already listed applications, high-quality models can be used for docking of small ligands into a protein or for docking of a protein to a protein.
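The three accuracy classes can be summarized as a simple lookup from sequence identity to expected model use. The sketch below is illustrative only: the report gives "less than 30%", "approximately 35%", and "50% or more" as reference points, so the exact cutoffs between classes (30% and 50%) are an assumption, and the function and dictionary names are hypothetical.

```python
# Applications named in the report for each accuracy class.
APPLICATIONS = {
    "low": ["fold assignment", "approximate biochemical function"],
    "medium": [
        "binding site location",
        "site-directed mutant design",
        "molecular replacement",
        "construct design for crystallization",
    ],
    "high": ["small-ligand docking", "protein-protein docking"],
}


def accuracy_class(percent_identity: float) -> str:
    """Classify a comparative model by target-template sequence identity.

    Assumed cutoffs: below 30% is "low", 30% up to 50% is "medium",
    and 50% or more is "high".
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity < 30.0:
        return "low"
    if percent_identity < 50.0:
        return "medium"
    return "high"


def suggested_uses(percent_identity: float) -> list:
    """List the uses of a model, including those of less accurate classes."""
    order = ["low", "medium", "high"]
    reached = order.index(accuracy_class(percent_identity))
    uses = []
    for name in order[: reached + 1]:
        uses.extend(APPLICATIONS[name])
    return uses


if __name__ == "__main__":
    for identity in (25.0, 35.0, 60.0):
        print(identity, accuracy_class(identity), suggested_uses(identity))
```

Sequence identity is only a convenient predictor; long insertions or a poor alignment can place a model in a lower class than its identity suggests.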

Most models that can currently be calculated for whole genomes are spread evenly between the low- and medium-accuracy classes. Large-scale comparative modeling opens new opportunities for tackling existing problems by virtue of providing many protein models from many genomes. One example is the selection of a target protein for which a drug needs to be developed. A good choice is a protein that is likely to have high ligand specificity. Large-scale modeling facilitates imposing the specificity filter in target selection by enabling a structural comparison of the ligand binding sites of many proteins, either human or from other organisms.

In summary, there is a sharp rise in information when a comparative model is based on more than 30% sequence identity and has no long insertions relative to the template. Thus, a useful sampling density in target selection for structural genomics would ensure that every sequence has at least 30% sequence identity to a known structure. Even finer sampling is desired for more important families. This implies at least 10,000 targets for structural genomics.
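One way to picture this sampling criterion is a greedy selection of targets, in which a sequence becomes a new target only if no target chosen so far is within the identity threshold. This is a sketch under stated assumptions, not the procedure used by any group at the workshop: the greedy strategy, the function names, and the toy identity measure (which compares pre-aligned sequences of equal length position by position) are all illustrative stand-ins for real alignment and clustering tools.

```python
from typing import Callable, List, Sequence


def percent_identity(a: str, b: str) -> float:
    """Toy identity measure for two pre-aligned sequences of equal length."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def pick_targets(
    sequences: Sequence[str],
    identity: Callable[[str, str], float],
    threshold: float = 30.0,
) -> List[str]:
    """Greedily choose targets so every sequence is covered by some target.

    A sequence is "covered" when its identity to an already chosen target
    is at least `threshold`; uncovered sequences become new targets.
    """
    targets: List[str] = []
    for seq in sequences:
        if not any(identity(seq, t) >= threshold for t in targets):
            targets.append(seq)
    return targets


if __name__ == "__main__":
    family = ["AAAAAAAAAA", "AAAAACCCCC", "CCCCCCCCCC"]
    chosen = pick_targets(family, percent_identity, threshold=30.0)
    print(chosen, "relative effort:", len(chosen) / len(family))
```

The result of a greedy pass depends on the order of the input, so the number of targets it returns is an upper estimate rather than a minimum.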

John Moult (Center for Advanced Research in Biotechnology, University of Maryland) discussed the current and emerging methods for deriving functional information from structure. The information derived from structural genomics is primarily about molecular function--the binding sites and their specificity, catalytic activities, and regulation by conformational changes and covalent modifications. This type of information provides one component of the complete array of data constituting functional genomics, with cellular and phenotypic function being provided by complementary techniques such as transcriptional mapping and genetics.

Molecular function can be understood at different levels of resolution. A low-resolution description will identify the class of function--the protein is an enzyme, or a DNA-binding protein, or a cell receptor, for example. At medium resolution, the primary function can be identified--it's a beta-lactamase enzyme, or a protease, or a DNA-binding protein that acts as a repressor, for example. At high resolution, the primary specificity is known, e.g., the protein is a beta-lactamase effective against penicillins rather than cephalosporins, or the repressor acts on the lac operon. Methods of deriving function from structure strive to get to as high a resolution as possible.

Goals include: (1) the assignment of function to "hypothetical" proteins and proteins with only a cellular or phenotypic function identified; (2) the assignment of functional differences within a sequence family; and (3) the interpretation of data on disease-associated single nucleotide polymorphisms (SNPs). Experience to date with assigning function to experimental structures of "hypotheticals" has been encouraging, with medium resolution descriptions usually obtained in a straightforward manner. SNP functional analysis and the assignment of function within sequence families usually depend on the quality of the model that can be built, in addition to the functional analysis tools available. There are so far few examples of structures of proteins with SNPs analyzed, but it is expected that in the medium- to high-resolution range of homology modeling, useful information will be obtained.

For assignment of function within families, the critical factor becomes the quality of the model. Experience with testing of modeling methods at CASP (Critical Assessment of Techniques for Protein Structure Prediction) shows that "threading" methods are not yet developed to the point where they can reliably identify the fold a sequence adopts. Further, when the methods are successful, errors in the resulting models are generally too high to allow more than low resolution functional information to be derived. These results, and data from homology modeling tests, strongly suggest that to be effective, structural genomics must sample sequence space such that every remaining sequence is at least 30%, and preferably 40%, identical to that of a sampled structure.

There are a number of computational tools already developed that are useful for deducing function from structure, and more are on the way. Surface electrostatics, as displayed, e.g., by the program GRASP (A. Nicholls, K. Sharp, and B. Honig, Columbia University), can usually identify DNA and RNA binding sites, and occasionally, other functional features as well. Janet Thornton's group (University College London) has shown that small ligand binding sites are almost always associated with the largest depressions in the surface of a protein, and these can be found automatically or by visual inspection. Examining the extent of residue conservation of a family of sequences displayed on the surface of a structure has emerged as a powerful method of finding more general functional features, particularly protein-protein interaction sites. Several groups are investigating the extent to which three-dimensional catalytic motifs can be catalogued and used to identify the catalytic function of new structures. Finally, the methods developed in drug design to identify potential lead compounds are expected to be applicable to deducing ligand-binding specificity.
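The residue conservation analysis mentioned above starts from a per-position conservation score for the family alignment. The sketch below shows one possible scoring scheme and is not taken from the workshop: it assumes conservation is measured as one minus the normalized Shannon entropy of each alignment column, with "-" marking gaps, and the function name is hypothetical. Mapping the scores onto the surface of a structure is a separate step that is not shown.

```python
import math
from collections import Counter
from typing import List, Sequence


def column_conservation(alignment: Sequence[str]) -> List[float]:
    """Score each alignment column from 0 (variable) toward 1 (conserved).

    Assumption: score = 1 - H / log2(20), where H is the Shannon entropy
    of the residues in the column and gaps ("-") are ignored.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(20)  # 20 standard amino acids
    scores: List[float] = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column carries no information
            continue
        counts = Counter(residues)
        total = len(residues)
        entropy = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
        scores.append(1.0 - entropy / max_entropy)
    return scores


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "ACF"]
    print(column_conservation(toy_alignment))
```

Positions with high scores that cluster together on the protein surface would be the candidates for functional sites such as protein-protein interfaces.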

Overall, results so far are most encouraging, and even without further improvement in modeling methods, large-scale structural genomics will provide a wealth of detailed functional information. Emerging techniques will provide additional tools for defining function. However, in many cases, other experimental data will be required to confirm hypotheses and to extend the functional description to high resolution.

Session 3: Sequence Family Organization

This session reviewed what is known about protein sequences organized into families and how properties of protein families can be used to prioritize families for structure determination and to pick one or more family representatives as structural targets.

A critical requirement for the selection of targets for structural genomics is the comprehensive organization of protein sequences into families. Key international research groups working on large-scale clustering of protein sequences contributed to the workshop by presenting their scientific methodology and by sharing their protein family collection on the Protein Structure Initiative (PSI) website.

The providers of organized protein families, family attributes, and pre-selected targets were (presenter in italics):

COGs
Yuri I. Wolf, Roman L. Tatusov, Michael Y. Galperin, Eugene V. Koonin--National Institutes of Health, Bethesda, Maryland

Pfam
Alex Bateman, Ewan Birney, Richard Durbin, Sean Eddy, Kevin Howe, Erik Sonnhammer--Sanger Center, Cambridge, UK; Washington University, St. Louis, Missouri; Karolinska Institute, Stockholm, Sweden

Picasso
Liisa Holm--EMBL-EBI, European Bioinformatics Institute, Cambridge, UK

PIR
Winona C. Barker, Geetha Y. Srinivasarao, Lai-Su Yeh, Cristopher R. Marzec, J.S. Garavelli, Friedhelm Pfeiffer, Cathy Wu, Hongzhan Huang--Georgetown University Medical Center, Washington, D.C.

ProClass
Cathy H. Wu, Hongzhan Huang--The University of Texas Health Center at Tyler, Texas

ProDom
Florence Servant, Jerome Gouzy, Florence Corpet, Daniel Kahn--INRA/CNRS, Toulouse, France

ProtoMap
Elon Portugaly, Nathan Linial, Golan Yona, Michal Linial--Hebrew University, Jerusalem, Israel

Systers
Antje Krause, Martin Vingron--German Cancer Research Center, Heidelberg, Germany

Groups used different methods for sequence alignment and clustering, resulting in different family collections organized at different levels of granularity. The collections are not strictly comparable, as different size protein sequence collections were used, as well as different criteria for family membership and different levels of automation, including extensive hand curation for some of the collections. Details of the methodology are available directly on the groups' websites, and an overview is on the PSI website.

The Human Genome Project and other sequencing projects have so far provided about 400,000 protein sequences, of which about 200,000 are unique at the level of 90% sequence identity. These sequences are partly derived from tens of complete microbial genomes, the complete C. elegans genome, as well as a rapidly growing fraction of all proteins from the human genome and other model organisms. Although we are far from complete coverage, this collection is an extremely rich source of targets for structural genomics.

The family collections and pre-selected targets submitted by the different groups (see the PSI website) give a first overview of the scope of the problem. A key issue is the overall scale of the effort. In spite of the incompleteness of current sequence collections and the differences in methodology, clear trends are apparent. Choosing different levels of granularity, i.e., imposing different quality criteria on models built by homology from representative, experimentally determined structures, leads to different numbers of family representatives as targets (Fig. 1).

Figure 1:

Total effort versus model quality. The graph can be used to estimate the relative savings that come from solving representative structures and then modeling their neighbors by homology, rather than solving all structures. (The band represents values available from different approaches to organizing various protein sequence collections--see below.) In choosing the desired level of savings, there is a tradeoff between model quality (increases toward the right) and savings (increases toward the left). A reasonable initial specification sets desired model quality at the minimal sequence agreement between model and template at 35% identical residues (indicated by a dot). Making several assumptions about the degree of family unification, this corresponds to solving the structure of about 1 in every 10-16 proteins (relative effort 0.1-0.06). How many structures does this correspond to in absolute terms? An effort of 1.0 corresponds to solving a 3-D structure for every known protein (currently about 200,000 after removal of one of every pair above 90% sequence identity); an effort of 0.06 therefore implies solving 12,000 structures. Over the next few years, one may expect that the total number of known protein sequences across important model organisms will rise 5- to 10-fold, to 1 to 2 million, while the level of family unification as a result of sequence comparison will probably also increase substantially--two opposing effects, with currently unknown numerical outcome on the estimated total effort for a comprehensive structural genomics project.

Details: Effort is defined as the ratio r = (number of structures solved experimentally)/(total number of protein sequences modeled in 3-D). Model quality is measured by the minimum required sequence agreement, in percent identical amino acids (% idAA), between modeled and experimental structure (see text); this measure can be translated to accuracy in atomic positions ("rmsd") using well-established methods. The graph is based on preliminary data from different protein family collections. Each collection starts from a different total number of sequences (equivalenced in this figure at an effort of 1.0, for comparison) and uses a different method of family unification (apparent from the width of the band). Details of the family collections are at http://www.structuralgenomics.org and at the groups' own Web pages. Graph compiled by D. Vitkup, E. Melamud, J. Moult, and C. Sander (unpublished).

For reasonable parameters, the total effort in "number of structures solved to cover all currently known protein families" is on the order of 12,000 structures (for a homology building 'distance' corresponding to 35% identical residues between model and template).
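The arithmetic behind these estimates can be checked with a few lines of code. The figures (200,000 non-redundant sequences, relative effort of 0.06 to 0.1, a 5- to 10-fold rise in known sequences) come from the report; the function names are hypothetical, and the projection at the end assumes the relative effort stays unchanged, which the report explicitly leaves open because family unification is expected to increase.

```python
def relative_effort(structures_solved: int, sequences_modeled: int) -> float:
    """Effort r = structures solved experimentally / sequences modeled in 3-D."""
    if sequences_modeled <= 0:
        raise ValueError("sequences_modeled must be positive")
    return structures_solved / sequences_modeled


def structures_needed(total_sequences: int, effort: float) -> int:
    """Number of experimental structures implied by a given relative effort."""
    if not 0.0 <= effort <= 1.0:
        raise ValueError("effort must be between 0 and 1")
    return round(total_sequences * effort)


if __name__ == "__main__":
    known_sequences = 200_000  # unique at the 90% identity level

    # Range quoted for 35% identity between model and template.
    low = structures_needed(known_sequences, 0.06)
    high = structures_needed(known_sequences, 0.10)
    print(f"current estimate: {low} to {high} structures")

    # Hypothetical projection: sequence count rises 5- to 10-fold while the
    # relative effort is (unrealistically) held fixed at 0.06.
    for fold in (5, 10):
        projected = structures_needed(known_sequences * fold, 0.06)
        print(f"{fold}-fold more sequences, same effort: {projected} structures")
```

Because greater family unification lowers the relative effort as the sequence count grows, the projected numbers are best read as upper bounds under that fixed-effort assumption.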

Although there are unresolved issues with respect to special protein classes, such as the difficulty of crystallization of membrane proteins, the scale of the problem of large-scale structural genomics is emerging in quantitative detail. Most of the human genome sequence is likely to become available next year. Consequently, the sequence analysis research groups will probably be in a position to determine the definitive number of target structures needed for exhaustive coverage of all human protein families by the middle of the year 2000.

Session 4: Current and Planned Structural Genomics Initiatives

This session provided an overview of ongoing and planned structural genomics projects in the United States, Europe, and Japan.

The structural genomics initiatives fall into two parts: "pilot projects" underway in various groups in the United States and developing projects elsewhere in the world. Six U.S. pilot projects were represented. David Eisenberg (UCLA) described ongoing work on the aerobic archaeon Pyrobaculum aerophilum. The experimental part of this project is focused on assigning folds to gene products and determining structures of medically relevant proteins. Medical relevance is identified by homology with protein sequences in the National Center for Biotechnology Information's Online Mendelian Inheritance in Man database, and a surprising 24% of open reading frames are so related. Theoretical aspects of the project focus on predicting which open reading frames represent new folds, the prediction of protein-protein interactions, and pathway identification.

Sung Ho Kim (UC Berkeley) described work in his group on the hyperthermophilic archaeon Methanococcus jannaschii. The advantages of this organism are its "deep rooted" position in the phylogenetic tree of organisms, and the relative ease of protein purification and crystallization from hyperthermophiles.

Bill Studier (Brookhaven National Laboratories) described work on a joint BNL/Rockefeller project focused on proteins from yeast. Yeast was chosen because of the high proportion of human-related proteins and the possibility of involving the large community of scientists already working in this area. The initial emphasis is on testing the feasibility of high-throughput methods.

Guy Montelione (Rutgers University) described the only U.S. project placing more emphasis on NMR than on X-ray crystallography as the primary experimental method of structure determination. Targets are selected on the basis of relevance to human disease, broadly conserved metazoan genes, and genes of human pathogens. Rather than focusing on one organism, a set of homologous proteins from different sources is expressed.

Paul Bash (Northwestern University) described a project aimed at broad coverage of structure space, particularly obtaining a "basis set" of protein folds spanning structure and sequence space and a set of structures with a wide phylogenetic distribution--in some sense universal proteins. Goals are discovering new folds, providing a set of structures to be used for molecular replacement in crystallography, and obtaining an understanding of protein evolution. The group has contracted out the production of protein and finds this a satisfactory and economic solution for that stage of the process.

John Moult (CARB, University of Maryland) reported on a joint CARB/TIGR project to obtain structures for the unannotated open reading frames from Haemophilus influenzae (HI). HI was chosen because it is the smallest free-living genome available. The primary goal of the project is to directly and indirectly obtain information about the function of these "hypothetical" proteins, and thus, speed discovery of the complete biochemical processes in a simple cell. To this end, an important part of the project is to make the information and materials obtained available to a wide community of experimentalists.

A total of approximately 12 structures were reported as already solved in these pilot projects. Analysis of these structures is generally still in progress, but for the proteins where the function was previously unknown, some functional insight has been easily obtained in all but one instance. Many more proteins are in the pipeline, and we can expect a flood of structures in the next year. All groups reported an encouraging level of success with expression, purification, and crystallization. Although it is still early to draw definite conclusions, results from these pilot projects are very promising. It was apparent that each group has taken a quite different view of how to choose targets.

On the international landscape, Udo Heinemann (Max Delbrueck Center, Berlin) described the "Protein Structure Factory." This is a consortium of groups in the Berlin region, funded by the German Ministry for Research (BMBF) with approximately $20 million over 5 years, and with an expectation of producing at least 100 structures. Among others, thermostable proteins will be studied, primarily selected to represent new folds, antibiotic targets, or translation components. Proteins will be expressed in E. coli, and there will be a strong emphasis on automation and parallel processing at all stages--particularly for protein production and crystallization. Dedicated beam facilities will be built at the BESSY II synchrotron in Berlin.

Shigeyuki Yokoyama (University of Tokyo) described three Japanese initiatives in structural genomics. Most advanced is a very large-scale investment in NMR facilities to be partly used for determining t

This page last updated on 02/25/2025 5:06 PM