NIGMS Structural Genomics Targets Workshop

February 11-12, 1999


Introduction

This report describes the proceedings and outcome of a workshop on structural genomics that was sponsored by the National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH). The workshop was held on the NIH campus in Bethesda, MD, on February 11-12, 1999. It focused on issues concerned with the selection of targets for structural genomics. There were 34 participants from the United States, the United Kingdom, France, Germany, Israel, and Japan, as well as Federal staff and scientific observers. A full list of attendees is available. The workshop was organized by Dr. Marvin Cassman, director of NIGMS, and Dr. John Norvell, a program director at NIGMS. Two earlier NIGMS workshops on structural genomics have been held, and reports from these meetings are available on the NIGMS Web site.


Summary

Inspired by the scientific impact of the genome sequencing projects, there has been considerable agreement on the value of a large-scale structure determination project designed to cover all protein structures. This initiative is often referred to as "structural genomics." Two important near-term objectives are: (1) selecting targets for structure determination so as to obtain maximum information return on the total effort and (2) developing a mechanism that will facilitate cooperation and prevent work duplication between the experimental laboratories involved.

It is tempting to declare the goal of structural genomics as the determination of the 3-D structure of all human proteins or of the complete sets of proteins in particular functional classes, such as all enzymes or all cell-surface receptors. However, in spite of tremendous technical advances and falling costs per protein structure determined in the laboratory, solving on the order of 100,000 structures represents a huge effort. There are, fortunately, approaches to reducing the scale of such an effort and to achieving early returns.

The first is to organize protein sequence databases in terms of families and to target only one representative in each family for experimental structure determination. Model building by homology may then be used to produce reasonable structures of the other family members based on sequence similarity. The scientific basis for this is strong conservation of protein 3-D shape across large evolutionary distances, within single species, and between species--in spite of sequence variation. This also provides the option to choose any one of the proteins in a family as the structure target rather than to struggle to achieve over-expression and crystallization of one particular protein.

Another approach to economizing on effort is to prioritize protein families in terms of scientific interest and to solve high-priority structures first. This is similar in approach to early emphasis on cDNA sequencing before completing a genomic sequence.

The workshop brought together structural and computational biologists, decision makers, and representatives of funding agencies, as well as scientific observers, to discuss these issues. As presented in this report, a number of projects are in place in different countries, and rapid progress is being made toward forming larger initiatives that are likely to establish structural genomics as a significant field over the next 12 months.

Recommendations were also presented--and well received--to work toward open global exchange of scientific information and a productive level of international coordination aimed at reducing duplication of effort and based on active exchange of information on the targets selected for structure determination.

Given the current global production of more than 1,000 structures per year, a reasonable challenge for the global structural genomics effort can be encapsulated in the words "10,000 structures within 5 years."


Session 1: Applications of Structural Genomics

The workshop opened with presentations from Stephen Burley (Rockefeller University/HHMI), Gregory Petsko (Brandeis University), and Leigh English (Monsanto Company), covering some of the medical, pharmaceutical, and agricultural implications of a systematic program of protein structure determination. There was agreement among all three speakers that a structural genomics program would have considerable collateral benefit for biomedical research.

Specific scenarios were presented by Burley, suggesting the ways in which a very large-scale effort (i.e., going beyond simply doing 10,000 structures of individual domains) could provide valuable information regarding the structure and function of human protein and RNA gene products involved in the development and/or treatment of disease. A similar strategy was advocated for bacterial and viral virulence factors. In addition, the availability of large amounts of highly purified proteins (and RNAs) could enable a large-scale effort aimed at functional characterization of these same gene products. Burley closed by asserting that the medical benefits of structural genomics are likely to result in a substantive improvement in our ability to understand a large number of interrelated normal and pathogenic biological processes.

Petsko sounded a note of caution against the danger of overselling the immediate promise of structural genomics for drug design, a point also made by Burley during his introduction. Citing various statistics concerning the difficulty of developing a drug, Petsko went on to explain how a structural genomics effort could contribute to the drug discovery process. He highlighted the possibility of identifying new targets and understanding the binding of lead compounds to these target proteins or RNAs. In addition, Petsko presented some sobering examples of protein structures that did not provide any significant insight into biochemical function. He also emphasized the need for functional characterization.

The presentation by English on the agricultural impact of structural genomics demonstrated that there is already considerable effort being devoted to structure-based characterization of proteins with favorable properties, such as herbicide or insect pest resistance. It seemed clear to the audience that many of the caveats mentioned by Petsko would apply to the role of structural biology in the agricultural industry. English was more enthusiastic about the financial and social benefits of structural genomics accruing to agriculture than Petsko was for the pharmaceutical industry.


Session 2: Extracting Knowledge from Structures

This session focused on ways of using structural knowledge, specifically: (1) using solved structures as the starting point for solving related structures in crystallography; (2) building models of protein structures based on evolutionary homology to proteins determined experimentally; and (3) deducing function from both experimental structures and models. The ability to build useful models of proteins in a family by homology to a solved representative is a key cost-saving factor in structural genomics.

Paula Fitzgerald (Merck Research Laboratories) discussed the prospects of combining a basis set of structures from the structural genomics project with molecular replacement methods to obtain X-ray structures of other proteins. Molecular replacement techniques use a model of the structure of interest and systematically search all possible orientations and translations of the model within the asymmetric unit of a crystal, assessing the fit in terms of the agreement between observed and calculated diffraction amplitudes. When successful, the technique avoids the need to determine the phases of the diffraction data experimentally and therefore greatly simplifies the task of obtaining the structure. In particular, only "native" data are needed, and the tunable wavelength provided by a synchrotron is not necessary, so structures can be solved on apparatus found in most labs. On the negative side, model bias in the initial structure is hard to eliminate, and hence refinement can take much longer than with a de novo structure determination.
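The scoring step described above can be illustrated with a minimal Python sketch. It assumes that calculated amplitudes are already available for each trial placement of the model, and it uses the conventional crystallographic R-factor as the measure of agreement. The function names, the simple linear scaling, and the toy amplitude values are illustrative assumptions; actual molecular replacement programs rely on more elaborate rotation and translation functions.

```python
def r_factor(f_obs, f_calc):
    """Conventional R-factor: sum(| |Fobs| - k|Fcalc| |) / sum(|Fobs|).

    f_obs and f_calc are equal-length lists of positive amplitudes.
    k is a simple linear scale that matches the two totals (an assumption
    made for this sketch).
    """
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("amplitude lists must be non-empty and equal in length")
    scale = sum(f_obs) / sum(f_calc)
    mismatch = sum(abs(o - scale * c) for o, c in zip(f_obs, f_calc))
    return mismatch / sum(f_obs)


def best_placement(f_obs, candidates):
    """Return (label, R) for the trial placement with the lowest R-factor.

    candidates maps a hypothetical placement label to the amplitudes
    calculated from the search model in that orientation and position.
    """
    scored = {label: r_factor(f_obs, f_calc) for label, f_calc in candidates.items()}
    best = min(scored, key=scored.get)
    return best, scored[best]


# Toy data: placement "a" reproduces the observed amplitudes, "b" does not.
observed = [10.0, 20.0, 30.0]
trials = {"a": [5.0, 10.0, 15.0], "b": [30.0, 20.0, 10.0]}
label, score = best_placement(observed, trials)
```

A lower R-factor indicates better agreement. In this toy case placement "a" differs from the observed data only by a scale factor, so it is selected.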

Successful use of molecular replacement primarily depends on two factors: (1) the quality of the model of the protein available and (2) the nature of the crystal symmetry, packing, and diffraction quality. Generally, sequence identity of at least 35% between the model structure and the target protein is desirable, but what counts is the structural similarity between the two proteins, and that does not map directly to overall sequence identity. Two proteins with very low sequence identity may have very similar structures, while a hinge motion conformational change may make even a structure with 100% sequence identity to the target an unsuitable model. The number of insertions and deletions in one sequence relative to another is also a significant factor.

The best search model is usually a compromise between completeness and accuracy. A better signal is obtained by omitting regions of doubtful reliability, such as side chains, high temperature factor segments, and sites of possible insertions and deletions. Desirable properties of the crystal are a low symmetry space group, a high solvent content, and a low number of molecules in the asymmetric unit, along with accurately measured strong diffraction data.

Because of the experimental advantages, molecular replacement will be used to solve many structures of proteins related to the structural genomics set. The finer the granularity of the sampling of structural space, the more useful the technique will be. However, the downsides--problems of model bias and suboptimal crystal forms--will mean it is not a universal solution.

Model building by homology was covered by Andrej Sali (Rockefeller University), who discussed the reliability of comparative modeling in the context of structural and functional genomics. In general, mistakes in comparative modeling include side chain packing errors, small distortions and rigid body shifts in correctly aligned regions, errors in inserted regions (loops), incorrect alignments, and incorrect templates.

Fortunately, a 3-D model does not have to be perfect to be helpful in biology. The type of question that can be addressed with a particular model depends on its accuracy. A convenient and simple predictor of model accuracy is the percentage sequence identity to the template on which the model was based. This relationship was described for an automated comparative modeling procedure implemented in the computer program "Modeller." There are three accuracy classes for comparative models.

At the low end of the accuracy spectrum, there are models that are based on less than 30% sequence identity and sometimes have less than 50% of the C-alpha atoms within 3.5 Å of their correct positions. However, such models still have the correct fold, and even knowing only the fold of a protein is frequently sufficient to predict its approximate biochemical function.

In the middle of the accuracy spectrum are the models based on approximately 35% sequence identity, corresponding to 85% of the C-alpha atoms modeled within 3.5 Å of their correct positions. In such cases, it is frequently possible to predict correctly important features of the target protein that do not occur in the template structure. For example, the location of a binding site can be predicted from clusters of charged residues, and the size of a ligand can be predicted from the volume of the binding site cleft.

Another use of 3-D models is the detection of binding and active sites that cannot be found by searching for local sequence patterns; these should frequently be detectable by searching for small 3-D motifs that are known to bind or act on specific ligands. Medium-resolution models frequently allow a refinement of the functional prediction based on sequence alone because ligand binding is most directly determined by the structure of the binding site rather than its sequence.

Even when the conserved binding sites are present in the templates, comparative models can still add value to the sequence-based analysis. For example, they can be used to construct site-directed mutants with altered or destroyed binding capacity, which in turn could test hypotheses about the sequence-structure-function relationships.

Other problems that can be addressed with medium resolution comparative models include using molecular replacement in X-ray crystallography; designing proteins that have compact structures without long tails, loops, and exposed hydrophobic residues for better crystallization; or designing proteins with added disulfide bonds for extra stability.

The high end of the accuracy spectrum corresponds to models based on 50% sequence identity or more. The average accuracy of these models approaches that of low resolution X-ray structures (3 Å resolution) or medium resolution nuclear magnetic resonance (NMR) structures (ten distance restraints per residue). In addition to the already listed applications, high-quality models can be used for docking of small ligands into a protein or for docking of a protein to a protein.
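The three accuracy classes can be summarized as a small lookup, sketched here in Python. The cutoffs of 30% and 50% follow the figures quoted above; treating everything between them as "medium" is an assumption of this sketch, as are the function and table names. The listed uses paraphrase the applications named in the text.

```python
# Applications named in the report for each accuracy class.
MODEL_USES = {
    "low": ["fold assignment", "approximate biochemical function"],
    "medium": [
        "binding-site location",
        "design of site-directed mutants",
        "molecular replacement search models",
    ],
    "high": ["docking of small ligands", "protein-protein docking"],
}


def accuracy_class(percent_identity):
    """Map target-template sequence identity (0-100) to an accuracy class.

    Assumed boundaries: below 30% is low, 30% up to 50% is medium,
    50% and above is high.
    """
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must lie between 0 and 100")
    if percent_identity < 30:
        return "low"
    if percent_identity < 50:
        return "medium"
    return "high"


def expected_uses(percent_identity):
    """Return the uses listed for the class of a model at this identity."""
    return MODEL_USES[accuracy_class(percent_identity)]
```

Sequence identity is only a convenient predictor of accuracy; a real assessment would also consider alignment quality and the length of inserted regions.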

Most models that can currently be calculated for whole genomes are spread evenly between the low- and medium-accuracy classes. Large-scale comparative modeling opens new opportunities for tackling existing problems by virtue of providing many protein models from many genomes. One example is the selection of a target protein for which a drug needs to be developed. A good choice is a protein that is likely to have high ligand specificity. Large-scale modeling facilitates imposing the specificity filter in target selection by enabling a structural comparison of the ligand binding sites of many proteins, either human or from other organisms.

In summary, there is a sharp rise in information when a comparative model is based on more than 30% sequence identity and has no long insertions relative to the template. Thus, a useful sampling density in target selection for structural genomics would put each sequence within at least 30% sequence identity to a known structure. Even finer sampling is desired for more important families. This implies at least 10,000 targets for structural genomics.
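The sampling criterion above, in which every sequence lies within a chosen identity threshold of some solved structure, can be illustrated by a simple greedy selection in Python. This is a sketch only: it assumes that pairwise percent identities have already been computed by some alignment method, the names `pick_targets` and `toy_identity` are hypothetical, and the result depends on input order and is not guaranteed to be the smallest possible target set.

```python
def pick_targets(names, identity, threshold=30.0):
    """Greedily choose targets so every sequence is covered.

    names: sequence identifiers, visited in the order given.
    identity: function (a, b) -> percent identity between two sequences.
    A sequence is covered if it is at least `threshold` percent identical
    to an already chosen target; otherwise it becomes a new target.
    """
    targets = []
    for name in names:
        if not any(identity(name, t) >= threshold for t in targets):
            targets.append(name)
    return targets


# Toy pairwise identities (hypothetical values); unlisted pairs count as 0.
PAIRS = {
    frozenset(("A", "B")): 62.0,
    frozenset(("A", "C")): 18.0,
    frozenset(("B", "C")): 22.0,
    frozenset(("C", "D")): 41.0,
}


def toy_identity(a, b):
    return PAIRS.get(frozenset((a, b)), 0.0)


chosen = pick_targets(["A", "B", "C", "D"], toy_identity)
# chosen == ["A", "C"]: B is modeled from A, and D is modeled from C.
```

Raising the threshold gives finer sampling and more targets, which is the tradeoff between model quality and experimental effort.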

John Moult (Center for Advanced Research in Biotechnology, University of Maryland) discussed the current and emerging methods for deriving functional information from structure. The information derived from structural genomics is primarily about molecular function--the binding sites and their specificity, catalytic activities, and regulation by conformational changes and covalent modifications. This type of information provides one component of the complete array of data constituting functional genomics, with cellular and phenotypic function being provided by complementary techniques such as transcriptional mapping and genetics.

Molecular function can be understood at different levels of resolution. A low-resolution description will identify the class of function--the protein is an enzyme, or a DNA-binding protein, or a cell receptor, for example. At medium resolution, the primary function can be identified--it's a beta-lactamase enzyme, or a protease, or a DNA-binding protein that acts as a repressor, for example. At high resolution, the primary specificity is known, e.g., the protein is a beta-lactamase effective against penicillins rather than cephalosporins, or the repressor acts on the lac operon. Methods of deriving function from structure strive to get to as high a resolution as possible.

Goals include: (1) the assignment of function to "hypothetical" proteins and proteins with only a cellular or phenotypic function identified; (2) the assignment of functional differences within a sequence family; and (3) the interpretation of data on disease-associated single nucleotide polymorphisms (SNPs). Experience to date with assigning function to experimental structures of "hypotheticals" has been encouraging, with medium resolution descriptions usually obtained in a straightforward manner. SNP functional analysis and the assignment of function within sequence families usually depend on the quality of model that can be built, in addition to the functional analysis tools available. There are so far few examples of structures of proteins with SNPs analyzed, but it is expected that in the medium- to high-resolution range of homology modeling, useful information will be obtained.

For assignment of function within families, the critical factor becomes the quality of the model. Experience with testing of modeling methods at CASP (Critical Assessment of Techniques for Protein Structure Prediction) shows that "threading" methods are not yet developed to the point where they can reliably identify the fold a sequence adopts. Further, when the methods are successful, errors in the resulting models are generally too high to allow more than low resolution functional information to be derived. These results, and data from homology modeling tests, strongly suggest that to be effective, structural genomics must sample sequence space such that every remaining sequence is at least 30% and preferably 40% identical to that of a sampled structure.

There are a number of computational tools already developed that are useful for deducing function from structure, and more are on the way. Surface electrostatics, as displayed, e.g., by the program GRASP (A. Nicholls, K. Sharp, and B. Honig, Columbia University) can usually identify DNA and RNA binding sites, and occasionally, other functional features as well. Janet Thornton's group (University College, London) has shown that small ligand binding sites are almost always associated with the largest depressions in the surface of a protein, and these can be found automatically or by visual inspection. Examining the extent of residue conservation of a family of sequences displayed on the surface of a structure has emerged as a powerful method of finding more general functional features, particularly protein-protein interaction sites. Several groups are investigating the extent to which three-dimensional catalytic motifs can be catalogued and used to identify the catalytic function of new structures. Finally, the methods developed in drug design to identify potential lead compounds are expected to be applicable to deducing ligand-binding specificity.

Overall, results so far are most encouraging, and even without further improvement in modeling methods, large-scale structural genomics will provide a wealth of detailed functional information. Emerging techniques will provide additional tools for defining function. However, in many cases, other experimental data will be required to confirm hypotheses and to extend the functional description to high resolution.


Session 3: Sequence Family Organization

This session reviewed what is known about protein sequences organized into families and how properties of protein families can be used to prioritize families for structure determination and to pick one or more family representatives as structural targets.

A critical requirement for the selection of targets for structural genomics is the comprehensive organization of protein sequences into families. Key international research groups working on large-scale clustering of protein sequences contributed to the workshop by presenting their scientific methodology and by sharing their protein family collection on the Protein Structure Initiative (PSI) Web site (http://www.structuralgenomics.org).

The providers of organized protein families, family attributes, and pre-selected targets were (presenter in italics):

COGs
Yuri I. Wolf, Roman L. Tatusov, Michael Y. Galperin, Eugene V. Koonin--National Institutes of Health, Bethesda, Maryland

Pfam
Alex Bateman, Ewan Birney, Richard Durbin, Sean Eddy, Kevin Howe, Erik Sonnhammer--Sanger Center, Cambridge, UK; Washington University, St. Louis, Missouri; Karolinska Institute, Stockholm, Sweden

Picasso
Liisa Holm--EMBL-EBI, European Bioinformatics Institute, Cambridge, UK

PIR
Winona C. Barker, Geetha Y. Srinivasarao, Lai-Su Yeh, Cristopher R. Marzec, J.S. Garavelli, Friedhelm Pfeiffer, Cathy Wu, Hongzhan Huang--Georgetown University Medical Center, Washington, D.C.

ProClass
Cathy H. Wu, Hongzhan Huang--The University of Texas Health Center at Tyler, Texas

ProDom
Florence Servant, Jerome Gouzy, Florence Corpet, Daniel Kahn--INRA/CNRS, Toulouse, France

ProtoMap
Elon Portugaly, Nathan Linial, Golan Yona, Michal Linial--Hebrew University, Jerusalem, Israel

Systers
Antje Krause, Martin Vingron--German Cancer Research Center, Heidelberg, Germany

Groups used different methods for sequence alignment and clustering, resulting in different family collections organized at different levels of granularity. The collections are not strictly comparable, as different size protein sequence collections were used, as well as different criteria for family membership and different levels of automation, including extensive hand curation for some of the collections. Details of the methodology are available directly on the groups' Web sites, and an overview is on the PSI Web site at http://www.structuralgenomics.org.

The Human Genome Project and other sequencing projects have so far provided about 400,000 protein sequences, of which about 200,000 are unique at the level of 90% sequence identity. These sequences are partly derived from tens of complete microbial genomes, the complete C. elegans genome, as well as a rapidly growing fraction of all proteins from the human genome and other model organisms. Although we are far from complete coverage, this collection is an extremely rich source of targets for structural genomics.

The family collections and pre-selected targets submitted by the different groups (see the PSI Web site) give a first overview of the scope of the problem. A key issue is the overall scale of the effort. In spite of the incompleteness of current sequence collections and the differences in methodology, clear trends are apparent. Choosing different levels of granularity, i.e., imposing different quality criteria on models built by homology from representative, experimentally determined structures, leads to different numbers of family representatives as targets (Fig. 1).

Figure 1:

Total effort versus model quality. The graph can be used to estimate the relative savings that come from solving representative structures and then modeling their neighbors by homology, rather than solving all structures. (The band represents values available from different approaches to organizing various protein sequence collections--see below.) In choosing the desired level of savings, there is a tradeoff between model quality (increases toward the right) and savings (increases toward the left). A reasonable initial specification sets desired model quality at the minimal sequence agreement between model and template at 35% identical residues (indicated by a dot). Making several assumptions about the degree of family unification, this corresponds to solving the structure of about 1 in every 10-16 proteins (relative effort 0.1-0.06). How many structures does this correspond to in absolute terms? An effort of 1.0 corresponds to solving a 3-D structure for every known protein (currently about 200,000 after removal of one of every pair above 90% sequence identity); an effort of 0.06 therefore implies solving 12,000 structures. Over the next few years, one may expect that the total number of known protein sequences across important model organisms will rise 5- to 10-fold, to 1 to 2 million, while the level of family unification as a result of sequence comparison will probably also increase substantially--two opposing effects, with currently unknown numerical outcome on the estimated total effort for a comprehensive structural genomics project.

Details: Effort is defined as the ratio r = (number of structures solved experimentally)/(total number of protein sequences modeled in 3-D). Model quality is measured by the minimum required sequence agreement, in percent identical amino acids (% idAA), between modeled and experimental structure (see text); this measure can be translated to accuracy in atomic positions ("rmsd") using well-established methods. The graph is based on preliminary data from different protein family collections. Each collection starts from a different total number of sequences (equivalenced in this figure at an effort of 1.0, for comparison) and uses a different method of family unification (apparent from the width of the band). Details of the family collections are at http://www.structuralgenomics.org and at the groups' own Web pages. Graph compiled by D. Vitkup, E. Melamud, J. Moult, and C. Sander (unpublished).

For reasonable parameters, the total effort in "number of structures solved to cover all currently known protein families" is on the order of 12,000 structures (for a homology building 'distance' corresponding to 35% identical residues between model and template).
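The arithmetic behind this estimate can be reproduced with a short Python sketch. The effort ratio and the figure of about 200,000 non-redundant sequences come from the report; the function names are illustrative assumptions.

```python
def relative_effort(n_solved, n_modeled):
    """Effort ratio r = structures solved experimentally / sequences modeled."""
    return n_solved / n_modeled


def structures_needed(effort, n_sequences):
    """Absolute number of experimental structures implied by an effort ratio."""
    return round(effort * n_sequences)


NON_REDUNDANT_SEQUENCES = 200_000  # unique at 90% identity, per the report

# One structure per 10 proteins versus one per 16 proteins.
upper = structures_needed(relative_effort(1, 10), NON_REDUNDANT_SEQUENCES)
lower = structures_needed(relative_effort(1, 16), NON_REDUNDANT_SEQUENCES)

# The rounded effort of 0.06 quoted in the report.
quoted = structures_needed(0.06, NON_REDUNDANT_SEQUENCES)
```

With these inputs, `upper` is 20,000, `lower` is 12,500, and `quoted` is 12,000, the figure cited in the text. The same functions could be reapplied once the expected growth in sequence collections and the degree of family unification are better known.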

Although there are unresolved issues with respect to special protein classes, such as the difficulty of crystallization of membrane proteins, the scale of the problem of large-scale structural genomics is emerging in quantitative detail. Most of the human genome sequence is likely to become available next year. Consequently, the sequence analysis research groups will probably be in a position to determine the definitive number of target structures needed for exhaustive coverage of all human protein families by the middle of the year 2000.


Session 4: Current and Planned Structural Genomics Initiatives

This session provided an overview of ongoing and planned structural genomics projects in the United States, Europe, and Japan.

The structural genomics initiatives fall into two parts: "pilot projects" underway in various groups in the United States and developing projects elsewhere in the world. Six U.S. pilot projects were represented. David Eisenberg (UCLA) described ongoing work on the aerobic archaeon Pyrobaculum aerophilum. The experimental part of this project is focused on assigning folds to gene products and determining structures of medically relevant proteins. Medical relevance is identified by homology with protein sequences in the National Center for Biotechnology Information's Online Mendelian Inheritance in Man database, and a surprising 24% of open reading frames are so related. Theoretical aspects of the project focus on predicting which open reading frames represent new folds, the prediction of protein-protein interactions, and pathway identification.

Sung-Hou Kim (UC Berkeley) described work in his group on the hyperthermophilic archaeon Methanococcus jannaschii. The advantages of this organism are its "deep rooted" position in the phylogenetic tree of organisms and the relative ease of protein purification and crystallization from hyperthermophiles.

Bill Studier (Brookhaven National Laboratory) described work on a joint BNL/Rockefeller project focused on proteins from yeast. Yeast was chosen because of the high proportion of human-related proteins and the possibility of involving the large community of scientists already working in this area. The initial emphasis is on testing the feasibility of high-throughput methods.

Guy Montelione (Rutgers University) described the only U.S. project placing more emphasis on NMR than on X-ray crystallography as the primary experimental method of structure determination. Targets are selected on the basis of relevance to human disease, broadly conserved metazoan genes, and genes of human pathogens. Rather than focusing on one organism, a set of homologous proteins from different sources is expressed.

Paul Bash (Northwestern University) described a project aimed at broad coverage of structure space, particularly obtaining a "basis set" of protein folds spanning structure and sequence space and a set of structures with a wide phylogenetic distribution--in some sense universal proteins. Goals are discovering new folds, providing a set of structures to be used for molecular replacement in crystallography, and obtaining an understanding of protein evolution. The group has contracted out the production of protein and finds this a satisfactory and economic solution for that stage of the process.

John Moult (CARB, University of Maryland) reported on a joint CARB/TIGR project to obtain structures for the unannotated open reading frames from Haemophilus influenzae (HI). HI was chosen because it is the smallest free-living genome available. The primary goal of the project is to directly and indirectly obtain information about the function of these "hypothetical" proteins, and thus, speed discovery of the complete biochemical processes in a simple cell. To this end, an important part of the project is to make the information and materials obtained available to a wide community of experimentalists.

A total of approximately 12 structures were reported as already solved in these pilot projects. Analysis of these structures is generally still in progress, but for the proteins where the function was previously unknown, some functional insight has been easily obtained in all but one instance. Many more proteins are in the pipeline, and we can expect a flood of structures in the next year. All groups reported an encouraging level of success with expression, purification, and crystallization. Although it is still early to draw definite conclusions, results from these pilot projects are very promising. It was apparent that each group has taken a quite different view of how to choose targets.

On the international landscape, Udo Heinemann (Max Delbrueck Center, Berlin) described the "Protein Structure Factory." This is a consortium of groups in the Berlin region, funded by the German Ministry for Research (BMBF) with approximately $20 million over 5 years, and an expectation of producing at least 100 structures. Among others, thermostable proteins will be studied, primarily selected to represent new folds, antibiotic targets, or translation components. Proteins will be expressed in E. coli, and there will be a strong emphasis on automation and parallel processing at all stages--particularly for protein production and crystallization. Dedicated beam facilities will be built at the BESSY II synchrotron in Berlin.

Shigeyuki Yokoyama (University of Tokyo) described three Japanese initiatives in structural genomics. Most advanced is a very large-scale investment in NMR facilities to be partly used for determining the structure of proteins from mice. This project is associated with the Genomic Sciences Center (GSC) established by the Institute of Physical and Chemical Research (RIKEN) in Yokohama, Japan. In addition to human genome sequencing, the effort includes a mouse genome project. Currently, 20,000 mouse clones are available, and proteins are being expressed in a novel cell-free system, making possible the low-cost production of the isotope-labeled material needed for NMR. It is expected that approximately 100 structures a year will be obtained.

A second project will focus on full functional and structural characterization of the genes from a thermophilic organism, T. thermophilus, currently being sequenced. The aim is also to obtain approximately 100 structures a year, primarily by X-ray crystallography at the new SPring-8 synchrotron.

Finally, a third planned project would determine structures of "hypothetical" proteins from the anaerobic thermophilic organism Pyrococcus horikoshii.

Tom Blundell (Cambridge, UK) outlined the Structural Biology Industrial Platform (SBIP), a consortium of 15 companies, including some of Europe's largest pharmaceutical companies. Members of this organization will work with each other, the European Commission, and research centers in Europe to promote structural biology training and development. Its activities will include structural genomics. Philippe de Taxis du Poet of the European Commission described the expected plan for "Framework 5" European funding, where approved projects will have demonstrable potential industrial or societal benefits. It is likely that several groupings in different European countries, including academic and industrial researchers, will put forward structural genomics proposals under the European Union funding program.

Marvin Cassman (NIGMS) asked the workshop participants how they envisioned structural genomics research changing in response to the initiation of an NIGMS support program, since research in the field is already progressing rapidly. He pointed out that several workshops and meetings have taken place over the last year, including two sponsored by NIGMS. As a result of the most recent one (the NIGMS planning meeting in November 1998), the Institute plans to announce a national support program for research centers to serve as pilots for a subsequent large-scale program. The pilots will have three goals: (1) to develop technology and to establish the milestones critical for a high-throughput, large-scale structural genomics project; (2) to develop target selection schemes; and (3) to provide the preliminary results necessary to establish the importance and relevance of this initiative. He asked the group to discuss the scientific benefits of structural genomics research projects and the need for a concerted effort.

Barbara Skene (Wellcome Trust, UK) described the process by which agreement was reached on the organization of the Human Genome Project, particularly a Wellcome Trust "blueprint for the principles of cooperation." This document may act as a model for the needed agreement in structural genomics, particularly with regard to the organization of who does what and the speedy release of information. David Eisenberg (UCLA) emphasized that it is essential to establish principles for data release early in the development of structural genomics.


Session 5: Protein Structure Initiative Web Site

This session provided a demonstration of the PSI Web site, including search facilities for family prioritization and target selection based on user-supplied criteria and a service for registering ongoing structural work in research laboratories. The site was commissioned by NIGMS and was built by D. Vitkup, E. Melamud, J. Moult, and C. Sander. The PSI Web site's mission is to help the structural community select interesting and important targets, foster cooperation between structural and molecular biologists, and assist funding agencies in prioritization of structure determination projects.

The PSI Web site is designed to facilitate the steps needed for efficient coverage of protein structural space. A typical experimental effort would need to:

  • Organize known protein sequences into families
  • Select family representatives as targets
  • Solve the 3-D structures of targets by X-ray crystallography or NMR spectroscopy
  • Build models for other proteins by homology to solved 3-D structures
  • Be aware of ongoing work in other groups so as to avoid duplication of effort
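As a rough illustration of the first, second, and fourth steps above, the following Python sketch picks one representative from each family that has no solved member and then counts how many sequences would have a structure or a homology template once those targets are solved. The Family class, the "first member" selection rule, and the sample identifiers are assumptions made for this example; they are not taken from the PSI Web site.

```python
from dataclasses import dataclass, field


@dataclass
class Family:
    """A sequence family. All names and fields are illustrative assumptions."""
    name: str
    members: list                                  # sequence identifiers
    solved: set = field(default_factory=set)       # members with a known 3-D structure


def pick_targets(families):
    """Choose one representative for every family that has no solved member."""
    targets = {}
    for fam in families:
        if not fam.solved and fam.members:
            # Placeholder rule: take the first member. A real scheme would
            # weigh organism, sequence length, expression prospects, etc.
            targets[fam.name] = fam.members[0]
    return targets


def modelable_after(families, targets):
    """Count sequences with a structure or homology template once targets are solved."""
    count = 0
    for fam in families:
        if fam.solved or fam.name in targets:
            count += len(fam.members)
    return count


# Hypothetical toy data
families = [
    Family("famA", ["a1", "a2", "a3"]),
    Family("famB", ["b1", "b2"], solved={"b1"}),
    Family("famC", ["c1"]),
]
targets = pick_targets(families)
covered = modelable_after(families, targets)
```

In this toy data set, famB already has a solved member, so only famA and famC receive a target.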

The primary data currently (early 1999) on the site are provided by the eight sequence family groups listed in Section 3. In addition to sequence family sets, each group has provided a set of 35 suggested targets. The sequence family data are organized into a relational database. Users can search the database of sequence families and suggested targets using different sets of protein family attributes, and targets can be selected from chosen families, by, for example, organism. The currently available search criteria are:

  • Large protein families
  • Families with a predefined limit on sequence length
  • Families with representatives in a minimum number of species
  • Families with a human member
  • Families without a member of known structure
  • Non-transmembrane families
  • Families with representatives in all three main domains of life (bacteria, archaea, eukaryotes)
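A search combining several of these criteria might look like the following Python sketch. It is only an illustration: the report does not describe the site's database schema or query interface, so the FamilyRecord fields, the search function, and its parameter names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FamilyRecord:
    """One sequence family. Field names are assumptions for this sketch."""
    name: str
    size: int                   # number of member sequences
    max_length: int             # length of the longest member, in residues
    species: set                # species with at least one member
    has_known_structure: bool   # True if any member has a solved structure
    is_transmembrane: bool
    domains_represented: int    # how many of the three domains of life (0-3)


def search(families, min_size=None, max_length=None, min_species=None,
           require_human=False, exclude_known_structure=False,
           exclude_transmembrane=False, require_all_domains=False):
    """Return the families that satisfy every criterion that was supplied."""
    hits = []
    for f in families:
        if min_size is not None and f.size < min_size:
            continue
        if max_length is not None and f.max_length > max_length:
            continue
        if min_species is not None and len(f.species) < min_species:
            continue
        if require_human and "Homo sapiens" not in f.species:
            continue
        if exclude_known_structure and f.has_known_structure:
            continue
        if exclude_transmembrane and f.is_transmembrane:
            continue
        if require_all_domains and f.domains_represented < 3:
            continue
        hits.append(f)
    return hits


# Hypothetical toy records
families = [
    FamilyRecord("fam1", 120, 300,
                 {"Homo sapiens", "Escherichia coli", "Methanococcus jannaschii"},
                 False, False, 3),
    FamilyRecord("fam2", 8, 900, {"Escherichia coli"}, False, True, 1),
    FamilyRecord("fam3", 45, 250, {"Homo sapiens", "Mus musculus"}, True, False, 1),
]
hits = search(families, min_size=20, require_human=True,
              exclude_known_structure=True, exclude_transmembrane=True)
hit_names = [f.name for f in hits]
```

Each criterion is optional, so a user can apply any subset of them; omitted criteria simply do not filter.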

These criteria can be used together with a prioritization scheme to query protein space for potentially important targets consistent with a researcher's interests. For each selected target, a "Sequence Status" page provides information on the target, including functional annotation (from the SWISS-PROT and PIR databases), presence of the target in protein families (from different family classification schemes), and progress in structure determination of the target by any registered group.

In addition to selection of structural targets, the other important goal of the Web site is to promote cooperation between groups working in the area of structural genomics. The Web site provides functionality for experimental groups and individual researchers to register and create "group" and "personal" pages.

After registration on the PSI Web site, an experimental group can use the group page to record progress in structure determination of their targets. An interface is provided to change the status of experimental work on a sequence from cloning to X-ray/NMR data collection to structure determination and to add notes for each of these stages. Additionally, a user will be alerted if a sequence target they have selected or a homolog (a member of the same sequence family) is currently under investigation by another group.
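The status tracking and alert behavior described above can be sketched in Python as follows. This is a minimal model under stated assumptions, not the site's implementation: the Stage names follow the three stages mentioned in the text, while the Registry class, its methods, and the group and sequence identifiers are hypothetical.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Stages named in the text; the numeric ordering is an assumption."""
    CLONING = 1
    DATA_COLLECTION = 2          # X-ray or NMR data collection
    STRUCTURE_DETERMINATION = 3


class Registry:
    """Hypothetical in-memory record of which group works on which sequence."""

    def __init__(self, family_of):
        self.family_of = family_of   # sequence id -> family name
        self.status = {}             # (group, sequence id) -> (Stage, note)

    def update(self, group, seq_id, stage, note=""):
        """Record or change the stage of a group's work on a sequence."""
        self.status[(group, seq_id)] = (stage, note)

    def conflicts(self, group, seq_id):
        """List other groups' work on seq_id or on a homolog in the same family."""
        fam = self.family_of.get(seq_id)
        found = []
        for (other, other_seq), (stage, _note) in self.status.items():
            if other == group:
                continue
            same_seq = other_seq == seq_id
            same_family = fam is not None and self.family_of.get(other_seq) == fam
            if same_seq or same_family:
                found.append((other, other_seq, stage.name))
        return found


# Hypothetical usage
family_of = {"seqA1": "fam1", "seqA2": "fam1", "seqB1": "fam2"}
reg = Registry(family_of)
reg.update("group_x", "seqA1", Stage.CLONING, "construct ordered")
reg.update("group_y", "seqB1", Stage.DATA_COLLECTION)

alerts = reg.conflicts("group_z", "seqA2")      # homolog of seqA1
no_alerts = reg.conflicts("group_x", "seqA1")   # own work is not a conflict
```

Here group_z is alerted because seqA2 belongs to the same family as seqA1, which group_x has already registered.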

Individual researchers can use their personal pages to define lists of sequences they are interested in and routinely track progress for those sequences.

Plans are in place to further develop the target selection and tracking Web site into a community resource for structural genomics by:

  • Providing a mechanism for regular updates for groups participating in providing family collections
  • Providing additional criteria for target selection (medical relevance, novel folds, etc.)
  • Providing access to structural models
  • Providing a mechanism for user annotation

Another useful resource available on the Web for structural genomics is http://presage.berkeley.edu, authored by Steven Brenner, Derren Barken, and Michael Levitt.


Session 6: Coordination and Cooperation

This session addressed the opportunities for open exchange of information and coordination of structural genomics efforts, both in the United States and on a global scale.

Participants involved in shaping projects or funding programs gave the clear impression of an impending increase in overall effort starting in 1999. NIGMS is considering the formation of a Protein Structure Initiative; sizable funds have been committed in Japan and Germany; and both the Commission of the European Union and the UK-based Wellcome Trust may develop funding initiatives in support of structural genomics. There is also a clear tendency toward larger projects on a scale involving several research groups or the formation of larger centers. Overall, the number of structures solved by the new initiatives and by the many existing structural biology research laboratories is expected to begin to increase significantly in 1999.

Part of the driving force for structural genomics comes from rapid progress in genome sequencing projects. There is no shortage of targets for structure determination. Computational biologists have sufficient tools in hand to organize all new sequence information, provide criteria for reasonable prioritization, and build 3-D models by homology to multiply the impact of solving individual structures.

Participants in the workshop for the most part expressed a strong desire for coordination at least on the level of active information exchange about ongoing initiatives and about specific target lists worked on in the groups' laboratories. At the same time, exchange of information would not constrain projects or project groupings in scientific focus and target selection. The overall benefit of avoiding duplication of effort is widely recognized.

Many participants also spoke up in favor of free sharing of solved protein structures by depositing the coordinate data sets and related data in publicly accessible databases, in particular, the Protein Data Bank, which for several decades has been the internationally recognized primary repository of protein structure information. Although the workshop was not a platform for policy recommendations, there was strong encouragement for free publication of protein structure data and prompt release of coordinates. Both journal editors and funding agencies are, in principle, in a position to promote such open, international exchange.

As was pointed out by several participants, the Bermuda agreement reached by the human genome sequencing community is a possible model for international cooperation, for avoiding duplication of effort, and for open sharing of scientific information. Suggestions were made to reconvene later in 1999 or early in 2000 to draft a similar cooperation agreement or a set of recommendations for structural genomics.

The year 1999 is likely to end with structural genomics clearly visible in the international scientific arena. The workshop ended with the explicit hope that within the next 5 years, up to 10,000 key structures would be solved covering a significant fraction of protein families from key organisms, including Homo sapiens sapiens.


Participants, Observers, and Federal Agency Staff in Attendance

Invited Participants:

John Moult--Center for Advanced Research in Biotechnology,
     University of Maryland (co-chair)
Chris Sander--Millennium Predictive Medicine, Inc., and
     Massachusetts Institute of Technology Center for Genome Research (co-chair)
Winona Barker--Protein Information Resource, National
     Biomedical Research Foundation
Paul Bash--Northwestern University Medical School
Tom Blundell--University of Cambridge, United Kingdom
Stephen Burley--Rockefeller University
David Clayton--Howard Hughes Medical Institute
Philippe de Taxis du Poet--European Commission, Bruxelles,
     Belgium
Richard Durbin--Sanger Centre, United Kingdom
David Eisenberg--University of California at Los Angeles
Leigh English--Monsanto Company
Paula Fitzgerald--Merck Research Laboratories
Udo Heinemann--Max-Delbrueck-Centrum fuer Molekulare
     Medizin, Germany
Wayne Hendrickson--Columbia University
Wim Hol--University of Washington
Liisa Holm--European Bioinformatics Institute, European
     Molecular Biology Laboratory, United Kingdom
Barry Honig--Columbia University
Daniel Kahn--Laboratoire de Biologie Moleculaire des
     Relations Plantes-Microorganismes, France
Sung-Hou Kim--University of California at Berkeley
Eugene Koonin--National Center for Biotechnology Information,
     National Institutes of Health
Eaton Lattman--Johns Hopkins University
Michal Linial--Hebrew University of Jerusalem, Israel
Gaetano Montelione--Rutgers University
Manuel Navia--Althexis Company
Gregory Petsko--Brandeis University
Andrej Sali--Rockefeller University
Barbara Skene--Wellcome Trust, United Kingdom
William Studier--Brookhaven National Laboratory
Tom Terwilliger--Los Alamos National Laboratory
Janet Thornton--University College, London, United Kingdom
Martin Vingron--German Cancer Research Center (DKFZ),
     Heidelberg, Germany
Dennis Vitkup--Protein Structure Initiative Web Site
     Manager, Massachusetts Institute of Technology Genome Center
Cathy Wu--University of Texas Health Center at Tyler
Shigeyuki Yokoyama--University of Tokyo, Japan

Observers:

Willa Appel--New York City Partnership Policy Center
Steve Anderson--Rutgers University
Geoff Barton--European Molecular Biology Laboratory, European
     Bioinformatics Institute, United Kingdom
Helen Berman--Rutgers University
Steven Brenner--Stanford University
Stephen Bryant--National Center for Biotechnology Information,
     National Library of Medicine, National Institutes of Health
Andrew Coulson--University of Edinburgh, Scotland
George DeTitta--Hauptman-Woodward Medical Research Institute
Terry Gaasterland--Rockefeller University
Michael Gribskov--University of California, San Diego
Osnat Herzberg--Center for Advanced Research in Biotechnology,
     University of Maryland
Andrzej Joachimiak--Argonne National Laboratory
Yutaka Kawarabayasi--National Institute of Technology and Evaluation,
     Japan
Seiki Kuramitsu--Osaka University, Japan
Robert Ledley--Georgetown University Medical Center
Eugene Melamud--Center for Advanced Research in Biotechnology,
     University of Maryland

Federal Agency Staff:

NIH Center for Scientific Review:
     Marjam Behar
     Nancy Lamontagne
     Elliot Postow
     Arnold Revzin
     Don Schneider

Department of Energy:
     Charles Edmonds
     David Thomassen
     John Wooley

National Aeronautics and Space Administration:
     Steve Davison
     J. Patton Downey

National Cancer Institute:
     John Beisler

National Center for Research Resources:
     Karl Koehler
     Marjorie Tingle

National Human Genome Research Institute:
     Elise Feingold
     Adam Felsenfeld

National Institute of General Medical Sciences:
     Jim Cassatt
     Marvin Cassman
     Jean Chin
     Jim Deatherage
     Warren Jones
     Cathy Lewis
     Alisa Machalek
     John Norvell
     Peter Preusch
     Sue Shafer
     Bert Shapiro
     Janna Wehrle
     Marion Zatz

National Science Foundation:
     Paul Gilna
     Lee Makowski
     Kamal Shukla