The overarching goal of the Protein Structure Initiative (PSI) is to make three-dimensional (3D) atomic-level structures of most proteins readily obtainable from knowledge of their corresponding DNA sequences. The primary goal of the second stage of the PSI (PSI-2) is large-scale structure determination to maximize the coverage of protein sequences by structural information. This will be done by experimental determination of an estimated 3,000 structures. The process of target selection will undergo continuous evaluation and refinement as our understanding of protein sequence-structure relationships evolves. PSI-2 will primarily select targets from protein sequence families (so-called BIG families) that are structurally uncharacterized. Very large protein families with enormous phylogenetic variation but limited structural coverage (so-called MEGA families) will serve as the source of additional target sets, in order to explore the evolution of structural and functional diversity. New metagenomics data (META families) from large communities of organisms, such as the human gut microbiome with its implications for human disease, or from the environment, e.g. Global Ocean Sampling (GOS), will constitute an important source of novel targets for PSI-2. The biomedical targets and community-proposed targets represent an essential component of each Center's target selection. The coverage of certain model organisms (both prokaryotes and eukaryotes) may also be considered in target selection. In addition to providing ~3,000 new experimental structures, PSI-2 will make available even broader structural coverage of protein sequence space (i.e., leverage) using computational homology modeling to provide structural information for the ever-expanding database of related protein sequences.
The main goal of target selection in PSI-2 is to coarsely sample large protein families (Pfam and other families) with no structural representatives in PDB for broad structural coverage. The current goals include:
In order to optimize target selection and to maximize the biological and biomedical insights that can be derived from the PSI-2 efforts we will evaluate structural coverage of certain genomes or groups of targets (human-disease related microbiomes and metagenomes).
Protein sequence families targeted by PSI-2 will be prioritized on the basis of size, genome coverage, homology modeling coverage, and perceived biological and biomedical relevance.
Through the advice and participation of the PSI-2 Standing Subcommittee on Target Selection, a centralized target selection mechanism has been implemented within the four Large-Scale Research Centers to ensure target selection consistent with the goals of PSI-2. Structurally uncharacterized (or inadequately characterized) families are identified using well-established protocols developed among the four Centers. Initial rankings are based on family size (total number of sequences) and on family diversity (reflecting structural and functional complexity of the family plus consideration of the number of structures required to model most if not all proteins within the family). Furthermore, targets or target families that are deemed more appropriate for investigation by the six Specialized Centers are identified so that technologies and methodologies for these important classes of proteins (e.g. membrane proteins, protein-protein complexes, eukaryotic proteins, and recalcitrant proteins) can be developed and transferred to the general structural biology community.
The selection of target families is organized and managed jointly by the four Production Centers with the aims of:
Each Center prioritizes targets according to criteria that reflect their individual scientific interests and technical capabilities, including:
A given target family is, in most cases, assigned to only one Center. Each Center is allocated an approximately equal number of target families (or subfamilies), which are distributed using a rotating draft pick process (or equivalent strategy) from a consensus list of target families. Target families (BIG families, MEGA families, META families) are assigned and constantly re-evaluated from the well-characterized ensemble of Pfam, BIG and other protein sequence families as they become available from the genome sequencing efforts. Additional target families will be selected from other ensembles of protein sequence families identified on a consensus basis by bioinformatics staff members from each Center. Following assignment of a new target family, each Center is responsible for applying its own methodologies for selecting individual candidates for experimental structure determination from within each targeted protein domain sequence family. Sets of very large, diverse and biologically/medically important protein sequence families (MEGA and META families) are being analyzed for PSI-2 structural studies. Each Center is also responsible for applying its own methodologies for selecting individual targets for structure determination of biomedical targets and community targets.
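The rotating draft pick process is not specified in detail here, so the following Python sketch shows only one plausible reading of it. It assumes that the consensus list is already ranked, that each Center takes the highest-ranked family still available when its turn comes, and that the Center picking first moves to the back of the order after every round. The function name, the Center labels and the family identifiers are all hypothetical.

```python
from collections import deque


def rotating_draft(centers, ranked_families):
    """Assign each family to exactly one center.

    Assumption: the pick order rotates by one position after every round,
    and each center simply takes the top remaining family on its turn.
    """
    assignments = {center: [] for center in centers}
    order = deque(centers)
    remaining = deque(ranked_families)
    while remaining:
        for center in list(order):
            if not remaining:
                break
            assignments[center].append(remaining.popleft())
        # The center that picked first in this round picks last in the next.
        order.rotate(-1)
    return assignments


# Hypothetical inputs: four centers and ten placeholder family identifiers.
centers = ["Center A", "Center B", "Center C", "Center D"]
families = [f"FAM{i:02d}" for i in range(1, 11)]
result = rotating_draft(centers, families)
for center, picked in result.items():
    print(center, picked)
```

Under these assumptions every family goes to a single Center and the allocation counts differ by at most one, which matches the stated aim of an approximately equal distribution. In practice each Center would presumably choose according to its own interests and capabilities rather than strictly by rank.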
The PSI and each Center provide weekly updates to TargetDB, the public database of selected families and structure determination candidates, and deposit protocols for protein sample production into PepcDB, the public repository of PSI-2 structure determination results.
The Target Selection Subcommittee has recommended a well-defined set of milestones and deliverables which have been incorporated in the PSI-2 Goals and Milestones Statement. It is further recommended that each Center should contribute a peer-reviewed publication summarizing their progress and highlighting biological/biomedical as well as technological/methodological contributions. These annual publications will provide a facile means of communicating overall progress plus important results, concepts, strategies, and ideas to the public. Rigorous assessments of the achievements of the entire PSI initiative should be undertaken regularly and published, to the extent possible, along with the annual publications of all Centers. Such publications would serve an important goal of providing citations for the activities of the PSI as a whole and those of each Center, and for structures that would otherwise not be the focus of a peer-reviewed publication.
The impact of PSI-2 may be assessed on an annual basis using quantitative criteria that include:
Although primarily focused on high-throughput experimental protein structure determination and methodology, contributions from PSI-2 Centers are distinct from those of conventional structural biology in that they also make extensive data and substantial resources available to the general scientific community. It is imperative, therefore, when publicizing PSI-2 that special efforts be devoted to highlighting all such resources that have been made available to and accessed by the community, including: expression vectors; expression clones; protein expression/purification protocols; purified proteins; experimentally determined protein structures; homology models computed from PSI-2 structures; homology modeling techniques; advances in laboratory information management systems; computer programs; robotics; gene cloning and protein expression/purification methodologies; crystallization strategies and protocols; experimental data sets for methods developers; comprehensive positive and negative data for data mining; and X-ray crystallography and solution NMR structure determination methods.
Steering Subcommittee on Target Selection
Chair: Andrzej Joachimiak
Members: Guy Montelione, Ian Wilson, Stephen Burley, Andras Fiser, Adam Godzik, Christine Orengo, Burkhard Rost, Jerry Li
Advisors: David Baker, Steve Brenner