PSI Data Management Workshop

July 10-11, 2003

The National Institute of General Medical Sciences organized the first workshop on data management for the Protein Structure Initiative (PSI).  The workshop, held on July 10 – 11, 2003 at NIH, brought together experts in the fields of bioinformatics and biological data management from the nine PSI research centers, other structural genomics research laboratories, structural genomics companies, genome sequencing centers, and the Protein Data Bank.  The goal of the workshop was to promote collaboration and resource/knowledge sharing among the various groups and to explore the feasibility of centralizing and standardizing some of the data collected by the structural genomics centers.  The workshop consisted of three major components:  reports from representatives of the nine PSI centers, invited talks from representatives of other research centers, and topic discussions.  The workshop agenda, presentation abstracts, and discussion summaries can be accessed from the PSI webpage.

At the workshop, the impressive progress made by the PSI centers during their first two years of operation was apparent.  The data management effort had evolved from simple data collection and storage to a comprehensive effort that includes information integration and mining, target selection and prioritization, experiment design and tracking, automated data collection and processing, and automated dissemination and report generation.  Data management is becoming a central part of the PSI centers, both for strategic decision making and for day-to-day research operation.  Many of the centers have been putting significant resources and personnel, up to a third of the total budget, into their data management components, in recognition of the significance of data management for the overall success of their centers.  Some centers argued that the investment in data management should be kept at a more modest level to avoid compromising the experimental effort.  It will be interesting to see how these two approaches distinguish themselves in the next few years.

During the discussion sessions, the discussants raised a number of topics.  Informatics experts raised the difficulty of collecting detailed experimental procedures and output information from the experimentalists.  Others described their solutions, which included extensive bar coding, remote access, and automated data entry and collection with minimal human intervention.  The merits of centralization versus decentralization were also debated.  Many participants argued that a centralized monolithic database for the entire PSI could provide efficiency and uniformity, but would be less able to meet all the needs of the various centers, which work on different systems using different approaches.  A federated architecture that allows local databases to be fully tailored to the needs of particular centers, while enforcing communication with the central repository to enable data mining across centers, would be a preferable alternative.  This discussion brought up the issue of standardization for data exchange and communication.  The TargetDB at PDB was constructed initially to minimize overlap of effort among the PSI centers on similar targets.  It has turned out to be very useful and has attracted many other structural genomics laboratories around the world to make their own contributions to it.  The workshop participants recommended expanding the TargetDB into a central repository for the protein expression and crystallization data collected by the structural genomics labs, to allow creative mining by the scientific community.  A planning committee has been established to formulate the data definition, standards, and requirements for this central data repository.

The data management workshop accomplished its goals through the active participation of everyone who attended the meeting.  It provided a good avenue for productive information exchange.  As one workshop participant told the NIGMS staff, it was an “exciting and informative workshop”.

Workshop Agenda

Thursday, July 10
8:30   Opening Remarks
John Norvell, Director, Protein Structure Initiative
8:40     Session I:  PSI Center Presentation I
Session Chair: Andrej Sali, UC San Francisco

8:40     Berkeley Structural Genomics Center
9:00     Center for Eukaryotic Structural Genomics
9:20     Joint Center for Structural Genomics
9:40     Midwest Center for Structural Genomics
10:00   New York Structural GenomiX Research Consortium
10:20   Coffee Break

10:40   Session II:  PSI Center Presentation II
Session Chair: Helen Berman, Protein Data Bank

10:40    Northeast Structural Genomics Consortium
11:00    Southeast Collaboratory for Structural Genomics
11:20    Structural Genomics of Pathogenic Protozoa Consortium
11:40    TB Structural Genomics Consortium
12:00    General questions for the center presentations
12:30    Lunch
1:30     Session III:  Invited Talk I
Session Chair: Paul Adams, Lawrence Berkeley Laboratory

1:30     The Protein Data Bank and Data Management
Helen Berman, Rutgers University

2:00     BioMagResBank (BMRB)
Eldon Ulrich, University of Wisconsin-Madison

2:20     Dynamic Genomic Database Management:
Strategies for Managing an Evolving Database Structure
Scott Smith, Washington Univ. Genome Sequencing Center

2:50     E-HTPX and SPINE Data Model for Protein Production
Kim Henrick, European Bioinformatics Institute

3:20     Coffee Break

3:50     Session IV:  Invited Talk II
Session Chair: Eldon Ulrich, U. Wisconsin-Madison

3:50     Data Management in a High-Throughput, Science-Based Genome Center
Steven Salzberg, TIGR

4:20     Harvesting, Tracking, and Supplying Key Data for Structure Based Databases
Howard Hackworth & Jorg Hendle, Structural GenomiX

4:50     Data Management for Target Selection, Protein
Production and Crystallization at the OPPF
Robert Esnouf, Oxford University

5:20     Adjourn 
6:30     Dinner (Faryab Restaurant, Bethesda)

Friday, July 11

8:30     Session V:  Data Management Topic Discussion
Session Chair: Charles Edmonds, NIGMS

8:30     Data Content & Requirements
Helen Berman / Paul Adams

9:20     Data Acquisition  
Wladek Minor / Eldon Ulrich
10:10   Coffee Break

10:30   Database Functionality, Structure, & Tools
Andrej Sali / Mark Gerstein

11:20   Data Dissemination, Exchange, & Publication 
Adam Godzik / Wladek Minor
12:10   Lunch 

1:10  Discussion on Target Selection
Adam Godzik / John Norvell

2:00     Summary of Topic Discussion and Action Plan  
Session Chair:  Mark Gerstein

3:15     Workshop Adjourn

Abstracts and Summaries

Berkeley Structural Genomics Center:
Target Selection, Unselection, and Data Management

The goal of the BSGC is to obtain a near-complete structural complement of two minimal genomes, Mycoplasma pneumoniae and Mycoplasma genitalium.  Mycoplasma proteins of unknown structure which are predicted to be experimentally feasible are selected as targets, along with similar proteins from other bacteria.  At weekly intervals, sequences of newly solved structures from outside our center are compared with our targets.  This may result in our target being "unselected", if continuing to work on solving the structure would result in duplication of effort.  Data management issues, including integration of the core LIMS with other databases, will also be discussed.
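
The weekly "unselection" check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the BSGC's actual pipeline: the function names, the 0.9 threshold, and the example sequences are hypothetical, and Python's difflib ratio is used as a crude stand-in for a real sequence-similarity search.

```python
from difflib import SequenceMatcher


def similarity(seq_a: str, seq_b: str) -> float:
    """Crude stand-in for a sequence-similarity score in [0, 1].

    Assumption: a real pipeline would use a dedicated alignment tool;
    difflib is used here only to keep the sketch self-contained.
    """
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def unselect_targets(
    targets: dict[str, str],
    new_structures: list[str],
    threshold: float = 0.9,  # hypothetical cutoff, not from the abstract
) -> dict[str, str]:
    """Return "selected" or "unselected" for each target ID.

    A target is unselected when any newly solved sequence from outside
    the center is at least `threshold` similar to it, since continuing
    would duplicate effort.
    """
    status = {}
    for target_id, target_seq in targets.items():
        duplicated = any(
            similarity(target_seq, solved) >= threshold
            for solved in new_structures
        )
        status[target_id] = "unselected" if duplicated else "selected"
    return status
```

Run once a week against the sequences of newly released structures, a function like this would flag targets for review rather than silently dropping them.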

Center for Eukaryotic Structural Genomics:
Sesame Project Management System for Structural Genomics

Sesame is a web-based software package designed to organize and record data relevant to complex scientific projects, launch computer-controlled processes, and help users decide on subsequent steps on the basis of the information available. The Sesame system is based on the multi-tier paradigm and consists of a framework and application modules that carry out specific tasks; it can support both high-throughput centers and small labs (down to individual users). For structural genomics, our goal is to cover the whole process, from target selection through data deposition. The related data have been defined to be a superset of the data to be deposited into RCSB (PDB and BMRB). In general, special attention is given to the GUI design in order to simplify the entry of complex data, maximize the quality of the data, and streamline the use of the modules. Whenever possible, the data are harvested from instruments (or from the files produced by them). For security, access to the resources (currently data and processes, expandable to instruments, robots, etc.) is controlled by access privileges set by the owners of the resources. The Sesame system is scalable, and it can support any number of users, labs, collaborative groups, sites, and centers.

Joint Center for Structural Genomics:
Data Centric View of the Structural Genomic Project

JCSG is developing a fully automated, high-throughput structure determination pipeline that relies on robotic substations handling large numbers of targets simultaneously. An integral part of this project is the JCSG database, which tracks ~300 parameters that are deposited automatically, either via web forms integrated with experimental protocols or by exchange of information with local databases maintained by individual experimental groups. The JCSG database was implemented in Oracle on a Sun server, using a dynamic database design schema.  After two years of development the JCSG database is now in full production, and it has quickly become a primary source of information exchange within JCSG. A publicly available interface provides tracking information on all JCSG targets. Internal pages, with over 20 specific reports in various formats (from HTML and Excel to interactive forms), are used not only for tracking but also for day-to-day data management within JCSG.

The biggest challenge for the JCSG database is to provide the needed functionality despite changing expectations from the experimental collaborators in the grant. These expectations began to evolve quickly once the system became fully operational. Usage of the database is currently evolving from simple tracking and “passive” data management to full support of the experimental process, with experimental collaborators relying on the database to provide the information needed for day-to-day planning of experiments. Data deposited in the database over the last two years are the basis for scientific research and analysis using, among other approaches, data mining technologies.

Midwest Center for Structural Genomics

The Midwest Center for Structural Genomics (MCSG) data management system is being developed as a set of independent, distributed databases that collect data at multiple sites and at different stages of the MCSG structure determination pipeline. These databases are linked to the central MCSG database. Data management is organized according to the basic database paradigm: data in, information out.  Communication between databases is accomplished through the central database using ODBC and HTTP protocols. Depending on the type of problem, data are exchanged on demand, daily, or weekly. The local, specialized databases exchange data with laboratory equipment, software applications, and WWW sites. Work on the integration of common laboratory equipment and robotics with laboratory information management systems (LIMS) is in progress. Substantial effort is devoted to the automatic capture of complete experimental data and to minimizing manual data entry. The internal consistency rules of the databases are still being developed and refined. In addition to organizing the workflow, data mining of the accumulated experience is expected to be used to optimize the efficiency of experimental protocols. The central database makes the data generated in the MCSG structure determination pipeline available to the public through a web interface.
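
As a rough illustration of the "data in, information out" arrangement described above, the following sketch copies rows from local site databases into a central one and then asks a simple question of the pooled data. Everything here is an assumption made for illustration: the table layout, the function names, and the use of in-memory SQLite in place of the ODBC and HTTP links the MCSG actually uses.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical experiments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE experiments ("
        "site TEXT, target_id TEXT, stage TEXT, outcome TEXT, "
        "PRIMARY KEY (site, target_id, stage))"
    )
    return conn


def sync_to_central(local: sqlite3.Connection, central: sqlite3.Connection) -> int:
    """Data in: copy all local rows to the central database.

    INSERT OR REPLACE keeps the sync idempotent, so it can be run on
    demand, daily, or weekly without creating duplicate rows.
    Returns the number of rows sent.
    """
    rows = local.execute(
        "SELECT site, target_id, stage, outcome FROM experiments"
    ).fetchall()
    central.executemany(
        "INSERT OR REPLACE INTO experiments VALUES (?, ?, ?, ?)", rows
    )
    central.commit()
    return len(rows)


def success_rate(central: sqlite3.Connection, stage: str) -> float:
    """Information out: fraction of successful experiments at one stage."""
    total, ok = central.execute(
        "SELECT COUNT(*), SUM(outcome = 'success') "
        "FROM experiments WHERE stage = ?",
        (stage,),
    ).fetchone()
    return ok / total if total else 0.0
```

Keying each row on site, target, and stage is one simple way to let a repeated sync overwrite stale rows instead of appending to them; a production system would need finer-grained keys and conflict handling.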

New York Structural GenomiX Research Consortium

A brief overview will be presented of the organization of data management within the Consortium, including the programs, databases, and database software used, as well as other internally developed or externally linked programs for annotation, bioinformatics analysis, and structure modeling. Special attention will be paid to the semi-automatic setup of our LIMS, which reflects the fact that some, but not all, parts of the structure determination effort are centralized and fully automated, while other parts require manual intervention. We will discuss the advantages and bottlenecks we experienced in running and developing the system.

Northeast Structural Genomics Consortium

We present the current version of the SPINE system for structural proteomics.  SPINE is available over the web. It serves as the central hub for the Northeast Structural Genomics Consortium, allowing collaborative structural proteomics to be carried out in a distributed fashion. The core of SPINE is a laboratory information management system (LIMS) for key bits of information related to the progress of the consortium in cloning, expressing and purifying proteins and then solving their structures by NMR or X-ray crystallography. Originally, SPINE focused on tracking constructs, but, in its current form, it is able to track target sample tubes and store detailed sample histories. The core database comprises a set of standard relational tables and a data dictionary that form an initial ontology for proteomic properties and provide a framework for large-scale data mining. Moreover, SPINE sits at the center of a federation of interoperable information resources. These can be divided into (i) local resources closely coupled with SPINE that enable it to handle less standardized information (e.g. integrated mailing and publication lists), (ii) other information resources in the NESG consortium that are inter-linked with SPINE (e.g. crystallization LIMS local to particular laboratories) and (iii) international archival resources that SPINE links to and passes information on to (e.g. TargetDB at the PDB). We describe in detail one federated resource, SPINS, a database for managing NMR information. We also describe a number of data mining analyses on the data stored in SPINE, using decision trees and other approaches.

Southeast Collaboratory for Structural Genomics:
Integrated Data Management and Feedback System at SECSG

An integrated data management and feedback system is near completion at SECSG.  The objectives of the system are: (1) To provide accurate documentation on how an experiment was carried out. (2) To archive all data, including failures, obtained during an experiment.  (3) To provide a central database containing all information needed to describe the gene to structure experiment.  (4) To provide a central directory of all events.  (5) To provide communication between the Center’s various core research groups.  (6) To provide views of the massive data/results in multi-dimensional human readable form.  (7) To provide automatic structure deposition to the Protein Data Bank. (8) To correlate the Center’s data/results with data generated elsewhere through automated bioinformatics data mining techniques.

To build such a system, databases were first established and refined in each of the individual research cores: protein production, X-ray crystallography and NMR.  The bioinformatics/information technology core then provided the necessary wrappers to allow communication between the individual databases and the central database through an XML-based information exchange layer that is independent of operating system and/or database.  Web-based interfaces have also been developed and are operational.  Various bioinformatics software tools have also been implemented on a local Linux cluster, which allows data mining and correlation of results obtained locally and/or elsewhere.  The Center also acts as an official mirror site for the IBM Teiresias software and the Weizmann Institute’s OCA and GeneCards packages.   Progress on this integrated system will be discussed.
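
The idea of a wrapper that passes database records through an XML exchange layer can be illustrated with a minimal sketch. The element names, the "core" attribute, and both function names are hypothetical and do not reflect the SECSG's actual schema; the point is only that a row serialized this way no longer depends on the operating system or database that produced it.

```python
import xml.etree.ElementTree as ET


def record_to_xml(core: str, record: dict[str, str]) -> str:
    """Wrap one database row from a research core as an XML message.

    Field names are used as element names, so they must be valid XML
    tag names (an assumption of this sketch).
    """
    root = ET.Element("experiment", attrib={"core": core})
    for field_name, value in record.items():
        ET.SubElement(root, field_name).text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_record(message: str) -> tuple[str, dict[str, str]]:
    """Unwrap an XML message into (core, row) on the receiving side."""
    root = ET.fromstring(message)
    row = {child.tag: child.text or "" for child in root}
    return root.get("core", ""), row
```

A sending core would call record_to_xml on each row and the central database's wrapper would call xml_to_record, so neither side needs to know how the other stores its data.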

Structural Genomics of Pathogenic Protozoa Consortium

The Structural Genomics of Pathogenic Protozoa (SGPP) consortium includes at least 12 units spread over 6 sites in 3 states. These units store data in several different forms. For target selection, targets are identified using Microsoft's SQL Server with Access and web front ends; domain predictions are in flat files on Linux clusters, with a MySQL database under construction; large volumes of two-hybrid data will need to be accessed from Macintoshes. Protein production data are recorded in Access and lab notebooks at one site, and in Excel and PowerPoint at another. We are now moving this second site to a more standardized data management system. Crystallization screening produces compressed image archives from a MySQL database; the crystal growth lab uses Access, MySQL/XML and Oracle databases as well as Excel. Data structures for crystal testing and annealing are still under development. Each of the beamlines has its own internal database for X-ray data collection. Structure determination involves a collection of well-structured flat files. For purposes of reporting status to the public and for tracking more detailed progress internally, SGPP is now gathering summary data from these local data sources into a central Access/MySQL database, with scripts to export status information to the SGPP website and to the standard XML format file. The plan is to replace this with a more scalable system to allow consortium-wide data mining queries.

TB Structural Genomics Consortium

The TB Structural Genomics Consortium (TBSGC) is unique among all the PSI P50 centers because of its sheer size and the diversity of its participating groups. The autonomous framework under which our consortium functions makes archiving and disseminating data extremely challenging. Consequently, the present LIMS setup at the TBSGC has evolved, after major revisions, to be more convenient and easier to use in order to encourage TBSGC members to submit data. To accomplish this we have recently introduced a new module, the EMT (Experimental Management Tool), which, together with the XTAL_DB module (for gathering crystallization experiment information), allows our system to collect and archive data from targeting, cloning and protein production through to crystallization.

Our current online notebook has an embedded tracking module that allows us to track the history of a target, including whether any material has been sent to a facility or to another consortium member. This is important for giving due credit to members who worked on a target that they did not select themselves, and it also allows all relevant information about the history of a target, from ORF to structure, to be submitted to the PDB. We believe that once complete histories of targets can be displayed for members and the data put to use, members will be more motivated to do the work of submitting their data. With complete experimental histories of targets, we can provide analysis tools for members and assist them in choosing appropriate cloning, solubility, purification, and crystallization methods. We are also implementing new analysis algorithms to aid targeting through improved information on ORFs, such as better function annotation and predicted fold.
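
A target history of the kind described above might be modeled as a simple ordered event log. This is a hypothetical sketch only: the class names, fields, and example values are assumptions and say nothing about how the TBSGC tracking module is actually implemented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    member: str   # who did the work or received the material
    step: str     # e.g. "cloning", "purification", "crystallization"
    outcome: str  # e.g. "success", "failure", "shipped"


@dataclass
class TargetHistory:
    orf: str
    events: list[Event] = field(default_factory=list)

    def record(self, member: str, step: str, outcome: str) -> None:
        """Append one event; the list order is the target's history."""
        self.events.append(Event(member, step, outcome))

    def contributors(self) -> list[str]:
        """Members who worked on the target, in order of first contribution.

        This is what lets credit go to members who worked on a target
        they did not select themselves.
        """
        seen: list[str] = []
        for event in self.events:
            if event.member not in seen:
                seen.append(event.member)
        return seen

    def latest_step(self) -> Optional[str]:
        """Most recent step reached, or None if nothing is recorded."""
        return self.events[-1].step if self.events else None
```

Because every event is kept, failures included, the same log could later feed the analysis tools mentioned above.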

We also have a method in place that collects information for high throughput experiments from the crystallization facility.  Once submitted, we provide this data to the public and users can search for and locate conditions that produced crystals. The cloning facility at Los Alamos is also submitting entire protocols to our database and we are currently implementing an automatic system for collecting and interpreting this data, which can then be utilized within the EMT.

The Protein Data Bank and Data Management

The PDB is committed to enabling rapid automatic deposition to the archive, providing a database for tracking the progress of PSI projects and making tools available for all aspects of data management.

In this presentation the system for collecting data from structure determination projects is presented followed by a discussion of the special requirements for structural genomics. These include an extended dictionary of data items for protein production, as well as the items involved in x-ray and NMR structure determination. Tools for automatically extracting these latter terms from computer output have been developed and are described. Finally the functionality of the TargetDB is presented.

All of the software developed by the PDB for data processing is freely available as open source and can be used for collecting information within each PSI project as well as for depositing the final results to the PDB. This functionality is demonstrated.

BioMagResBank (BMRB)

BioMagResBank (BMRB) is a repository for quantitative data on biological macromolecules derived from NMR spectroscopy. The current archive has over 2700 entries representing more than one million assigned 1H, 13C, and 15N chemical shifts for proteins, DNA, and RNA molecules and smaller collections of coupling constant, relaxation, and other data.  Over 1500 NMR constraint files in the PDB archive have been annotated, sorted by constraint types and software format, and made available from the BMRB web site through a query interface.  The archive is used by research groups to derive correlations between NMR chemical shifts and protein secondary and tertiary structure, protein disulfide bonding, and proline isomerization.  Analyses of large quantities of chemical shift data from BMRB are also used to investigate conformations of denatured proteins, and to develop tools for automating the protein NMR spectral peak assignment process.  The growing collections of NMR time-domain data sets, coupling constants, relaxation parameters, residual dipolar couplings, and other data will be used to further our understanding of biomolecular structure and to develop improved NMR spectral analysis algorithms.  In collaboration with the Collaborative Computational Program on NMR (CCPN) group, the European Bioinformatics Institute (EBI), and the European Structure Validation group, BMRB is normalizing the constraint data with the corresponding PDB coordinate entry and developing techniques for characterizing the constraint information.

BMRB is actively collaborating with the CESG and NESG in implementing direct depositions of all data related to structural genomics NMR derived structures.  The direct deposition process utilizes tools being developed in collaboration with the PDB to construct the ADIT-NMR deposition system for the acquisition of NMR derived data archived at the BMRB and the PDB.  The system is based on the new NMR-STAR version 3.0 dictionary, which is designed to be compatible with the mmCIF and pdbx dictionaries. 

We gratefully acknowledge our many fruitful interactions with members of the scientific community, the PDB, the CCPN, and the EBI. BMRB is supported by grant LM05799 from the US National Library of Medicine.

Data Content and Requirements Summary

The first part of the discussion concerned the content requirements for PDB deposition. Dictionaries have been established for the description of the x-ray crystallographic and NMR experiments. There is also a dictionary for protein production.  Tools are in place to extract much of the information for both the x-ray and NMR pipelines. In the fall the PDB will make the full structural genomics view of ADIT available for testing by the community. It is recognized that it may not be possible to capture all of the protein production data items. However, the availability of the expanded view will show how much information can realistically be collected.

The second part of the discussion concerned how best to collect information about terminated structures.  It was agreed that the rationale for the “stop” status should be made clearer on TargetDB.  Two models for collecting information about protein production emerged. The first model would involve collecting all the data about protein production in an atomized way using the protein production dictionary. The second would collect some of the data items about source and crystallization in an atomized way, together with full-text protocols for cloning, expression and purification. There was discussion about the practicalities of collecting all negative results. Many felt that it was very useful, if not vital, to collect this information, but a clear picture of how this might be done in practice did not emerge.  It was agreed that having a site with all the protein production information would be an invaluable service to molecular biologists and biochemists. This site could include both "stopped" and active targets. It might be possible to make this information part of the TargetDB site, which would then require renaming.

In order to evaluate the alternatives more fully, a discussion forum has been set up. It will allow users to discuss which data items should be required, which data items need to be added, and which protocols should be collected.

Data Acquisition Summary

Communicating data between laboratory equipment, software applications, WWW sites, humans and databases is a major challenge within the structural genomics pilot projects.  The objective of the Data Acquisition discussion was to exchange ideas on successful, working examples of data acquisition systems, and in particular on direct data entry (without human intervention) from the molecular biology laboratory.  Center representatives presented examples of the integration of common laboratory equipment and robotics with laboratory information management systems (LIMS), as well as methods that minimize the need for human data entry and maximize the capture of complete and unbiased experimental details and results that satisfy the internal consistency rules of the LIMS.  The acquisition of data from humans continues to be a problem at all of the centers, although it was noted that compliance with data entry should improve in a production environment. A substantial amount of discussion was dedicated to the implementation of the basic database paradigm: data in, information out.  The data mining of reliable and consistent databases will enhance our ability to define the parameters that lead to both successful and unsuccessful experiments (information out).

Database Functionality, Structure, and Tools Summary

(1) There was discussion comparing monolithic databases with more federated architectures.  A number of advantages were pointed out for the monolithic approach, such as greater efficiency; the experience of the PDB was cited, where a number of different databases are being merged into a single one.  It was agreed that monolithic databases can be more efficient and integrated.  However, many advantages were also mentioned for a more federated approach.  It better matches the social structure of the current structural genomics program, which is dispersed among a number of different centers.  Furthermore, it was not felt that it would be possible to integrate laboratory databases deeply with a centralized resource, because of the different cultures of the different laboratories.  Dr. Brenner gave a very good example from the Berkeley Center, where an attempt to adapt a number of databases was unsuccessful.  No clear consensus was reached. A compromise was the idea of having a central hub in structural genomics connected to a number of satellite databases.  It was pointed out that if structural genomics can figure out how to use databases to help expedite scientific collaboration in a distributed fashion, this might be a model for science in general.

(2) The discussion then moved to interchange formats.  It was agreed that it is very important for the databases to be able to interchange information, whether they are in a federated structure or a more centralized one.  A number of interchange approaches were proposed, such as CORBA, XML, and various web services such as SOAP and DAS. Most of the discussion focused on XML. The PDB showed a demo of a nice piece of software that enables easy generation of web forms.  The discussion then turned to the importance of mining the data.  It was pointed out that we cannot always anticipate how the data will be mined.  One good example cited was the NMR restraints data deposited in the BMRB, which are now being mined long after they were deposited.  Finally, it was pointed out that the creation of large, highly standardized data sets related to protein production is a unique contribution of structural genomics that clearly differentiates it from conventional structural biology.

Data Dissemination, Exchange, and Publication Summary

The objective of the Data Dissemination discussion was to exchange ideas on disseminating the experience and successes of structural genomics to the scientific community. SGI centers, with their unique, high throughput approach to structure determination, collect information on the experimental process in unprecedented detail and are much more open to the idea of disseminating this information to the public, even at preliminary stages prior to structure determination. As a result, the structural genomics community is in an ideal position to define and implement new, much higher standards for the public deposition of structures and to develop new tools that make deposits uniform and consistent. In collaboration with the PDB team and with the support of NIH, we are in a position to popularize this new standard with the rest of the community.

Direct data entry and the integration of common laboratory equipment and robotics with laboratory information management systems, partly implemented in some centers, create unbiased experimental data that will satisfy the internal consistency rules of any database. Based on its experience with automatic deposition, the SG community can create an easy-to-use deposition tool that may be used by all crystallographers. The new PDB deposit should include all information about the protocols used at every step of structure determination, along with all relevant information from those steps. The new standard of PDB deposition would aim at providing all the information necessary to repeat any structure solution reported to the PDB.

A substantial amount of discussion was dedicated to the idea of creating a database that would keep the results of protein production and crystallization attempts for targets that have not yet been solved. Easy access to such a database would be invaluable for the whole community conducting medically oriented research.

The discussants agreed that TargetDB, a database that collects XML files from all the structural genomics centers and allows the public to monitor their progress, is one of the most successful data dissemination projects to originate from within PSI. However, it is not promoted properly and is almost unknown in the general biological community. Various strategies to enhance its usability and community profile were discussed.