Data Harvesting: automatic extraction of information necessary

Data Harvesting: automatic extraction of information necessary for the deposition of structures from protein crystallography Martyn Winn CCP4, Daresbury Laboratory. Progress of a PX project. Structure Deposition (PDB @ RCSB, EBI). Data Collection (synchrotron, home source).


Presentation Transcript


  1. Data Harvesting: automatic extraction of information necessary for the deposition of structures from protein crystallography
     Martyn Winn, CCP4, Daresbury Laboratory

  2. Progress of a PX project
     • Data Collection (synchrotron, home source)
     • Structure Solution (CCP4 etc.)
     • Structure Deposition (PDB @ RCSB, EBI)
     • Database Queries

  3. PROTEIN DATABANK
     • international repository for the processing and distribution of 3-D macromolecular structure data determined experimentally by X-ray crystallography and NMR.
     • data deposited to PDB at RCSB (U.S.) and EBI (U.K.)

  4. USES OF PDB
     • Retrieval of data of single structure
     • Global searches (e.g. for molecule name, particular cofactor, etc.)
     • Generating statistics (e.g. structures vs. resolution)
     • Derived databases (e.g. ReLiBase, scop/CATH)

  5. Examples of deposited information
     • Name of source organism
     • Reference to sequence database entry
     • Temperature of diffraction expt.
     • No. of unique reflections
     • Rmerge as function of resolution
     • Starting model for molecular replacement
     • Restraints used in refinement
     • Identification of secondary structure elements
     • Atomic coordinates and structure factor amplitudes

  6. HARVESTING CONCEPT
     • Pioneered by EBI deposition centre.
     • Data Harvesting is a protocol for communicating relevant data from Software to Deposition Site
     • Why?
       • More reliable data
       • Richer database

  7. HARVEST: Action
     • Action of harvesting is entirely local.
     • A run of a program captures all significant information produced by that program run, and stores it in a (date-stamped) file.
     • Control of the contents of the harvest file is by the developers of the software being run and the researcher running it.

  8. HARVEST: File Format
     • mmCIF has been selected as the format to represent harvest (deposition) data items
     • several files are generated
     • mmCIF relationships not necessarily maintained
     • ‘TRUE’ final complete mmCIF file only generated after complete processing of a submission at the deposition site

  9. Identifying harvesting files
     • Each run of a harvesting program produces a single file.
     • Files identified by Project Name and Dataset Name.

  10. Project Name
      • Project Name is the individual’s in-house laboratory code for a structure that will eventually be deposited
      • Equivalent to a PDB idcode or _entry.id
      • E.g.
        • A new native structure
        • A mutant structure
        • A ligand protein complex

  11. Dataset Name
      • Dataset Name is an individual’s code to represent each experiment carried out to solve a particular Project Name.
      • Equivalent to _diffrn.id
      • E.g.
        • Each wavelength in a MAD experiment
        • Each Heavy atom derivative
        • Each different NMR experiment carried out in the course of a structure determination
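The mapping from Project Name and Dataset Name to mmCIF items can be sketched as follows. This is a minimal illustration, not the CCP4 implementation: the function name `harvest_header` is hypothetical, and the `data_Project[Dataset]` block naming and the `_audit.creation_date` layout are inferred only from the SCALA example on slide 15.

```python
from datetime import datetime, timezone


def harvest_header(project, dataset, when=None):
    """Return the identifying lines of a harvest file as one string.

    Assumption: block name and timestamp layout follow the SCALA
    example in the slides; the real library may differ.
    """
    if when is None:
        when = datetime.now(timezone.utc)
    stamp = when.isoformat(timespec="seconds")
    return "\n".join([
        f"data_{project}[{dataset}]",        # data block: Project[Dataset]
        f"_entry.id {project}",              # Project Name
        f"_diffrn.id {dataset}",             # Dataset Name
        f"_audit.creation_date {stamp}",     # date stamp of this run
    ])
```

Passing an explicit `when` makes the output reproducible, which is convenient for testing; a real program run would presumably use the current time.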

  12. Management of harvest Files
      • CCP4 Prototype uses a directory in $HOME to store harvest files with file names: $HOME/DepositFiles/PName/DName.ProgName_mode
      • Files sent to EBI at time of deposition.
      • Ultimately the individual research worker is responsible for the management of their own data files.
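The file naming scheme above can be sketched in a few lines. The helper `harvest_path` is hypothetical, and the slides do not say how the `_mode` suffix is filled in; here it is assumed to be an optional free-text string appended after the program name.

```python
from pathlib import PurePosixPath


def harvest_path(home, project, dataset, program, mode=None):
    """Build HOME/DepositFiles/PName/DName.ProgName_mode.

    Assumption: `mode` is optional and simply appended after an
    underscore; the slides do not define it further.
    """
    suffix = program if mode is None else f"{program}_{mode}"
    return PurePosixPath(home) / "DepositFiles" / project / f"{dataset}.{suffix}"
```

For example, `harvest_path("/home/user", "pbp", "native", "scala")` would give `/home/user/DepositFiles/pbp/native.scala`. Keeping one directory per project means that all files for a deposition can be collected by copying a single directory.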

  13. HARVEST: Problems
      • Management of harvesting files:
        • A structure may be solved by more than one user
        • A structure may be solved using different machines not NFS connected
        • More than one run and which run is FINAL?
      • Scope of harvesting:
        • Need to persuade software authors to adopt protocol
        • Still need manual addition/checking of information

  14. Implementation in CCP4
      • Harvesting files produced by:
        • [MOSFLM] (data processing)
        • SCALA / TRUNCATE (data reduction)
        • MLPHARE (phasing)
        • RESTRAIN / REFMAC (refinement)
      • Associated libraries:
        • libccif - Peter Keller’s suite of routines to read and write mmCIF files
        • harvlib.f - Kim Henrick’s Fortran front end to libccif
      Public release - January 2000

  15. Example: SCALA output (1)
      data_phosphate_binding_protein[A197C_chromophore_x]
      _entry.id phosphate_binding_protein
      _diffrn.id A197C_chromophore_x
      _audit.creation_date 1997-10-30T12:43:41+00:00
      _software.classification 'data reduction'
      _software.contact_author 'P.R. Evans'
      _software.contact_author_email pre@mrc-lmb.cam.ac.uk
      _software.description 'scale together multiple observations of reflections'
      _software.name Scala
      _software.version 'CCP4_2.2.3 1/7/97'

  16. Example: SCALA output (2)
      _diffrn_reflns.d_res_low 35.36
      _diffrn_reflns.d_res_high 3.00
      _diffrn_reflns.number_measured_all 17986
      _diffrn_reflns.number_unique_all 6645
      _diffrn_reflns.number_centric_all 363
      _diffrn_reflns.number_anomalous_all 2348
      _diffrn_reflns.Rmerge_I_anomalous_all 0.050
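Simple tag/value items like those in slides 15 and 16 can be read back with a short sketch such as the one below. It is not a general mmCIF parser and does not use libccif: the function `parse_harvest_items` is hypothetical, it handles only one item per line, and it borrows shell-style quoting rules from `shlex`, which only approximate mmCIF quoting.

```python
import shlex


def parse_harvest_items(text):
    """Return (data block name, {tag: value}) for one-line mmCIF items.

    Assumption: no loop_ constructs or multi-line values, and quoted
    strings that shell-style splitting can handle.
    """
    block = None
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("data_"):
            block = line[len("data_"):]
        elif line.startswith("_"):
            tokens = shlex.split(line)
            if len(tokens) == 2:         # tag followed by a single value
                items[tokens[0]] = tokens[1]
    return block, items
```

All values are returned as strings; converting numeric items such as `_diffrn_reflns.d_res_high` to floats is left to the caller, since the item name alone does not carry the type here.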

  17. User Input
      • For each program run, user can specify:
        • Project Name
        • Dataset Name
        • USECWD - write harvest file to cwd rather than deposit directory; useful for trial runs
        • NOHARVEST - do not write harvest file
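The effect of the USECWD and NOHARVEST keywords can be sketched as a small decision function. The name `harvest_destination` is hypothetical. Two things are assumptions not stated in the slides: that NOHARVEST takes precedence when both keywords are given, and that the file keeps the same `DName.ProgName` name in the current directory. The `_mode` suffix is left out for brevity.

```python
from pathlib import PurePosixPath


def harvest_destination(project, dataset, program, keywords, home, cwd):
    """Return where the harvest file would go, or None if suppressed."""
    given = {k.upper() for k in keywords}
    if "NOHARVEST" in given:             # assumed to override USECWD
        return None
    filename = f"{dataset}.{program}"
    if "USECWD" in given:                # trial run: keep file local
        return PurePosixPath(cwd) / filename
    return PurePosixPath(home) / "DepositFiles" / project / filename
```

Returning `None` rather than raising keeps the no-harvest case an ordinary outcome that the calling program can check before writing anything.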

  18. Automation
      • All that a program needs to know is the Project Name and Dataset Name
      • This information is carried between programs in the header section of the reflection file (MTZ file)
      • Information is written to the reflection file as soon as possible (ideally written to image files and passed on).

  19. Current status
      • Harvesting software released as part of CCP4 in January 2000. No harvesting files sent to EBI as yet (early days!)
      • CNS also produces harvesting files, and these have seen some use
      • Plans to extend the concept to data from NMR and EM

  20. Acknowledgements
      • Kim Henrick, Peter Keller (EBI)
      • Eleanor Dodson, Phil Evans (CCP4)
      • BBSRC
      http://www.dl.ac.uk/CCP/CCP4/newsletter35/dataharvest.html
      http://www.dl.ac.uk/CCP/CCP4/newsletter37/13_harvest.html
