Structure Databases: The Protein Data Bank

Swanand Gore & Gerard Kleywegt • PDBe – EBI • May 7th 2010, 9-10 am • Macromolecular Crystallography Course

Outline • Structural Biology and Bioinformatics • Databases in Structural Bioinformatics • Protein Data Bank • PDBe

Promise of Structural Biology • Basic research • Insights in biophysics of folding • Insights into Evolution • Insights into enzymatic catalysis • Applications • Design of drug / antibody / epitope / pesticide / enzymes • Design of new materials • Understanding disease • Structural bioinformatics • Big computational and informatics toolbox • Full of techniques to translate insights to application • Databases are a vital aspect

Sequence-Structure-Function • [Diagram: Sequence → Structure → Function, annotated with: Prediction, Modelling, Determination, Archival / Retrieval, Classification, Searching, Mining, Comparison, Alignment, Design, Engineering]

A rich toolbox

Databases are central to the structural bioinformatics pipeline • [Diagram: Primary Structural Databases and Secondary Structural Databases at the centre of: Determine, Annotate, Model, Predict, Align, Compare, Mine, Classify]

Databases help in Structure Determination • Dihedral preferences • Ramachandran contours • Side-chain rotamer libraries • RNA backbone and puckers • Likely ring conformations • Small molecules (CCDC) • Molecular replacement • Choice of probe using homology • Fragment-based MR • Validation • Electron density server and PrEDS • Dunbrack, R.L., Jr. Rotamer libraries in the 21st century. Curr. Opin. Struct. Biol. 12:431-440, 2002. • Jane S. Richardson et al. (2008) "RNA Backbone: Consensus All-angle Conformers and Modular String Nomenclature (an RNA Ontology Consortium contribution)" RNA 14:465-481 • F. H. Allen. The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Cryst. B58, 380-388, 2002 • S.C. Lovell et al. (2003) "Structure Validation by Cα Geometry: φ,ψ and Cβ Deviation." Proteins: Structure, Function and Genetics 50, 437-450. • Claude et al. CaspR: a web server for automated molecular replacement using homology modelling. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W606-9. • McCoy, A.J., Grosse-Kunstleve, R.W., Adams, P.D., Winn, M.D., Storoni, L.C. and Read, R.J. (2007). Phaser crystallographic software. J. Appl. Cryst. 40: 658-674. • Gubbi et al. (2007) Solving Protein Structures Using Molecular Replacement Via Protein Fragments, Lecture Notes in Artificial Intelligence, Vol. 4578, 627. • G.J. Kleywegt et al. (2004) "The Uppsala Electron-Density Server", Acta Crystallographica, D60, 2240-2249
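The dihedral preferences and Ramachandran contours listed above are statistics over backbone torsion angles (φ is the torsion C(i-1)-N-Cα-C, ψ is N-Cα-C-N(i+1)). As a minimal sketch of the underlying calculation, the function below returns the signed torsion angle defined by four points. The function and helper names are illustrative and are not taken from any of the cited tools.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond (range -180..180)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Toy coordinates (not from a real structure): a right-angle twist and a cis arrangement.
twist = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
```

Applied to consecutive backbone atoms of every residue in a set of structures, angles like these are the raw material for rotamer libraries and Ramachandran plots.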

Databases are vital to archiving structures! • Structures represent invaluable scientific insights • But it is costly to solve a structure • Time, effort, money • Organize and safe-keep painstakingly determined data • Formal mechanisms of arranging, searching, backing up • Wide-ranging access to an invaluable repository without compromising data integrity • Very low cost of maintenance in comparison with the cost of content!

Databases are vital to archiving structures • “Database is a structured collection of data held in computer storage, often incorporating software to make it accessible in various ways” • Databases • Provide accessibility with safety and persistence • Provide context for your data against other data • Facilitate comparisons and data-mining • Primary structural databases • Experimental data and model coordinates • NDB, wwPDB, BMRB, CSD, EMDB • Secondary structural databases • Classification, function annotation • SCOP, EC2PDB, PALI, and many many more!

Databases / Archival / Retrieval • Formats of databases • Flat files (csv, tsv, columnar), supporting scripts • Relational (MySQL, Oracle): professional, indexed • Access • Modes: read, write, edit, delete (PDB provides entry deposition mechanisms) • Means: Download (wwPDB ftp), Command-line or GUI (SQL queries, Oracle desktop client), Web-based interfaces (PDBeDatabase service) • Access frequency • Schema design • Tables, primary keys, foreign keys, views…. • Normal forms: avoid data repetition, inconsistencies
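The schema-design bullets above (tables, primary keys, foreign keys, views) can be illustrated with a small relational sketch. It uses Python's built-in sqlite3 module rather than MySQL or Oracle, and the table and column names are invented for illustration; they are not the PDBe schema. The entry ID 9XXX is a deliberately non-existent placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign-key checks off by default

conn.executescript("""
CREATE TABLE entry (
    pdb_id         TEXT PRIMARY KEY,
    classification TEXT NOT NULL
);
CREATE TABLE chain (
    pdb_id   TEXT NOT NULL REFERENCES entry(pdb_id),
    chain_id TEXT NOT NULL,
    PRIMARY KEY (pdb_id, chain_id)
);
CREATE VIEW entry_summary AS
    SELECT e.pdb_id, e.classification, COUNT(c.chain_id) AS n_chains
    FROM entry AS e LEFT JOIN chain AS c ON c.pdb_id = e.pdb_id
    GROUP BY e.pdb_id, e.classification;
""")

# The classification is stored once, in entry, and never repeated per chain.
conn.execute("INSERT INTO entry VALUES (?, ?)", ("1CBS", "RETINOIC-ACID TRANSPORT"))
conn.execute("INSERT INTO chain VALUES (?, ?)", ("1CBS", "A"))

# A chain pointing at a missing entry violates the foreign key and is refused.
try:
    conn.execute("INSERT INTO chain VALUES (?, ?)", ("9XXX", "A"))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

summary = conn.execute("SELECT * FROM entry_summary").fetchall()
```

Keeping entry-level facts in one table and referencing them by key is the kind of normalization that avoids the repetition and inconsistencies mentioned above.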

Databases for Classification • Structural hierarchy • CATH • Class, Architecture, Topology, Homology • SCOP • Class, Fold, Superfamily, Family • Enzyme hierarchy • EC-PDB • Oxidoreductase, ligase, lyase, isomerase, hydrolase, transferase • Functional ontology • GOA • Gene Ontology: Cellular Component, Biological Process, Molecular Function • Linked to structures via SIFTS • Christos A. Ouzounis et al. (2003) Classification schemes for protein structure and function. Nature Reviews Genetics 4, 508-519. • Andreeva et al. (2007) Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36:D419 • Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium (2000) Nature Genet. 25: 25-29 • Barrell D. et al. (2009) The GOA database in 2009: an integrated Gene Ontology Annotation resource. Nucleic Acids Research 37: D396-D403.

Databases for Comparison • Structural and structure-sequence alignments • Phylogeny • Evolutionary trace • Evolutionarily important residues • Mapping onto structure • Mizuguchi K, Deane CM, Blundell TL, Overington JP. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Science 7:2469-2471. • SISYPHUS: structural alignments for proteins with non-trivial relationships. Andreeva et al., Nucleic Acids Research Database Issue 2007, 35, D253-D259 • Gowri, V. S. et al. (2003). Integration of related sequences with protein three-dimensional structural families in an updated version of PALI database. Nucleic Acids Res. 31: 486-488. • Bhaduri A, Pugalenthi G, Sowdhamini R. PASS2: an automated database of protein alignments organised as structural superfamilies. BMC Bioinformatics. 2004, 5:35 • DBAli tools: mining the protein structure space. Marc A. Marti-Renom et al. Nucleic Acids Research, doi:10.1093/nar/gkm236 • Whelan, S., P.I.W. de Bakker, & N. Goldman. (2003). Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics 19:1556-1563 • The Pfam protein families database. R.D. Finn et al., Nucleic Acids Research (2010) Database Issue 38:D211-222 • Morgan, D.H., D.M. Kristensen, D. Mittleman, and O. Lichtarge. ET Viewer: An Application for Predicting and Visualizing Functional Sites in Protein Structures. Bioinformatics. 2006 Aug 15;22(16):2049-50

Databases for Annotation • SNPs • Active / allosteric sites • Domains • Servant F. et al. (2002) ProDom: Automated clustering of homologous domains. Briefings in Bioinformatics. vol 3, no 3:246-251 • Marchler-Bauer A et al. CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 2009 Jan;37(Database issue):D205-10 • Hulo N., Bairoch A., Bulliard V., Cerutti L., Cuche B., De Castro E., Lachaize C., Langendijk-Genevaux P.S., Sigrist C.J.A. The 20 years of PROSITE. Nucleic Acids Res. 2007 • SitesBase: a database for structure-based protein–ligand binding site comparisons. Nicola D. Gold and Richard M. Jackson, Nucleic Acids Research, 2006, Vol. 34, Database issue D231-D234 • sc-PDB: an Annotated Database of Druggable Binding Sites from the Protein Data Bank. Esther Kellenberger et al., J. Chem. Inf. Model., 2006, 46 (2), pp 717–727 • Binding MOAD, a high-quality protein–ligand database. Mark L. Benson et al., Nucleic Acids Research 2008 36(Database issue):D674-D678 • SNPeffect v2.0: a new step in investigating the molecular phenotypic effects of human non-synonymous SNPs. Joke Reumers et al., Bioinformatics 2006 22(17):2183-2185

Databases for Annotation • Surface properties, cavities: Voronoia • Binding partners • Small molecule: TIMBAL, CREDO • Protein, DNA: PIBASE, JAIL, BIPA • Residues critical to enzyme mechanism • CREDO: A Protein-Ligand Interaction Database for Drug Discovery. Adrian Schreyer, Tom Blundell. Chemical Biology & Drug Design, Vol. 73, No. 2. (February 2009), pp. 157-167 • BIPA: a database for protein–nucleic acid interaction in 3D structures. Semin Lee and Tom L Blundell, Bioinformatics 2009 25(12):1559-1560 • PIBASE: a comprehensive database of structurally defined protein interfaces. Davis FP and Sali A, Bioinformatics. 2005 May 1;21(9):1901-7. • JAIL: a structure-based interface library for macromolecules. Stefan Günther et al. Nucleic Acids Res. 2009 January; 37(Database issue): D338–D341 • Elke Michalsky et al., SuperLigands: a database of ligand structures derived from the Protein Data Bank, BMC Bioinformatics 2005, 6:122 • Voronoia: analyzing packing in protein structures. Rother K et al. Nucleic Acids Res. 2009 Jan;37(Database issue):D393-5. • CASTp: Computed Atlas of Surface Topography of proteins. Binkowski et al. Nucleic Acids Res. 2003 Jul 1;31(13):3352-5. • The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Craig T. Porter, Gail J. Bartlett, and Janet M. Thornton (2004) Nucl. Acids. Res. 32: D129-D133.

Databases of Analysis / Mining • Secondary structure: SSEP • Active sites • Protein-peptide interactions • Loop databases • Protein Coil Library • Protein Loop Classification • Loops in Proteins • Protein Topology Graph Library • Frequent structural motifs • Oliva et al. (1997) An automated classification of the structure of protein loops. J Mol Biol 266 (4): 814-830. • SSEP: secondary structural elements of proteins. V. Shanthi, P. Selvarani, Ch. Kiran Kumar, C. S. Mohire and K. Sekar, Nucleic Acids Research, 2003, Vol. 31, No. 13 3404-3405 • PepX: a structural database of non-redundant protein-peptide complexes. Vanhee F et al., Nucleic Acids Res. 2010 Jan;38(Database issue):D545-51. • Baeten L, et al. (2008) Reconstruction of Protein Backbones from the BriX Collection of Canonical Protein Fragments. PLoS Comput Biol 4(5): e1000083. doi:10.1371/journal.pcbi.1000083 • Bystroff C & Baker D. (1998). Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 281, 565-77. • LigBase: a database of families of aligned ligand binding sites in known protein sequences and structures. Stuart AC et al., Bioinformatics. 2002 Jan;18(1):200-1. • PTGL: a web-based database application for protein topologies. Patrick May et al. Bioinformatics 2004 20(17):3277-3279; doi:10.1093/bioinformatics/bth367 • Fitzkee, N. C., Fleming, P. J, Rose G. D. (2005) The Protein Coil Library: a structural database of nonhelix, nonstrand fragments derived from the PDB. Proteins. 58 (4): 852-4.

Databases in Prediction • Oligomeric state • PISA at PDBe • 3D coordinates • Ab initio folding • Homology models • Possible binding partners and binding modes • Small-molecule (PRECISE) • Protein-protein (ADAN) • Dynamics, conformational changes • MolMovDB • Cellular location • LOC3D: annotate sub-cellular localization for protein structures. Nair R, Rost B., Nucleic Acids Res. 2003 Jul 1;31(13):3337-40. • MolMovDB: analysis and visualization of conformational change and structural flexibility. Echols N et al., Nucleic Acids Res. 2003 Jan 1;31(1):478-82. • ADAN: a database for prediction of protein-protein interaction of modular domains mediated by linear motifs. Encinar JA et al., Bioinformatics. 2009 Sep 15;25(18):2418-24. Epub 2009 Jul 14. • PRECISE: a Database of Predicted and Consensus Interaction Sites in Enzymes. Shu-Hsien Sheu et al., Nucleic Acids Research, 2005, Vol. 33, Database issue D206-D211 • MODBASE, a database of annotated comparative protein structure models and associated resources. Ursula Pieper et al., Nucleic Acids Research 37, D347-D354, 2009. • Krissinel E, Henrick K. Inference of macromolecular assemblies from crystalline state. J. Mol. Biol. (2007) 372:774–797. • S. M. Larson. Folding@Home and Genome@Home: Using distributed computing to tackle previously intractable problems in computational biology. Mod Meth Comp Biol, R. Grant, ed, Horizon Press (2003)

Specialized databases with structures • MCSIS (GPCRs, Prions etc) • Carbohydrates • KEGG Glycans • Antibodies (Abysis) • Lysozymes • Abysis: http://www.bioinf.org.uk/abysis/ • Horn F., Vriend G., Cohen FE. Collecting and harvesting biological data: the GPCRDB and NucleaRDB information systems. Nucleic Acids Res. 29:346-349 (2001) • LySDB: Lysozyme Structural DataBase. Mohan KS et al., Acta Crystallogr D Biol Crystallogr. 2004 Mar;60(Pt 3):597-600.

The Protein Data Bank • Unique primary database • Single archive of experimentally determined macromolecular (biopolymer) structures • ~ 65000 entries • Distributed online • Updated weekly • Numerous databases derived and enriched with PDB data • Many front ends: RCSB, PDBe, PDBsum, OCA, MMDB, Jena, SIB • “The PDB” is a flat-file archive • PDB-formatted coordinate files • Any experimental data, when submitted

The Protein Data Bank • International Effort • Curated by RCSB, PDBe, PDBj, BMRB • ftp archive currently operated by RCSB

FTP traffic at PDB sites • RCSB PDB: 200 million data downloads • PDBe: 37 million data downloads • PDBj: 14 million data downloads

The Protein Data Bank • When is a biopolymer PDB-worthy? • Polypeptides • Gene products • Non-ribosomal • Synthetic peptides > 23 residues • Unless clearly biologically significant • Polynucleotides • > 3 residues • Sugars • > 3 sugar residues • Fibers • Only repeating unit deposited

Annual Growth of PDB • Primary databases differ by orders of magnitude in size • PDB: < 10^5 structures • GenBank: 10^11 base pairs, 10^8 gene sequences • UniProtKB: 10^7 protein sequences • http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100 • http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html • http://www.ebi.ac.uk/uniprot/TrEMBLstats/

Annual Growth of PDB • Dominated by X-ray! • EM rising…

Redundancy in PDB (as of Nov ’08) • Entries > 54,000 • Chains > 120,000 • Copies of a chain in same entry • Homo-oligomers • Same chains in different entries • Determined by multiple labs • Determined under different conditions • Complexed with different partners • Mutants • Chains < 8700 at seq. id. < 30% • Orthologs, paralogs are very similar • Using non-redundant chains from PDB • PISCES server • WHATIF, CATH, SCOP, DALI sets • G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003.
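The simplest layer of the redundancy described above, identical chains repeated within one entry or across entries, can be collapsed by grouping chains on their exact sequence. The sketch below does only that; it is not the PISCES method and does not perform culling at a sequence-identity threshold such as 30%, which needs pairwise alignment. Entry names and sequences are made-up toy data.

```python
from collections import defaultdict

def group_identical_chains(chains):
    """Group (entry, chain) keys that carry exactly the same sequence."""
    groups = defaultdict(list)
    for chain_key, sequence in chains.items():
        groups[sequence].append(chain_key)
    return {seq: sorted(keys) for seq, keys in groups.items()}

def representatives(chains):
    """Pick one chain (the first in sorted order) per distinct sequence."""
    return sorted(keys[0] for keys in group_identical_chains(chains).values())

# Hypothetical chains: a homodimer, the same protein in a second entry, and a mutant.
toy_chains = {
    ("ent1", "A"): "MKVLAT",
    ("ent1", "B"): "MKVLAT",
    ("ent2", "A"): "MKVLAT",
    ("ent3", "A"): "MKVLGT",
}
reps = representatives(toy_chains)
```

Note that the point mutant survives as its own representative here, which is exactly why identity-threshold culling with a server such as PISCES is used in practice.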

File formats at PDB • The .pdb format • Header • Remarks • Experimental setup • Refinement details • Oligomeric state • Deviations from expected geometry • Biochemical entities • Biopolymers, het groups • Coordinates • 3D model of the entity • Multiple coordinates for the same entity can exist • MODELs, altloc identifiers • Structure factors • .cif file

File formats at PDB XML mmCIF

The PDB format: header
123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+
HEADER    RETINOIC-ACID TRANSPORT                 28-SEP-94   1CBS      1CBS   2
COMPND    CELLULAR RETINOIC-ACID-BINDING PROTEIN TYPE II COMPLEXED      1CBS   3
COMPND   2 WITH ALL-TRANS-RETINOIC ACID (THE PRESUMED PHYSIOLOGICAL     1CBS   4
COMPND   3 LIGAND)                                                      1CBS   5
SOURCE    HUMAN (HOMO SAPIENS)                                          1CBS   6
SOURCE   2 EXPRESSION SYSTEM: (ESCHERICHIA COLI) BL21 (DE3)             1CBS   7
SOURCE   3 PLASMID: PET-3A                                              1CBS   8
SOURCE   4 GENE: HUMAN CRABP-II                                         1CBS   9
AUTHOR    G.J.KLEYWEGT,T.BERGFORS,T.A.JONES                             1CBS  10
REVDAT   1   26-JAN-95 1CBS    0                                        1CBS  11
• Columns 1-6: record type • Columns 7-72: human-readable, mostly textual information

The PDB format: coordinates
HETATM    1  C   ACE A   0       4.279  14.829  14.190  1.00 19.08           C
HETATM    2  O   ACE A   0       3.706  14.098  15.038  1.00 20.62           O
HETATM    3  CH3 ACE A   0       3.827  16.236  14.001  1.00 20.22           C
ATOM      4  N   MET A   1       5.514  14.621  13.695  1.00 17.77           N
ATOM      5  CA  MET A   1       6.269  13.401  13.959  1.00 16.51           C
ATOM      6  C   MET A   1       6.702  13.319  15.400  1.00 16.41           C
ATOM      7  O   MET A   1       7.036  12.248  15.870  1.00 15.38           O
ATOM      8  CB  MET A   1       7.529  13.301  13.085  1.00 16.52           C
ATOM      9  CG  MET A   1       7.292  12.805  11.676  1.00 16.48           C
• Fields labelled on the slide, in column order: Atom nr • Atom name • Residue type • Chain name • Residue nr • X, Y, Z coordinates • Occupancy • “B-factor”
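Because the PDB format is column-based, coordinate records are best read by slicing fixed character positions rather than splitting on whitespace. The sketch below parses the fields named on this slide from two of the 1CBS records, using the standard ATOM/HETATM column layout. The function name and dictionary keys are illustrative, and the parser ignores fields not shown on the slide (alternate location, insertion code, charge).

```python
def parse_atom_line(line):
    """Parse one ATOM/HETATM record by fixed columns; return None for other records."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record,
        "serial": int(line[6:11]),         # atom nr, columns 7-11
        "name": line[12:16].strip(),       # atom name, columns 13-16
        "res_name": line[17:20].strip(),   # residue type, columns 18-20
        "chain": line[21],                 # chain name, column 22
        "res_seq": int(line[22:26]),       # residue nr, columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
    }

sample = [
    "HETATM    1  C   ACE A   0       4.279  14.829  14.190  1.00 19.08           C",
    "ATOM      5  CA  MET A   1       6.269  13.401  13.959  1.00 16.51           C",
]
atoms = [parse_atom_line(line) for line in sample]
```

Slicing by column keeps the parser correct when adjacent fields run together, for example when large coordinates leave no space between numbers.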

Protein Data Bank in Europe • PDBe • European node of wwPDB • Started 1996 as MSD at EBI • Deposition site since 1999 • Started EMDB in 2002 • PDBe operations • Handle deposition and annotation of PDB and EMDB entries • Build advanced structure databases • Build services for search, browsing, analysis • Liaise with broader structural biology community • Coordinate with other databases, e.g. UniProt • Funding • PDBe: Protein Data Bank in Europe. S. Velankar et al., Nucleic Acids Research, doi:10.1093/nar/gkp916

PDBe Deposition and Annotation • Checks • Is the format correct? • Are biopolymer sequences in biochemical entities consistent with 3D models? • Are hetero groups named correctly? • Where does the model deviate from expected geometry? • Record various types of information • Experiment: method, conditions, data resolution, space group, completeness etc. • Sample: source, expression system, engineered etc. • Refinement: program, target • AutoDep Deposition Tool

AutoDep provides valuable information to depositors • Validation of structure factors • EDS criteria • http://www.ebi.ac.uk/pdbe-xdep/autodep/index.jsp

AutoDep provides valuable information to depositors • Heterogen summary and validation against ideal representations of ligands

AutoDep provides valuable information to depositors • Oligomeric state - PQS • Sequence-structure alignment • UniProt, Pfam, InterPro

AutoDep provides valuable information to depositors • Revisions, withdrawal, release • Release sequence-only immediately • Release coordinates immediately • Hold for 1 year • Release after publication • Communication with depositors • Help depositors understand and conform to PDB standards • Discussing errors

PDBe Services • PISA, SSM / PDBeFold, PDBeMotif, PDBeChem, SIFTS, PDBeStatistics, PDBeSearch, PDBeView

PDBe Services PDBeView – the Atlas pages • http://www.ebi.ac.uk/pdbe-srv/view/

PDBe Services PDBeFold (SSM): has my fold been seen before in the PDB, or is it novel? • E. Krissinel and K. Henrick, Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Cryst. (2004). D60, 2256-2268.

PDBe Services PDBeFold (SSM) • Why compare structures? • Reveal conformational changes • Ligands, mutations, crystal packing, pH... • Judge structural variability • NMR ensembles, structure families • Discover common structural motifs • Identify fold • Infer function • Sequence alignments do not work well for distant evolutionary relationships • Structures diverge much more slowly than sequences • Structure improves quality of alignment • Better inference of function, e.g. when active sites match well • The relation between the divergence of sequence and structure in proteins. Chothia C, Lesk AM. EMBO J. 1986 Apr;5(4):823-6.

PDBe Services PDBeFold (SSM) algorithm • Match SSE graphs to get initial alignment • Iterative expansion of Cα-alignment • [Figure: two SSE graphs with helices H1-H6 and strands S1-S7]

PDBe Services PDBeFold (SSM) • SSM can carry out genuine multiple structure alignment to reveal a motif common to a family of structures

PDBe Services PDBePISA • What is the likely biological assembly of a given structure? • Can I learn about it from crystal packing of chains? • [Diagram: PDB file (ASU) + crystal symmetry → PISA → Biological Unit] • PISA: generate possible assemblies, rank according to free energy

PDBe Services PDBePISA • PDB entry 1P30: a monomer? • Biological unit 1P30: homotrimer!

PDBe Services PDBePISA • PDB entry 2TBV: a trimer? • Biological unit 2TBV: 180-mer!

PDBe Services PDBePISA

PDBe Services PDBePISA • PDB entry 1E94 • 2 biological units in 1E94: a dodecamer and a hexamer!

PDBe Services PDBeMotif • A very powerful engine to search PDB • Structure-sequence general searches • Chemical substructure • Predefined frequent motifs • Arbitrary secondary structure patterns • φ/ψ patterns • Protein sequences • Prosite motif, UniProt, CSA accessions • Raw sequence • Regular expression • Interactions between ligands, protein • Sequence distance between protein motifs • PDB header searches • Specialized searches • Environment around an interaction • Motif binding • Occurrence of a motif inside another • MSDmotif: exploring protein sites and motifs. Adel Golovin and Kim Henrick. BMC Bioinformatics 2008, 9:312

PDBe Services PDBeMotif: which motif does my substructure bind often? • Staurosporine: kinase inhibitor

PDBe Services PDBeMotif: which ligands and chemical fragments does my sequence motif bind? • Tyrosine protein kinase-specific active-site signature: [LIVMFYC]-{A}-[HY]-x-D-[LIVMFY]-[RSTAC]-{D}-{PF}-N-[LIVMFYC](3) • [Figure panels: Chemical fragments; Motif binding statistics]
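The signature above is written in PROSITE pattern syntax: elements are separated by "-", "x" matches any residue, "[...]" lists allowed residues, "{...}" lists forbidden residues, and "(n)" repeats the preceding element. A minimal sketch of translating such a pattern into a Python regular expression follows. The function name is illustrative, the test sequence is made up, and this is not how PDBeMotif itself is implemented.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into Python regex syntax."""
    regex = pattern.rstrip(".")                           # drop the optional final period
    regex = regex.replace("-", "")                        # element separators
    regex = regex.replace("{", "[^").replace("}", "]")    # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")     # repetition counts
    regex = regex.replace("x", ".")                       # any residue
    regex = regex.replace("<", "^").replace(">", "$")     # N- and C-terminal anchors
    return regex

SIGNATURE = "[LIVMFYC]-{A}-[HY]-x-D-[LIVMFY]-[RSTAC]-{D}-{PF}-N-[LIVMFYC](3)"
signature_regex = prosite_to_regex(SIGNATURE)

# Made-up sequence containing one occurrence of the signature.
hit = re.search(signature_regex, "AAALGHRDLAARNILVAAA")
# Changing the invariant D breaks the match.
miss = re.search(signature_regex, "AAALGHRALAARNILVAAA")
```

The order of the replacements matters: curly braces are converted to negated character classes before parentheses are turned into regex repetition braces.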

PDBe Services PDBeMotif: what does a sequence motif look like in 3D? • Tyrosine protein kinase-specific active-site signature: [LIVMFYC]-{A}-[HY]-x-D-[LIVMFY]-[RSTAC]-{D}-{PF}-N-[LIVMFYC](3) • [Figure panels: Sequence hits; 3D alignment]

PDBe Services PDBeMotif: which sequences often host a Ramachandran path? • 3D fragment • φ/ψ sequence: -156/-155, -103/17, -134/161 • Search • Sequence pattern
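A search for a Ramachandran path can be pictured as sliding the query φ/ψ sequence along a chain's backbone angles and keeping the windows where every angle agrees within a tolerance. The sketch below shows that idea with the query from this slide. The 30° tolerance, the function names, and the toy backbone are assumptions made for illustration; PDBeMotif's actual matching criteria are not described here.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (wraps at ±180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def find_phi_psi_path(backbone, query, tolerance=30.0):
    """Return start indices where consecutive (phi, psi) pairs match the query."""
    hits = []
    n = len(query)
    for start in range(len(backbone) - n + 1):
        window = backbone[start:start + n]
        if all(angle_diff(phi, q_phi) <= tolerance and angle_diff(psi, q_psi) <= tolerance
               for (phi, psi), (q_phi, q_psi) in zip(window, query)):
            hits.append(start)
    return hits

# Query taken from the slide: -156/-155, -103/17, -134/161
query = [(-156, -155), (-103, 17), (-134, 161)]

# Toy backbone angles (not from a real structure); residues 1-3 resemble the query.
backbone = [(-60, -45), (-150, -160), (-100, 20), (-140, 165), (-60, -45)]
path_hits = find_phi_psi_path(backbone, query)
```

Collecting the residue types at each hit position across many structures would then give the sequence pattern that most often hosts the path.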
