
Genome, Protein and Model Organism Databases

Genome, Protein and Model Organism Databases. Anne Estreicher Swiss-Prot Group Swiss Institute of Bioinformatics Geneva – Switzerland Anne.Estreicher@isb-sib.ch. Bioinformatic and Comparative Genome Analysis Course HKU-Pasteur Research Centre - Hong Kong, China August 17 - August 29, 2009.


Presentation Transcript


  1. Genome, Protein and Model Organism Databases • Anne Estreicher, Swiss-Prot Group, Swiss Institute of Bioinformatics, Geneva – Switzerland, Anne.Estreicher@isb-sib.ch • Bioinformatic and Comparative Genome Analysis Course, HKU-Pasteur Research Centre - Hong Kong, China, August 17 - August 29, 2009

  2. Outline • Introduction (definitions, history…) • From DNA sequence to genomic tools • The flow of information: from DNA to proteins • Protein sequence databases • MODs at a glance

  3. What is a database? A collection of related data that are • structured • searchable • updated periodically • cross-referenced. Also includes the associated tools necessary for access/query, download, etc.

  4. Why do we need databases? • Data need to be stored, curated and made available for analysis and knowledge discovery • Efficient way of sharing data, independently of regular publications • Essential resources for both experimental and computational biologists

  5. Databases in biology: not a new issue… • 1954 First protein sequence (insulin by F. Sanger) • 1965 Atlas of Protein Sequence and Structure (65 proteins)

  6. The first protein sequence "database" by Margaret Dayhoff (1965) contained 65 proteins

  7. Databases: not a new issue… • 1954 First protein sequence (insulin by F. Sanger) • 1965 Atlas of Protein Sequence and Structure (65 proteins) • Mid 70s Improvements in DNA sequencing • 1979 Los Alamos Sequence Library (Walter Goad) • 1980 ~80 genes fully sequenced -> Need to store the data and make them available for analysis (in a format readable by both humans and machines) -> ARCHIVE -> RACE for the central position in life sciences… And the winner is…

  8. Databases: not a new issue… • EMBL-Bank - Europe, 1980 • GenBank - USA, 1982 • DDBJ - Asia, 1986 • leading to the establishment of the INSDC (International Nucleotide Sequence Database Collaboration) -> daily exchanges of data

  9. www.insdc.org

  10. EMBL-Bank - GenBank - DDBJ • Main resources for DNA and RNA sequences • Sequences used to be retrieved from publications -> now direct submissions from individual researchers, genome sequencing projects and patent applications • “Journal publishers generally require sequence deposition prior to publication so that an accession number can be included in the paper.” 1. True for nucleic acid, not for protein sequences; 2. Not always put into practice => Sequences that are not submitted are LOST! • Archives (primary databases): data belong to the submitters

  11. EMBL-Bank - GenBank - DDBJ • Archives (primary databases) => data belong to the submitter • Minimal checks, such as vector contamination • Annotation by the submitters

  12. Databases: not a new issue… • 1954 First protein sequence (insulin by F. Sanger) • 1965 Atlas of Protein Sequence and Structure (65 proteins) • 1979 Los Alamos Sequence Library (Walter Goad) – DNA • 1982 EMBL-Bank – DNA • 1984 GenBank – DNA • 1986 DDBJ – DNA

  13. Databases: not a new issue… • 1954 First protein sequence (insulin by F. Sanger) • 1965 Atlas of Protein Sequence and Structure (65 proteins) • 1979 Los Alamos Sequence Library (Walter Goad) – DNA • 1982 EMBL-Bank – DNA • 1984 GenBank – DNA • 1986 DDBJ – DNA -> ARCHIVES (primary databases) may not be sufficient -> need to annotate the data to produce KNOWLEDGE • 1986 Swiss-Prot – protein sequences – a paradigm for annotated (secondary) databases

  14. The Swiss-Prot concept • Non-redundant: protein products of 1 gene / 1 species -> 1 entry • Manually annotated (=> curator judgement on data!) • Highly cross-referenced: 1st life-science database to provide cross-references (links to > 130 databases from www.uniprot.org)
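
The "1 gene / 1 species -> 1 entry" rule can be pictured as a grouping step. The following is only a minimal sketch: the record fields (`gene`, `species`, `product`), the function name and the example values are hypothetical and do not reflect the actual Swiss-Prot data model or curation workflow.

```python
from collections import defaultdict

def merge_by_gene_and_species(records):
    """Group protein records so that each (gene, species) pair yields one entry.

    `records` is an iterable of dicts with the hypothetical keys
    'gene', 'species' and 'product'.
    """
    entries = defaultdict(list)
    for rec in records:
        key = (rec["gene"], rec["species"])
        # Keep each distinct product once inside its entry.
        if rec["product"] not in entries[key]:
            entries[key].append(rec["product"])
    return dict(entries)

# Made-up records, for illustration only.
records = [
    {"gene": "GENE1", "species": "Homo sapiens", "product": "isoform 1"},
    {"gene": "GENE1", "species": "Homo sapiens", "product": "isoform 2"},
    {"gene": "GENE1", "species": "Mus musculus", "product": "isoform 1"},
]
entries = merge_by_gene_and_species(records)
```

Here the two human products end up in a single entry, while the mouse product gets its own entry because the species differs.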

  15. Databases: not a new issue… • 1954 First protein sequence (insulin by F. Sanger) • 1965 Atlas of Protein Sequence and Structure (65 proteins) • 1979 Los Alamos Sequence Library (Walter Goad) – DNA • 1982 EMBL-Bank – DNA • 1984 GenBank – DNA; Protein Information Resource (PIR) – protein sequences • 1986 DDBJ – DNA; Swiss-Prot – protein sequences • 1996 TrEMBL (Translated EMBL) – protein sequences; complement to Swiss-Prot to cope with the increasing number of new sequences; AUTOMATIC ANNOTATION!

  16. UniProtKB/Swiss-Prot growth (plot: number of entries vs. release number) • 1986: 3’939 entries • 1996, creation of TrEMBL: Swiss-Prot 52’205 entries, TrEMBL 61’137 entries • Swiss-Prot rel. 57.5 (07-Jul-2009): 470’369 entries

  17. UniProtKB growth (plot: number of entries vs. release number, 1986 - 1996 - 2009; Swiss-Prot: manual curation, TrEMBL: automated curation) • TrEMBL rel. 40.5 (07-Jul-2009): 8’594’382 entries • Swiss-Prot rel. 57.5 (07-Jul-2009): 470’369 entries • TrEMBL growth (sequences/day) • 2004: 1’500 • 2006-2007: 3’500 • >5’000 • ~8’000

  18. New challenge • Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery

  19. (R)evolution of these last 20 years • Life sciences used to be rich in hypotheses, well-off in knowledge and poor in data; today they are very rich in data, not so well-off in knowledge and very poor in hypotheses. • Diagram labels: List of parts, Complex system, ?

  20. Science (1993) 262, 502

  21. Danger ! EMBL Database Growth http://www.ebi.ac.uk/embl/Services/DBStats/

  22. http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat • In 4 months: 374 new genomes, and 77 were completed => ~100 genomes/month (in 2008 -> ~50 genomes/month) • + ~2’360 viral (& viroid) genomes => Total: ~5’600 genomes

  23. http://genomesonline.org/index2.htm

  24. http://www.genomesonline.org/gold.cgi

  25. http://www.genomesonline.org/gold.cgi

  26. Metagenomics: study of genetic material recovered directly from environmental samples • Global Ocean Sampling (C. Venter) • Whale fall • Soil, sand beach, New York air, … • Human fluids, mouse gut, … • Venter’s Sorcerer II

  27. Flood in the world of proteins… • 1965: first protein sequence "database" by Margaret Dayhoff (65 proteins) • July 2009: ~20 million unique protein sequences (source: UniParc - http://www.uniprot.org/uniparc/) • UniParc: non-redundant database that contains most of the publicly available protein sequences in the world. It includes sequences from the EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, Ensembl, FlyBase, H-Invitational Database (H-Inv), International Protein Index (IPI), Patent Offices (EPO, JPO and USPTO), PIR-PSD, Protein Data Bank (PDB), Protein Research Foundation (PRF), RefSeq, Saccharomyces Genome Database (SGD), The Arabidopsis Information Resource (TAIR), TROME, UniProtKB/Swiss-Prot and TrEMBL, Vertebrate Genome Annotation database (VEGA) and WormBase.
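
The idea of a non-redundant sequence archive can be sketched as "store each distinct sequence once and remember every source that reported it". The example below is an illustration under stated assumptions: the function name, the database names and the identifiers are invented, and the real UniParc implementation is not described in these slides.

```python
def build_nonredundant_archive(source_records):
    """Store each distinct sequence once and record where it was seen.

    `source_records` is an iterable of (database, identifier, sequence)
    tuples; all values used below are invented for illustration.
    """
    archive = {}
    for database, identifier, sequence in source_records:
        # Normalise case so identical sequences compare equal.
        seq = sequence.upper()
        archive.setdefault(seq, []).append((database, identifier))
    return archive

source_records = [
    ("DatabaseA", "A001", "MKTAYIAK"),
    ("DatabaseB", "B042", "mktayiak"),  # same sequence, different source
    ("DatabaseB", "B043", "MKTAYIAR"),  # differs by one residue
]
archive = build_nonredundant_archive(source_records)
```

Three source records collapse into two archive entries, and the shared sequence keeps a cross-reference to both of its sources.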

  28. New challenge • Flood of data • Flood of databases…

  29. The 1st NAR issue of each year is always dedicated to databases, and a "clean" list of databases is provided (! not exhaustive !)

  30. The NAR Online Molecular Biology Database collection in 2009 A total of 1’170 databases (19 obsolete removed) http://www.oxfordjournals.org/nar/database/a/

  31. NAR "clean" list of databases http://www.oxfordjournals.org/nar/database/a/

  32. Most recent NAR paper about the database (not available for all db, some described in other journals)

  33. A "clean" list of databases can be found in the NAR online molecular biology database collection http://www.oxfordjournals.org/nar/database/a/

  34. BIOLOGICAL DATABASE CATEGORIES • Databases of nucleic acid sequences (RNA, DNA) • Databases of protein sequences • Databases of protein motifs and protein domains • Databases of structures • Databases of genomes • Databases of genes • Databases of expression profiles • Databases of SNPs and mutations • Databases of metabolic pathways • Databases of protein interactions • Databases of taxonomy • … Databases containing sequences or data directly derived from sequences.

  35. DNA sequences : What ? Where ? How ? & genomic tools NCBI UCSC

  36. GenBank entry AF415175 http://www.ncbi.nlm.nih.gov/nuccore/16589063 • Annotated fields: Accession number (stable; should always be cited in publications), Molecule type, Date of submission, Definition, Nucleotide sequence • Possible molecule types: genomic DNA and RNA; mRNA; other DNA and RNA; rRNA; transcribed RNA; tRNA; unassigned DNA and RNA; viral cRNA

  37. Annotated fields: Accession number, Molecule type, Date of submission, Definition, Taxonomy, Nucleotide sequence

  38. Annotated fields: Accession number, Molecule type, Date of submission, Definition, Taxonomy, References, Nucleotide sequence

  39. Annotated fields: Accession number, Molecule type, Date of submission, Definition, Taxonomy, References, Features, Nucleotide sequence • Features: information provided by the submitter; may include annotation of the sequence (Organism, Molecule type, Chromosomal location, Tissue type, Gene name) • CDS annotation => protein sequence + Protein IDentifier (PID: stable identifier & version number)
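
The header fields listed above (accession number, definition, molecule type, date) can be read from a flat-file record with a few lines of code. This is a minimal sketch, assuming the usual GenBank-style layout in which a keyword starts at column 1 and continuation lines are indented. The record below is made up for illustration (it is not the content of AF415175), and `parse_header_fields` is a hypothetical helper, not part of any library.

```python
EXAMPLE_RECORD = """\
LOCUS       XX000000     12 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Made-up example record,
            not a real database entry.
ACCESSION   XX000000
SOURCE      Homo sapiens (human)
FEATURES             Location/Qualifiers
     CDS             1..12
//
"""

def parse_header_fields(flatfile_text):
    """Collect top-level header keywords up to the FEATURES block."""
    fields = {}
    current = None
    for line in flatfile_text.splitlines():
        if line.startswith("FEATURES") or line.startswith("ORIGIN"):
            break  # the feature table and the sequence are not parsed here
        if line and not line[0].isspace():
            # A new keyword starts in the first column.
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
        elif current is not None and line.strip():
            # Indented line: continuation of the previous keyword.
            fields[current] += " " + line.strip()
    return fields

fields = parse_header_fields(EXAMPLE_RECORD)
```

In this sketch the molecule type and the date stay inside the `LOCUS` value; splitting that line on whitespace would separate them.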

  40. Gives access to the nucleic acid sequence of the CDS (not of the entire mRNA) Protein sequence
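
The link between a CDS and its protein sequence is a conceptual translation, codon by codon. The sketch below assumes the standard genetic code (other codes exist, for example mitochondrial ones) and uses an invented 9-nucleotide sequence; `translate_cds` is a hypothetical helper, not the procedure used by the databases themselves.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_cds(cds):
    """Translate a coding sequence until the first stop codon.

    Accepts DNA or RNA letters; ambiguous bases such as N are not
    handled and raise a KeyError in this sketch.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":  # stop codon ends the translation
            break
        protein.append(aa)
    return "".join(protein)

peptide = translate_cds("ATGGCCTAA")  # invented CDS: Met, Ala, stop
```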

  41. "Features" may provide much more information, depending on the sequence and the submitter… Example: 3’ end of chromosome Y, EMBL #AJ271736

  42. Very similar view, links and options from the 3 sites: EMBL-Bank – GenBank - DDBJ http://www.ebi.ac.uk/embl/ http://www.ncbi.nlm.nih.gov/ http://www.ddbj.nig.ac.jp/

  43. How to find a DNA sequence at the NCBI…

  44. http://www.ncbi.nlm.nih.gov/

  45. Databases @ NCBI http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html The Entrez system: integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others => Maximal interconnectivity

  46. Databases @ NCBI http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html

  47. Simple search with an EMBL-Bank/GenBank/DDBJ accession number
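
The slides show the search through the NCBI web pages. The same lookup by accession number can also be scripted through NCBI's E-utilities `efetch` service, which the slides do not mention. The sketch below only builds the request URL for the accession AF415175 used earlier; the helper name `build_efetch_url` and the choice of default parameters are assumptions made for illustration.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build an efetch URL that requests one record as a GenBank flat file."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    )
    return f"{EFETCH_URL}?{query}"

url = build_efetch_url("AF415175")

# To download the record (network access required), something like:
#   from urllib.request import urlopen
#   with urlopen(url) as response:
#       record_text = response.read().decode()
```

Keeping URL construction separate from the download makes the request easy to inspect before any network call is made.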
