
Contents of this Talk

Contents of this Talk. [Used as intro to Genome Databases Seminar, 2002] Overview of bioinformatics Motivations for genome databases Analogy of virus reverse-eng to genome analysis Questions to ask of a genome DB. Overview of Genome Databases. Peter D. Karp, Ph.D. SRI International

Presentation Transcript


  1. Contents of this Talk • [Used as intro to Genome Databases Seminar, 2002] • Overview of bioinformatics • Motivations for genome databases • Analogy of virus reverse-eng to genome analysis • Questions to ask of a genome DB

  2. Overview of Genome Databases Peter D. Karp, Ph.D. SRI International pkarp@ai.sri.com www-db.stanford.edu/dbseminar/seminar.html

  3. Talk Overview • Definition of bioinformatics • Motivations for genome databases • Computer virus analogy • Issues in building genome databases

  4. Definition of Bioinformatics • Computational techniques for management and analysis of biological data and knowledge • Methods for disseminating, archiving, interpreting, and mining scientific information • Computational theories of biology • Genome Databases is a subfield of bioinformatics

  5. Motivations for Bioinformatics • Growth in molecular-biology knowledge (literature) • Genomics • Study of genomes through DNA sequencing • Industrial Biology

  6. Example Genomics Datatypes • Genome sequences • DOE Joint Genome Institute • 511M bases in Dec 2001 • 11.97G bases since Mar 1999 • Gene and protein expression data • Protein-protein interaction data • Protein 3-D structures

  7. Genome Databases • Experimental data • Archive experimental datasets • Retrieving past experimental results should be faster than repeating the experiment • Capture alternative analyses • Lots of data, simpler semantics • Computational symbolic theories • Complex theories become too large to be grasped by a single mind • The database is the theory • Biology is very much concerned with qualitative relationships • Less data, more complex semantics

  8. Bioinformatics • Distinct intellectual field at the intersection of CS and molecular biology • Distinct field because researchers in the field should know CS, biology, and bioinformatics • Spectrum from CS research to biology service • Rich source of challenging CS problems • Large, noisy, complex data-sets and knowledge-sets • Biologists and funding agencies demand working solutions

  9. Bioinformatics Research • algorithms + data structures = programs • algorithms + databases = discoveries • Combine sophisticated algorithms with the right content: • Properly structured • Carefully curated • Relevant data fields • Proper amount of data

  10. Goals of Systems Biology • Catalog the molecular parts lists of cells • Understand the function(s) of each part • Understand how those parts interact to produce the behavior of a cell or organism • Understand the evolution of those molecular parts

  11. Analogy: Genome Analysis and Virus Analysis • Given: Virus binary executable file for known machine architecture • Reverse engineer the program • Procedures • Call graph • Specifications for I/O behavior of the program and all procedures • Capture and publish an annotated analysis of the virus • Comparative analysis of related viruses

  12. Genome Analysis • Example: M. tuberculosis genome • Given: 4.4Mbp of DNA (genome) • Infer: • Molecular parts list of Mtb • A model of the biochemical machinery of Mtb cell • DNA is a blueprint for the program of life

  13. Start • Virus: 4.4 Mbyte binary program • Genome: 4.4 Mbp DNA sequence

  14. Step 1 • Virus: Distinguish code from data segments; find procedure boundaries • Genome: Distinguish coding from non-coding regions (gene finding)
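
As a rough illustration of what "distinguish coding from non-coding regions" means computationally, the sketch below scans the forward strand for open reading frames (an ATG followed in-frame by a stop codon). This is a deliberately naive stand-in, not the method used in the talk and not a real gene finder: the function name `find_orfs` and the `min_codons` threshold are assumptions made for this example, and the reverse strand is ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    An ORF here is an ATG followed, in the same reading frame, by the
    first stop codon. `end` is exclusive and includes the stop codon.
    `min_codons` (hypothetical threshold) counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # resume scanning after the stop codon
    return orfs


# Toy sequence invented for this example, not real genomic data.
print(find_orfs("CCATGAAATTTGGGTAACC"))
```

Practical gene finders rely on statistical models of codon usage and sequence signals rather than a bare start/stop scan, so this only conveys the shape of the problem.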

  15. Step 2 • Virus: Predict semantics of procedures • Genome: Predict gene functions • [Diagram: nodes A, B, C, D]

  16. Step 3 • Virus: Predict procedure call graph • Genome: Predict biochemical and gene networks • [Diagram: nodes A, B, C, D]

  17. Step 4 • Virus: Predict conditions under which procedures are invoked • Genome: Predict expression of network fragments • [Diagram: nodes A, B, C, D, Q, R, S]

  18. Step 5 • Virus: Infer complete program specification • Genome: Formulate dynamic cellular simulation

  19. Step 6 • Virus: Internet publishing of structured program annotation with explanations, references, commentary • Genome: Internet publishing of structured genome annotation with explanations, references, commentary

  20. Step 7 • Virus: Comparative analysis of viruses; evolutionary relationships among viruses • Genome: Comparative analysis of genomes; evolutionary relationships among genomes

  21. Step 8 • Virus: Identify measures to disable virus or prevent its spread • Genome: Identify target proteins for anti-microbial drug discovery • [Diagram: nodes A, B, C, D, Q, R, S]

  22. Database of Viruses • Create a database that stores • Binaries for all viruses • All annotation of virus programs by different investigators • Comparative analyses • Support • Remote API access • Click-at-a-time browsing

  23. Reference on Major Genome Databases • Nucleic Acids Research Database Issue • http://nar.oupjournals.org/content/vol30/issue1/ • 112 databases

  24. Questions to Ask of a New Genome Database

  25. What are Database Goals and Requirements? • How many users? • What expertise do users have? • What problems will database be used to solve?

  26. What is its Organizing Principle? • Different DBs partition the space of genome information in different dimensions • Experimental methods (Genbank, PDB) • Organism (EcoCyc, Flybase)

  27. What is its Level of Interpretation? • Laboratory data • Primary literature (Genbank) • Review (SwissProt, MetaCyc) • Does DB model disagreement?

  28. What are its Semantics and Content? • What entities and relationships does it model? • How does its content overlap with similar DBs? • How many entities of each type are present? • Sparseness of attributes and statistics on attribute values
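
One way to make "sparseness of attributes" concrete is to report, for each attribute, the fraction of records in which it is missing or empty. The sketch below assumes records are plain dictionaries; the function name `attribute_sparseness` and the sample records and field names are invented for illustration and do not come from any particular database.

```python
from collections import Counter


def attribute_sparseness(records, attributes):
    """Fraction of records in which each attribute is missing or empty."""
    total = len(records)
    missing = Counter()
    for rec in records:
        for attr in attributes:
            # Treat an absent key, None, "" and [] as "no value recorded".
            if rec.get(attr) in (None, "", []):
                missing[attr] += 1
    return {a: (missing[a] / total if total else 0.0) for a in attributes}


# Hypothetical gene records with placeholder values.
records = [
    {"name": "geneA", "product": "example product 1", "ec_number": None},
    {"name": "geneB", "product": "", "ec_number": None},
    {"name": "geneC", "product": "example product 2", "ec_number": "x.x.x.x"},
    {"name": "geneD"},
]

stats = attribute_sparseness(records, ["name", "product", "ec_number"])
print(stats)
```

A report like this tells a prospective user which fields can be relied on for queries and which are filled in too rarely to be useful.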

  29. What are Sources of its Data? • Potential information sources • Laboratory instruments • Scientific literature • Manual entry • Natural-language text mining • Direct submission from the scientific community • Genbank • Modification policy • DB staff only • Submission of new entries by scientific community • Update access by scientific community

  30. What DBMS is Employed? • None • Relational • Object oriented • Frame knowledge representation system

  31. Distribution / User Access • Multiple distribution forms enhance access • Browsing access with visualization tools • API • Portability

  32. What Validation Approaches are Employed? • None • Declarative consistency constraints • Programmatic consistency checking • Internal vs external consistency checking • What types of systematic errors might DB contain?
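
The difference between declarative constraints and programmatic checking can be sketched with Python's standard `sqlite3` module. The `gene` table, its columns, and the rule that same-strand overlaps should be flagged for review are all assumptions made for this example; real genomes do contain overlapping genes, so the second check is a curation aid and not a hard error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Declarative constraints: the DBMS rejects rows that violate them.
conn.execute("""
    CREATE TABLE gene (
        name      TEXT PRIMARY KEY,
        start_pos INTEGER NOT NULL CHECK (start_pos >= 1),
        end_pos   INTEGER NOT NULL,
        strand    TEXT NOT NULL CHECK (strand IN ('+', '-')),
        CHECK (end_pos >= start_pos)
    )
""")


def insert_gene(name, start, end, strand):
    """Insert a row; return False if a declared constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO gene VALUES (?, ?, ?, ?)",
                (name, start, end, strand),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def overlapping_same_strand():
    """Programmatic check: pairs of genes that overlap on one strand."""
    return conn.execute("""
        SELECT a.name, b.name
          FROM gene a JOIN gene b
            ON a.name < b.name
           AND a.strand = b.strand
           AND a.start_pos <= b.end_pos
           AND b.start_pos <= a.end_pos
         ORDER BY a.name, b.name
    """).fetchall()


ok_1 = insert_gene("g1", 100, 400, "+")
bad_coords = insert_gene("g2", 500, 450, "+")   # end before start
ok_3 = insert_gene("g3", 350, 600, "+")
ok_4 = insert_gene("g4", 350, 600, "-")
bad_strand = insert_gene("g5", 1, 10, "x")      # strand not in vocabulary
print(overlapping_same_strand())
```

Both checks here are internal: they compare the database against itself. External consistency checking would compare its contents against an independent source.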

  33. Database Documentation • Schema and its semantics • Format • API • Data acquisition techniques • Validation techniques • Size of different classes • Coverage of subject matter • Sparseness of attributes • Error rates

  34. Relationship of Database Field to Bioinformatics • Scientists generally ignorant of basic DB principles • Complex queries vs click-at-a-time access • Data model • Defined semantics for DB fields • Controlled vocabularies • Regular syntax for flatfiles • Automated consistency checking • Most biologists take one programming class • Evolution of typical genome database • Finer points of DB research off their radar screen • Handful of DB researchers work in bioinformatics
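
To show why a regular flatfile syntax and controlled vocabularies matter, the sketch below parses a made-up attribute-value format (`KEY - value` lines, records terminated by `//`) and reports field values that fall outside an allowed vocabulary. The format, the field names `ID` and `TYPE`, and the vocabulary are hypothetical and chosen only for this example.

```python
def parse_flatfile(text):
    """Parse 'KEY - value' lines into records separated by '//' lines."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "//":
            if current:
                records.append(current)
            current = {}
            continue
        key, sep, value = line.partition(" - ")
        if not sep:
            # A regular syntax lets malformed lines be detected mechanically.
            raise ValueError(f"malformed line: {line!r}")
        current.setdefault(key.strip(), []).append(value.strip())
    if current:
        records.append(current)
    return records


def invalid_terms(records, field, vocabulary):
    """Return (record ID, value) pairs whose value is not in the vocabulary."""
    bad = []
    for rec in records:
        rec_id = rec.get("ID", ["?"])[0]
        for value in rec.get(field, []):
            if value not in vocabulary:
                bad.append((rec_id, value))
    return bad


sample = """
# hypothetical flatfile
ID - rec1
TYPE - enzyme
//
ID - rec2
TYPE - enzyem
//
"""

records = parse_flatfile(sample)
print(invalid_terms(records, "TYPE", {"enzyme", "transporter"}))
```

With free-form text and no vocabulary, the misspelled entry would silently drop out of any query on `TYPE`; with both in place, it is caught automatically.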

  35. Database Field • For many years, the majority of bioinformatics DBs did not employ a DBMS • Flatfiles were the rule • Scientists want to see the data directly • Commercial DBMSs too expensive, too complex • DBAs too expensive • Most scientists do not understand • Differences between BA, MS, PhD in CS • CS research vs applications • Implications for project planning, funding, bioinformatics research

  36. Recommendation • Teaching scientists programming is not enough • Teaching scientists how to build a DBMS is irrelevant • Teach scientists basic aspects of databases and symbolic computing • Database requirements analysis • Data models, schema design • Knowledge representation, ontologies • Formal grammars • Complex queries • Database interoperability
