Contents of this talk

Contents of this Talk. [Used as intro to Genome Databases Seminar, 2002] Overview of bioinformatics Motivations for genome databases Analogy of virus reverse-eng to genome analysis Questions to ask of a genome DB. Overview of Genome Databases. Peter D. Karp, Ph.D. SRI International

Contents of this talk
Contents of this Talk

  • [Used as intro to Genome Databases Seminar, 2002]

  • Overview of bioinformatics

  • Motivations for genome databases

  • Analogy of virus reverse-engineering to genome analysis

  • Questions to ask of a genome DB

Overview of genome databases

Overview of Genome Databases

Peter D. Karp, Ph.D.

SRI International

Talk overview
Talk Overview

  • Definition of bioinformatics

  • Motivations for genome databases

  • Computer virus analogy

  • Issues in building genome databases

Definition of bioinformatics
Definition of Bioinformatics

  • Computational techniques for management and analysis of biological data and knowledge

    • Methods for disseminating, archiving, interpreting, and mining scientific information

  • Computational theories of biology

  • Genome databases are a subfield of bioinformatics

Motivations for bioinformatics
Motivations for Bioinformatics

  • Growth in molecular-biology knowledge (literature)

  • Genomics

    • Study of genomes through DNA sequencing

    • Industrial Biology

Example genomics datatypes
Example Genomics Datatypes

  • Genome sequences

    • DOE Joint Genome Institute

      • 511M bases in Dec 2001

      • 11.97G bases since Mar 1999

  • Gene and protein expression data

  • Protein-protein interaction data

  • Protein 3-D structures

Genome databases
Genome Databases

  • Experimental data

    • Archive experimental datasets

    • Retrieving past experimental results should be faster than repeating the experiment

    • Capture alternative analyses

    • Lots of data, simpler semantics

  • Computational symbolic theories

    • Complex theories become too large to be grasped by a single mind

    • The database is the theory

    • Biology is very much concerned with qualitative relationships

    • Less data, more complex semantics


  • Distinct intellectual field at the intersection of CS and molecular biology

  • Distinct field because researchers in the field should know CS, biology, and bioinformatics

  • Spectrum from CS research to biology service

  • Rich source of challenging CS problems

  • Large, noisy, complex data-sets and knowledge-sets

  • Biologists and funding agencies demand working solutions

Bioinformatics research
Bioinformatics Research

  • algorithms + data structures = programs

  • algorithms + databases = discoveries

  • Combine sophisticated algorithms with the right content:

    • Properly structured

    • Carefully curated

    • Relevant data fields

    • Proper amount of data

Goals of systems biology
Goals of Systems Biology

  • Catalog the molecular parts lists of cells

  • Understand the function(s) of each part

  • Understand how those parts interact to produce the behavior of a cell or organism

  • Understand the evolution of those molecular parts

Analogy genome analysis and virus analysis
Analogy: Genome Analysis and Virus Analysis

  • Given: Virus binary executable file for known machine architecture

  • Reverse engineer the program

    • Procedures

    • Call graph

    • Specifications for I/O behavior of the program and all procedures

  • Capture and publish an annotated analysis of the virus

  • Comparative analysis of related viruses

Genome analysis
Genome Analysis

  • Example: M. tuberculosis genome

  • Given: 4.4Mbp of DNA (genome)

  • Infer:

    • Molecular parts list of Mtb

    • A model of the biochemical machinery of Mtb cell

  • DNA is a blueprint for the program of life


4.4Mbyte binary program

4.4Mbp DNA sequence

Step 1
Step 1

Distinguish code from data segments

Find procedure boundaries

Distinguish coding from non-coding regions: gene finding
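As a rough illustration of what "distinguish coding from non-coding regions" can mean computationally, the sketch below scans the forward strand for open reading frames (an ATG start codon followed in-frame by a stop codon). It is a deliberately naive stand-in, not the gene-finding method used in the talk; real gene finders use statistical models and both strands. The function name `find_orfs` and the `min_codons` threshold are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    An ORF here is ATG ... stop codon, read in one of three frames.
    `end` is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the current candidate start codon, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it is long enough (stop codon included).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

For example, `find_orfs("CCATGAAATTTTAGGG")` reports a single ORF covering `ATGAAATTTTAG`.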

Step 2
Step 2

Predict semantics of procedures

Predict gene functions

Step 3
Step 3

Predict procedure call graph

Predict biochemical and gene networks

Step 4
Step 4

Predict conditions under which procedures are invoked

Predict expression of network fragments

Step 5
Step 5

Infer complete program specification

Formulate dynamic cellular simulation

Step 6
Step 6

Internet publishing of structured program annotation with explanations, references,

Internet publishing of structured genome annotation with explanations, references,

Step 7
Step 7

Comparative analysis of viruses

Evolutionary relationships among viruses

Comparative analysis of genomes

Evolutionary relationships among genomes

Step 8
Step 8

Identify measures to disable virus or prevent its spread

Identify target proteins for anti-microbial drug discovery

Database of viruses
Database of Viruses

  • Create a database that stores

    • Binaries for all viruses

    • All annotation of virus programs by different investigators

    • Comparative analyses

  • Support

    • Remote API access

    • Click-at-a-time browsing

Reference on major genome databases
Reference on Major Genome Databases

  • Nucleic Acids Research Database Issue

    • 112 databases

What are database goals and requirements
What are Database Goals and Requirements?

  • How many users?

  • What expertise do users have?

  • What problems will database be used to solve?

What is its organizing principle
What is its Organizing Principle?

  • Different DBs partition the space of genome information in different dimensions

  • Experimental methods (Genbank, PDB)

  • Organism (EcoCyc, Flybase)

What is its level of interpretation
What is its Level of Interpretation?

  • Laboratory data

  • Primary literature (Genbank)

  • Review (SwissProt, MetaCyc)

  • Does DB model disagreement?

What are its semantics and content
What are its Semantics and Content?

  • What entities and relationships does it model?

  • How does its content overlap with similar DBs?

  • How many entities of each type are present?

  • Sparseness of attributes and statistics on attribute values
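The last two questions (entity counts per type, and sparseness of attributes) can be answered mechanically once the entries are loaded. The sketch below is one possible way to do it, assuming each entry is a plain dictionary with a `type` key; that record layout and the function name `attribute_fill_rates` are invented for this example and do not correspond to any particular genome database.

```python
from collections import Counter

def attribute_fill_rates(entities):
    """Count entities per type and compute how often each attribute is filled.

    Returns (counts, rates), where rates[(type, attribute)] is the fraction
    of entities of that type with a non-empty value for the attribute.
    Attributes that are never filled for a type do not appear in `rates`.
    """
    counts = Counter(e["type"] for e in entities)
    filled = Counter()
    for e in entities:
        for attr, value in e.items():
            if attr != "type" and value not in (None, ""):
                filled[(e["type"], attr)] += 1
    rates = {key: n / counts[key[0]] for key, n in filled.items()}
    return counts, rates

# Hypothetical entries, for illustration only.
entries = [
    {"type": "gene", "name": "geneA", "product": "proteinA"},
    {"type": "gene", "name": "geneB", "product": None},
    {"type": "protein", "name": "proteinA"},
]
counts, rates = attribute_fill_rates(entries)
```

A low fill rate for an attribute is a hint that queries relying on it will silently miss most entries.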

What are sources of its data
What are Sources of its Data?

  • Potential information sources

    • Laboratory instruments

    • Scientific literature

      • Manual entry

      • Natural-language text mining

    • Direct submission from the scientific community

      • Genbank

  • Modification policy

    • DB staff only

    • Submission of new entries by scientific community

    • Update access by scientific community

What dbms is employed
What DBMS is Employed?

  • None

  • Relational

  • Object oriented

  • Frame knowledge representation system
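For readers unfamiliar with the last option, a frame system stores each entity as a frame with named slots, and a frame inherits slot values from its parent class frame. The following is a minimal sketch of that idea only; the `Frame` class and the example frames are assumptions made for illustration and are not the API of any real frame knowledge representation system.

```python
class Frame:
    """A named frame with slots; missing slots are looked up in the parent."""

    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.slots = dict(slots)

    def get(self, slot):
        # Walk up the parent chain until some frame defines the slot.
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(slot)

# Hypothetical class hierarchy: Polypeptide -> Enzyme -> one instance frame.
polypeptide = Frame("Polypeptide", polymer_of="amino acids")
enzyme = Frame("Enzyme", parent=polypeptide, role="catalyst")
example_enzyme = Frame("ExampleEnzyme", parent=enzyme, gene="geneA")
```

Here `example_enzyme.get("polymer_of")` is resolved two levels up, which is the kind of inheritance that a purely relational table layout has to encode by hand.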

Distribution user access
Distribution / User Access

  • Multiple distribution forms enhance access

  • Browsing access with visualization tools

  • API

  • Portability

What validation approaches are employed
What Validation Approaches are Employed?

  • None

  • Declarative consistency constraints

  • Programmatic consistency checking

  • Internal vs external consistency checking

  • What types of systematic errors might DB contain?
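To make the distinction between declarative constraints and programmatic checking concrete, the sketch below states the constraints as a list of labeled rules and runs them with a small checking loop. It covers internal consistency only. The database layout (a dictionary of genes and a set of protein identifiers), the rule set, and the function name `check` are all assumptions made for this example.

```python
# Each constraint is (label, rule); a rule returns True when the gene is valid.
CONSTRAINTS = [
    ("gene has a name", lambda db, g: bool(g.get("name"))),
    ("start < end", lambda db, g: g["start"] < g["end"]),
    ("product exists in proteins",
     lambda db, g: g.get("product") is None or g["product"] in db["proteins"]),
]

def check(db):
    """Return a list of (gene_id, violated_constraint_label) pairs."""
    violations = []
    for gene_id, gene in db["genes"].items():
        for label, rule in CONSTRAINTS:
            if not rule(db, gene):
                violations.append((gene_id, label))
    return violations

# Hypothetical data: g2 breaks every rule.
db = {
    "proteins": {"P1"},
    "genes": {
        "g1": {"name": "geneA", "start": 1, "end": 10, "product": "P1"},
        "g2": {"name": "", "start": 20, "end": 5, "product": "P9"},
    },
}
violations = check(db)
```

Keeping the rules in a data structure rather than scattered through code makes it easy to list them in the database documentation.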

Database documentation
Database Documentation

  • Schema and its semantics

  • Format

  • API

  • Data acquisition techniques

  • Validation techniques

  • Size of different classes

  • Coverage of subject matter

  • Sparseness of attributes

  • Error rates

Relationship of database field to bioinformatics
Relationship of Database Field to Bioinformatics

  • Scientists generally ignorant of basic DB principles

    • Complex queries vs click-at-a-time access

    • Data model

    • Defined semantics for DB fields

    • Controlled vocabularies

    • Regular syntax for flatfiles

    • Automated consistency checking

  • Most biologists take one programming class

  • Evolution of typical genome database

  • Finer points of DB research off their radar screen

  • Handful of DB researchers work in bioinformatics
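The "regular syntax for flatfiles" point above is easiest to appreciate with a parser in hand. The sketch below assumes a made-up format in which every line is `KEY - value` and every record ends with a line containing only `//`; this format and the function name `parse_flatfile` are hypothetical. Because the syntax is regular, the parser can reject malformed input instead of guessing.

```python
def parse_flatfile(text):
    """Parse 'KEY - value' lines into records terminated by '//'.

    Each record is a dict mapping a key to the list of its values,
    so repeated keys (e.g. several synonyms) are preserved.
    """
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "//":
            records.append(current)
            current = {}
            continue
        key, sep, value = line.partition(" - ")
        if not sep:
            raise ValueError(f"malformed line: {line!r}")
        current.setdefault(key, []).append(value)
    if current:
        raise ValueError("last record is missing its // terminator")
    return records

sample = "ID - g1\nNAME - geneA\nSYNONYM - a1\nSYNONYM - a2\n//\nID - g2\n//\n"
records = parse_flatfile(sample)
```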

Database field
Database Field

  • For many years, the majority of bioinformatics DBs did not employ a DBMS

    • Flatfiles were the rule

    • Scientists want to see the data directly

    • Commercial DBMSs too expensive, too complex

    • DBAs too expensive

  • Most scientists do not understand

    • Differences between BA, MS, PhD in CS

    • CS research vs applications

    • Implications for project planning, funding, bioinformatics research


  • Teaching scientists programming is not enough

  • Teaching scientists how to build a DBMS is irrelevant

  • Teach scientists basic aspects of databases and symbolic computing

    • Database requirements analysis

    • Data models, schema design

    • Knowledge representation, ontologies

    • Formal grammars

    • Complex queries

    • Database interoperability
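As an illustration of the "complex queries" item, the sketch below answers a question that would take many page visits with click-at-a-time browsing: which genes catalyze reactions in more than one pathway? It uses Python's standard `sqlite3` module with an in-memory database. The three-table schema and all identifiers in it are invented for this example and do not reflect the schema of any real genome database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE reaction (id TEXT PRIMARY KEY, pathway TEXT);
CREATE TABLE catalyzes (
    gene_id TEXT REFERENCES gene(id),
    reaction_id TEXT REFERENCES reaction(id)
);
INSERT INTO gene VALUES ('g1', 'geneA'), ('g2', 'geneB'), ('g3', 'geneC');
INSERT INTO reaction VALUES
    ('r1', 'pathway1'), ('r2', 'pathway1'), ('r3', 'pathway2');
INSERT INTO catalyzes VALUES
    ('g1', 'r1'), ('g1', 'r3'), ('g2', 'r2'), ('g3', 'r3');
""")

# One declarative query: join genes to reactions, group by gene,
# and keep genes whose reactions span more than one pathway.
QUERY = """
SELECT g.name, COUNT(DISTINCT r.pathway) AS n_pathways
FROM gene g
JOIN catalyzes c ON c.gene_id = g.id
JOIN reaction r ON r.id = c.reaction_id
GROUP BY g.id, g.name
HAVING COUNT(DISTINCT r.pathway) > 1
ORDER BY g.name
"""
rows = conn.execute(QUERY).fetchall()
```

With the sample rows above, only `geneA` participates in two pathways, so it is the only gene returned.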