The MoBIoS Project: Molecular Biological Information System

Daniel P. Miranker, Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics, University of Texas.


Presentation Transcript


  1. The MoBIoS Project: Molecular Biological Information System. Daniel P. Miranker, Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics, University of Texas. Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang

  2. Problem: In the life sciences, database management systems (DBMS) serve as glorified file managers. • Little use of sophisticated data and pattern-based retrieval • Real scientific and technological problems

  3. When biological data is put into an RDBMS • Primary data is stored in text or blob fields • Annotations may be relational • Data retrieval: filter the DB, then a sequential dump, O(n), to utilities, e.g. BLAST

  4. Linear Data Scans, O(n), Endemic in Life Sciences • Sequences: • DNA, RNA, Protein databases • Mass Spectra • proteomics • Small Molecules & Protein Structure • Protein interaction • Rational drug design • Pathways (graphs) • Phylogenies (graphs, trees in particular)

  5. Scope: To find common ground, both biology and DBMSs have to move • DBMS • Biological information system • Metric-space database as the common ground

  6. A metric space is a pair, M = (D, d), where • D is a set of points • d is a [metric] distance function with the following properties: • d(x,y) = d(y,x) (symmetry) • d(x,y) >= 0, and d(x,y) = 0 if and only if x = y (non-negativity) • d(x,z) <= d(x,y) + d(y,z) (triangle inequality)
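The three axioms can be checked by brute force on a small sample. The sketch below is illustrative only: the function names and the sample strings are assumptions, not part of MoBIoS. It tests Hamming distance against the axioms and, for contrast, a matching-positions score that fails them because it does not give 0 for identical inputs.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def matching_positions(a: str, b: str) -> int:
    """A similarity score (higher = more alike); not a distance."""
    return sum(x == y for x, y in zip(a, b))


def is_metric_on(points, d) -> bool:
    """Brute-force check of the metric axioms on a finite sample."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):                       # symmetry
            return False
        if d(x, y) < 0 or (d(x, y) == 0) != (x == y):  # non-negativity
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):              # triangle inequality
            return False
    return True


sample = ["AAA", "AAC", "ACG", "TTT", "GCA"]  # hypothetical 3-grams
hamming_is_metric = is_metric_on(sample, hamming)
score_is_metric = is_metric_on(sample, matching_positions)
```

Passing on a sample does not prove a function is a metric; a single failure, however, is enough to show that it is not.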

  7. Definition, by Analogy • A spatial database management system: • Extends relational DBMS • Special indexes for 2D and 3D data: k-d and R-trees • New data types • Geographic information systems: topographic maps, buildings and the like • A metric-space database management system: • Extends relational DBMS • Special indexes for metric spaces • New data types • Biological information system: life science data types

  8. Develop index structures to support distance & nearest-neighbor queries • Well studied in main-memory • But by no means a closed problem • In databases (external/disk based methods) • Embryonic • Many myths • Often assumed to be the basis of multimedia database systems

  9. How to build a metric-space index • Three algorithmic classes [Tasan, Ozsoyoglu 04] • Vantage points • Hyperplanes • Bounding spheres

  10. Vantage Point Method [Burkhard&Keller73]

  11. Vantage Point Method • Choose a point, VP, and a radius, R

  12. Vantage Point Method • Given VP, R • The predicates • d(VP,x) < R • d(VP,x) >= R • Divide the set into two equal halves • Apply recursively

  13. Query, q, range r

  14. Query, q, range r • if • d(q,VP) > R + r • then • all neighbors are outside the sphere
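The build and query steps on the preceding slides can be sketched as a small vantage-point tree. This is a minimal illustration under stated assumptions, not the MoBIoS implementation: vantage points are picked at random, R is the median distance (so the two halves are roughly equal), and the test data are hypothetical random 8-mers under Hamming distance. The search applies the slide's pruning rule for the inside half and the symmetric rule for the outside half, which also follows from the triangle inequality.

```python
import random
from dataclasses import dataclass
from statistics import median
from typing import Callable, Optional


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


@dataclass
class VPNode:
    vp: object                          # the vantage point stored at this node
    radius: float = 0.0                 # R: median distance from vp
    inside: Optional["VPNode"] = None   # points with d(vp, x) < R
    outside: Optional["VPNode"] = None  # points with d(vp, x) >= R


def build(points: list, d: Callable, rng: random.Random) -> Optional[VPNode]:
    if not points:
        return None
    points = list(points)
    vp = points.pop(rng.randrange(len(points)))  # assumption: random choice
    node = VPNode(vp)
    if points:
        dists = [d(vp, x) for x in points]
        node.radius = median(dists)
        node.inside = build(
            [x for x, dx in zip(points, dists) if dx < node.radius], d, rng)
        node.outside = build(
            [x for x, dx in zip(points, dists) if dx >= node.radius], d, rng)
    return node


def range_search(node, q, r, d, out):
    """Collect every stored point x with d(q, x) <= r."""
    if node is None:
        return out
    dq = d(q, node.vp)
    if dq <= r:
        out.append(node.vp)
    # Inside half can hold a match only if d(q, VP) - r < R.
    if dq - r < node.radius:
        range_search(node.inside, q, r, d, out)
    # Outside half can hold a match only if d(q, VP) + r >= R.
    if dq + r >= node.radius:
        range_search(node.outside, q, r, d, out)
    return out


rng = random.Random(0)
points = ["".join(rng.choice("ACGT") for _ in range(8)) for _ in range(300)]
tree = build(points, hamming, rng)
query, radius = "ACGTACGT", 3
found = sorted(range_search(tree, query, radius, hamming, []))
brute = sorted(x for x in points if hamming(query, x) <= radius)
```

The comparison against a linear scan (`brute`) is there only to confirm that pruning never drops a true neighbor.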

  15. Multi-vantage point method

  16. Multi-vantage point method • Consider d(VPi, x) a projection onto an axis • Looks like a k-d tree • Choose number k & d
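Treating each d(VPi, x) as a coordinate suggests a simple filter, sketched below under assumptions: it uses a flat table of precomputed pivot distances rather than a tree, the pivots are just the first few points, and the data are hypothetical random 8-mers under Hamming distance. A point is discarded without computing d(q, x) whenever some pivot shows |d(q,VPi) - d(x,VPi)| > r, which the triangle inequality guarantees is safe.

```python
import random


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def pivot_table(points, pivots, d):
    """Project every point onto one axis per vantage point."""
    return [[d(p, x) for p in pivots] for x in points]


def range_search(points, table, pivots, q, r, d):
    """Return (hits, number of full distance computations needed)."""
    q_proj = [d(p, q) for p in pivots]
    hits, computed = [], 0
    for x, row in zip(points, table):
        # Triangle inequality: |d(q, p) - d(x, p)| <= d(q, x) for any pivot p,
        # so a gap larger than r on any axis rules x out.
        if any(abs(a - b) > r for a, b in zip(q_proj, row)):
            continue
        computed += 1
        if d(q, x) <= r:
            hits.append(x)
    return hits, computed


rng = random.Random(0)
points = ["".join(rng.choice("ACGT") for _ in range(8)) for _ in range(300)]
pivots = points[:4]                      # assumption: arbitrary pivot choice
table = pivot_table(points, pivots, hamming)
query, radius = "ACGTACGT", 2
hits, computed = range_search(points, table, pivots, query, radius, hamming)
brute = [x for x in points if hamming(query, x) <= radius]
```

How many candidates survive the filter depends strongly on how the pivots are chosen and on the data distribution, which is the tuning question the following slides raise.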

  17. Myths • Solved problem; M-trees [Ciaccia et al. 96, 97] • I can’t get them to work on anything but their original synthetic data generator • A good choice for vantage points is to find corners [Yianilos93] (farthest-first clustering) • Might be true for Euclidean spaces • Early result, not true for our data • High-dimensional indexing always asymptotically reduces to linear scans • Formal result based on an assumption of uniform data distributions

  18. Comparison of Three Methods of Metric-Space Indexing • Figure 9. Comparison of metric-space index structures: RBT, GHT, and VPT

  19. Open problems • Is there a general metric-space index structure that is good for most workloads? • We are optimistic that mvp-trees, with further tuning, will be a useful answer • Hyperplane methods are fair game: there is circumstantial evidence that they are a key component of Google’s search engine • No work addresses clustering data pages on disk • Metric-space join algorithms

  20. Biological Models are Usually Based on Similarity • Similarity: • Biologists like scoring functions that reward each similar feature with a positive number • Intuitive • Distance: • More similar → smaller numbers • Identical → 0

  21. But Do Metric Models Capture Biology? • Metrics are a subset of possible mathematical models.

  22. Sequence Problem 1 • Sequence similarity is based on weighted edit distance • Accepted weight matrices, PAM & BLOSUM, are not metric • Log-odds matrices: negative values • Defy simple algebraic normalization [Taylor&Jones93, Linial et al. 97]

  23. Our First Result: mPAM [Xu&Miranker04] • Dayhoff et al.’s PAM derivation [74] • Took a set of closely related protein sequences • Developed a phylogenetic tree • Counted substitutions to transform one sequence to another • Tree determines a measure of time

  24. PAM vs. mPAM: t = 1/f • Using original substitution counts • PAM: frequency of substitution, $S(a,b \mid t) = \log\big(P(b \mid a,t)/q_b\big)$ • mPAM: expected time between substitutions, $D(a,b) = 1/\log\big(1 - \sum_x P(a,x)\,P(b,x)\big)$

  25. Sequence Problem 2 • Sequences are long units (identity for storage and retrieval) • Genes • Chromosomes • Analysis comprises comparing small substrings

  26. Solution: Sequence View • New view type • Breaks sequences into q-grams • CREATE SEQUENCEVIEW rice_sview AS SELECT CREATE FRAGMENTS (…, 3, 1) FROM … WHERE … USING HAMMING-DISTANCE
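A sequence view can be pictured as the list of overlapping q-grams of a sequence, each tagged with its offset. The sketch below is an illustration, not MoBIoS code: it assumes the arguments `3, 1` in the statement above mean fragment length 3 and step 1 (the slide does not say), and the function names and the sample sequence are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def fragments(sequence: str, q: int = 3, step: int = 1):
    """Break a sequence into (offset, q-gram) pairs."""
    return [(offset, sequence[offset:offset + q])
            for offset in range(0, len(sequence) - q + 1, step)]


def within(view, pattern: str, max_dist: int):
    """Linear-scan stand-in for an index lookup: offsets of all
    fragments within max_dist of the pattern."""
    return [offset for offset, frag in view if hamming(frag, pattern) <= max_dist]


view = fragments("AAACGAAT")            # hypothetical sequence
near_aaa = within(view, "AAA", 1)
```

In a metric-space database the `within` scan would be answered by the index rather than by visiting every fragment.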

  27. Materialize as an Index • D(AAA) ≤ 2

  28. Status • Started with McKoi • A Java open source object-relational DBMS • (Think of Postgres written in Java) • Added • Biological data types • Metric-space index • Extending SQL engine (in progress)

  29. Compare Arabidopsis Genome × Rice Genome • Locate nucleotide patterns of the form of a primer pair candidate: 18 matching nucleotides, a gap 400 to 3000 long (in both rice and Arabidopsis), then 18 matching nucleotides • Eliminate non-unique primer candidates • Merge overlapping primer candidates • Usual implementations are O(n^2), n = 10^9 • Computed in MoBIoS

  30. mSQL query to locate candidate primer pairs:
      SELECT merge(R1.fragment, A1.fragment)
      FROM G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
      WHERE distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) <= 1.0
        AND distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) <= 1.0
        AND (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) >= 400
        AND (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) <= 3000
        AND (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) >= 400
        AND (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) <= 3000
      GROUP BY R1.fragment, A1.fragment;
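To make the predicates of the query concrete, here is a naive, quadratic evaluation of the same conditions on toy strings. It is a sketch under assumptions: the function names, the tiny genomes, and the shrunken parameters in the example call are hypothetical, and the `merge` and `GROUP BY` steps are not reproduced. It is exactly the kind of scan that the index is meant to avoid.

```python
def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def fragments(sequence: str, q: int, step: int = 1):
    """(offset, q-gram) pairs, standing in for a sequence view."""
    return [(o, sequence[o:o + q]) for o in range(0, len(sequence) - q + 1, step)]


def candidate_pairs(g1, g2, q=18, max_dist=1, min_gap=400, max_gap=3000):
    view1, view2 = fragments(g1, q), fragments(g2, q)
    # Fragment-level matches between the two genomes (R ~ A in the query).
    matches = [(o1, o2)
               for o1, f1 in view1
               for o2, f2 in view2
               if hamming(f1, f2) <= max_dist]
    # Pair up matches whose offsets differ by an allowed amount in BOTH genomes.
    pairs = []
    for r1, a1 in matches:
        for r2, a2 in matches:
            if min_gap <= r2 - r1 <= max_gap and min_gap <= a2 - a1 <= max_gap:
                pairs.append(((r1, a1), (r2, a2)))
    return pairs


# Toy example with 3-grams, exact matches, and offset differences of 5 to 10.
toy_pairs = candidate_pairs("ACGTTTTTGCA", "ACGAAAAAAGCA",
                            q=3, max_dist=0, min_gap=5, max_gap=10)
```

In the toy call, `ACG` matches at offsets (0, 0) and `GCA` at offsets (8, 9), so the only candidate pairs those two matches.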

  31. Query Plan • Inputs: Arab. genome, O(n); rice genome, O(m) • Offline: build sequence view, O(n log n) • Compare, O(m log n): indexed nested loop • Eliminate duplicates • Eliminate low-complexity primers (LZ compression) • Merge overlapping primers • ~10,000 conserved primer pair candidates

  32. Preliminary Results • Found 13,418 possible primer pairs from MoBIoS • 100 best candidates BLASTed for matches in GenBank • 15 matched other plant genes and the primers • At least 2 of 15 showed potential after PCR amplification against Helianthus and Phalaenopsis.

  33. MoBIoS Architecture (Molecular Biological Information System)

  34. Analysing Mass Spectra • Spectrum = histogram of mass/charge ratios of a collection of peptides • Similarity = shared peaks count = inner product: (0100101) · (0111100) = 2

  35. Cosine Distance Approximates Inner Product • $D_{rs} = 1 - \frac{x_r^\top x_s}{(x_r^\top x_r)^{1/2}\,(x_s^\top x_s)^{1/2}}$ • Shown: mass spectra can be stored and retrieved using cosine distance, and it scales
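As a worked example, the sketch below computes the shared-peaks count and the cosine distance for the two binary spectra from the previous slide. The function name and the error handling for empty spectra are assumptions made for illustration.

```python
import math


def cosine_distance(x, y) -> float:
    """1 minus the cosine of the angle between two peak vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine distance is undefined for an all-zero spectrum")
    return 1.0 - dot / (norm_x * norm_y)


spectrum_r = [0, 1, 0, 0, 1, 0, 1]   # (0100101)
spectrum_s = [0, 1, 1, 1, 1, 0, 0]   # (0111100)

shared_peaks = sum(a * b for a, b in zip(spectrum_r, spectrum_s))  # inner product
distance = cosine_distance(spectrum_r, spectrum_s)  # 1 - 2 / sqrt(3 * 4)
```

The inner product counts the 2 shared peaks; dividing by the two vector lengths (sqrt(3) and sqrt(4)) keeps spectra with many peaks from looking similar to everything.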

  36. mSQL Query for Protein Identification by Mass-Spec. Signature Database Lookup:
      SELECT Prot.accession_id, Prot.sequence
      FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
      WHERE MS.enzyme = DS.enzyme = E
        AND Cosine_Distance(S, MS.spectrum, range1)
        AND DS.accession_id = MS.accession_id = Prot.accession_id
        AND DS.ms_peak = P
        AND MPAM250(PS, DS.sequence, range2);

  37. Matching Electrostatic Shape of Molecules

  38. Still benefit from grid services • Intermittently, but regularly, compile (recluster) the indices: O(n log n), n > 10^6 • Rational drug design: O(log n) finite element solutions to traverse the search tree • Make a service call to the grid for these operations only • Mirror data contents to minimize I/O • Since the need is intermittent, one grid serves many MoBIoS servers • Diagram labels: MoBIoS server; recluster; new index; shape match (FEM); distance (real); high-speed I/O; mirror DB contents

  39. Hyperplanes [Uhlmann91] • If d(x,h1) < d(x,h2) then x is assigned to h1
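The assignment rule can be sketched as a single-level partition plus the range-query pruning test that the triangle inequality allows. This is an illustration under assumptions, not the MoBIoS code: one level only (a tree would recurse on each side), reference points h1 and h2 taken arbitrarily from the data, and hypothetical random 8-mers under Hamming distance.

```python
import random


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def partition(points, h1, h2, d):
    """Assign each point to the closer of the two reference points."""
    near_h1 = [x for x in points if d(x, h1) < d(x, h2)]
    near_h2 = [x for x in points if d(x, h1) >= d(x, h2)]
    return near_h1, near_h2


def range_search(near_h1, near_h2, h1, h2, q, r, d):
    """Scan only the sides that can contain a point within r of q."""
    d1, d2 = d(q, h1), d(q, h2)
    hits = []
    # A match x on the h1 side forces d1 - d2 < 2r, so skip it otherwise.
    if d1 - d2 <= 2 * r:
        hits += [x for x in near_h1 if d(q, x) <= r]
    # A match x on the h2 side forces d2 - d1 <= 2r.
    if d2 - d1 <= 2 * r:
        hits += [x for x in near_h2 if d(q, x) <= r]
    return hits


rng = random.Random(1)
points = ["".join(rng.choice("ACGT") for _ in range(8)) for _ in range(200)]
h1, h2 = points[0], points[1]          # assumption: arbitrary reference points
near_h1, near_h2 = partition(points, h1, h2, hamming)
query, radius = "ACGTACGT", 2
hits = sorted(range_search(near_h1, near_h2, h1, h2, query, radius, hamming))
brute = sorted(x for x in points if hamming(query, x) <= radius)
```

Unlike the vantage point split, nothing here forces the two sides to be the same size, so balance depends on the choice of h1 and h2.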

  40. Develop a Hierarchical Clustering • Hierarchy of bounding spheres, (center, radius) • Bounding spheres may overlap • Inspired by R-trees
