1 / 42

Highly Scalable Algorithms for Robust String Barcoding

Highly Scalable Algorithms for Robust String Barcoding. Bhaskar DasGupta * Kishori M. Konwar Ion Mandoiu Alex Shavartsman Computer Science & Engineering Department University of Connecticut Storrs, CT * Department of Computer Science University of Illinois at Chicago Chicago, IL.

mcmillana
Download Presentation

Highly Scalable Algorithms for Robust String Barcoding

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Highly Scalable Algorithms for Robust String Barcoding Bhaskar DasGupta* Kishori M. Konwar Ion Mandoiu Alex Shavartsman Computer Science & Engineering Department University of Connecticut Storrs, CT *Department of Computer Science University of Illinois at Chicago Chicago, IL

  2. Motivation • There are many critical situations when one needs to rapidly identify an unknown genomic sequence from among a given set of known sequences • Rapid identification of pathogens in epidemic outbreaks • Monitoring of microbial communities, e.g., in environmental studies • Fast database search

  3. Possible Approaches • Sequencing based: sequence the unknown DNA sequence, then use similarity search programs such as BLAST to identify the unknown virus sequence for pathogens in database • Sequencing is prohibitively expensive and time consuming • Hybridization Based: identify the unknown sequence by testing for the presence of certain subsequences • Subsequence tests can be performed quickly and at low cost using a variety of hybridization based methods

  4. Sequence Fingerprints • For each sequence, find a subsequence that appears in that sequence and only in that sequence • Sequence barcodes: 0/1 vectors • When using fingerprints, barcode length = #sequences

  5. String Barcoding • (Borneman et al.’01, Rash & Gusfield’02): Unique occurrence of tested subsequences not needed, as long as 0/1 barcodes are unique • When using non-unique subsequences, barcode length can be much smaller than #sequences

  6. Overview • Problem Formulation and Previous Work • Greedy Setcover Algorithm • Experimental Results • Conclusions

  7. Problem Definition Given: Genomic sequences g1,…, gn Find: Minimum number of distinguisher strings t1,…,tk Such that: For every gi gj there exists a distinguisher tl which is a substring of gi or gj but not of both - At least log2n distinguishers needed - n distinguishers are always sufficient

  8. Computational Complexity • [Berman et al.’04] Cannot be approximated within a factor of (1-)ln(n) unless NP=DTIME(nloglog(n))

  9. Rash & Gusfield Integer Program • A non-redundant set of candidate distinguishers is generated using a suffix tree • One variable vi for each candidate distinguisher xi • vi = 1  xi is selected • vi = 0  xi is not selected

  10. Integer Program Example Minimize VTG + VATGGA+ VCAGT + VTTC +VGTGC#objective function Such that VTG+ VTTC + VGTGC >= 1#constraint to cover pair 1,2 VATGGA+ VCAGT + VGTGC >= 1#constraint to cover pair 1,3 VTG+ VATGGA + VCAGT + VTTC >= 1#constraint to cover pair 2,3 Binaries #all variables are 0/1 VTG VATGGAVCAGTVTTCVGTGC End

  11. Limitations of Integer Program Method • Works only for moderately sized datasets • 50-150 sequences • Average sequence length ~1000 nucleotides • Up to 4 hours needed to come within 20% of optimum

  12. Information Content Heuristic • [Berman et al. 2004] • Keep track of the partition defined by distinguishers selected so far • In every step, choose candidate that reduces partition entropy by largest amount • Theorem: Information Content Heuristic is always finding a #distinguishers within 1+ln(n) of optimum

  13. Limitations of ICH • Real genomic sequences contain degenerate nucleotides (e.g., N for any of {A,T,C,G}) due to sequencing errors and known single nucleotide polymorphisms • Distinguisher-to-sequence matches: • Perfect matches • Perfect mismatches • Uncertain matches • Information Content cannot be defined in the presence of uncertain matches

  14. Other Heuristics • (Cazalis et al 2004): greedy setcover, simulated annealing, and genetic algorithms for distinguisher selection • To achieve practical running time, only a small random subset (2000 candidates) of all candidate distinguishers is considered • No data provided on the loss of solution quality due to this restriction

  15. Overview • Problem Formulation and Previous work • Greedy Setcover Algorithm • Experimental Results • Conclusions

  16. Setcover Greedy Heuristic • Phase I: Candidate Generation • Generate a representative set of candidate distinguishers from the source sequences • Phase II: Greedy Distinguisher Selection • In every step, choose candidate that distinguishes the largest number of not yet distinguished pairs

  17. Candidate Generation • A set of candidate distinguishers guaranteed to contain an optimum solution is generated from the sequences • We do not generate certain redundant candidates • A candidate is redundant if there is another candidate that appears exactly in the same set of sequences • For every sequence we generate only one of the substrings that appear exclusively in that sequence

  18. Efficient Candidate Generation • Our implementation uses simple array datastructures • We generate candidates in increasing order of length • Exact match positions for candidates of length l-1 used to generate the exact matches for candidates of length l • Candidates that do not satisfy individual given biochemical constraints, such as minimum/maximumlength, GC content, melting temperature, are discarded

  19. Setcover Greedy Heuristic • Phase I: Candidate Generation • Generate a set of candidate distinguishers from the source sequences • Phase II: Greedy Distinguisher Selection • In every step, choose candidate that distinguishes the largest number of not yet distinguished pairs

  20. Distinguisher Selection as Set Cover • Set Cover Problem: given a universal set U and a family of subsets, find a minimum number of subsets covering U • Distinguisher selection is a special case of set cover: • Elements to be covered are the pairs of sequences • Each candidate distinguisher defines a set of pairs that it separates • By a classical result, the greedy algorithm has an approximation factor of 1+ln(|U|) Setcover greedy has approximation factor of 2*ln(n) for distinguisher selection with n sequences

  21. Distinguisher Selection • Start with an empty set D of distinguishers • While there are pairs of sequences not yet distinguished, do: • Compute for each remaining candidate c its coverage gain (c, D)– the number of not yet distinguished pairs of sequences that are distinguished by c • Add the candidate with maximum coverage gain to D • Return D

  22. Computation of (c, D): CATCAGA TTCAGT TAT AATCAG AATAG D = { }

  23. Computation of (c, D): c=TCAG CATCAGA TTCAGT TAT AATCAG AATAG D = { }

  24. Computation of (c, D): c=TCAG (c, D)= 3 x (5 –3) = 6 CATCAGA TTCAGT TAT AATCAG AATAG D = { }

  25. Computation of (c, D): CATCAGA TTCAGT TAT AATCAG AATAG D = {TCAG}

  26. Computation of (c, D): CATCAGA TTCAGT TAT c=AAT AATCAG AATAG D = {TCAG}

  27. Computation of (c, D): (c,D)= 1 x (2-1) + 1 x (3-1) = 3 CATCAGA TTCAGT TAT c=AAT AATCAG AATAG D = {TCAG}

  28. Computation of (c, D): CATCAGA TTCAGT TAT AATCAG AATAG D = {TCAG,AAT}

  29. Computation of (c, D) • S1, S2, …, Sk are the subsets in the partition defined by D • Mc is the set of matches of candidate c • Using simple datastructures, computation can be done in linear time (in the number of sequences)

  30. Lazy Update of Gains • Coverage gains are monotonically non-increasingduring the algorithm • Re-compute coverage gain for a candidate only if last saved gain is higher than the gain of current best candidate • In practice this speeds-up the selection algorithm by a factor of ~2

  31. Algorithm Extensions • Degenerate bases • A pair of sequences is separated by candidate c if • c has at least one perfect match with one of the sequences, and • c has perfect mismatches at all positions of the other sequence • Gain computation done in O(n2) time using a simple coverage matrix data-structure • Redundancy r • A pair of sequences is counted in the gain function until r distinguishers separate it • Distinguisher cross-hybridization • Minimum edit distance, or maximum common substring weight, bound for every pair of selected distinguishers • Candidates incompatible with a selected distinguisher removed from candidate list

  32. Overview • Problem Formulation and Previous work • Greedy Setcover Algorithm • Experimental Results • Conclusions

  33. Testcases • Randomly generated instances • Equal probabilities assigned to each of the four nucleotides • Microbial genomes extracted from NCBI databases • Sequence lengths between 490 Kbases to 4.75 Mbases • Small number of degenerate bases

  34. Selection time, L=10k, r=1 basic – O(n2) computation of gains using matrix datastructure partition – O(n) computation of gains using partition-based datastructure

  35. Candidate Sampling, n=1000, L=10k, r=1

  36. Comparison to ICH, L=10k, r=1

  37. Varying Redundancy, L=10k

  38. NCBI testcase • 20 NCBI microbial genomic sequences • Distinguisher melting temperature range of 55-60 oC • GC content range of 40-60% • Max common subsequence weight bound of 5 • weight(A)=weight(T)=1, weight(C)=weight(G)=2

  39. NCBI testcase, r=1 GATTCGAACCCCCGA AACTGTCTCACGACGTTCTGAA CACAAGGAGTGAGTGTTGC GTGGATGCCTTGGCA AAGCGCGTCGCAAA GGACTACCAGGGTATCTAATCCTG AAAGAAGATAGAGCAGCAGCT CCATTGACAATTTCAACACC CGGTTTTGTGCTTCATGG

  40. Results on 29 Microbial Sequences (76 Mb)

  41. Overview • Problem Formulation and Previous work • Greedy Setcover Algorithm • Experimental Results • Conclusions

  42. Conclusions • We provided highly scalable algorithms for the robust string barcoding problem, capable of handling whole genomic sequences of up to bacterial size • Distinguisher selection based whole genomic sequences results in a number of distinguishers nearly matching the information theoretic lower bounds for the problem • The software can be used online at http://dna.engr.uconn.edu/~software/DNA-BAR/

More Related