
Querying Sequence Databases

Algorithms for Comparative Sequence Analysis. Summer 2013. Querying Sequence Databases. Tamer Kahveci, CISE Department, University of Florida.


Presentation Transcript


  1. Algorithms for Comparative Sequence Analysis Summer 2013 Querying Sequence Databases Tamer Kahveci CISE Department University of Florida

  2. What is Database Search? • Database: many long sequences, or one giant sequence • A query is searched against the database

  3. What is Database Search? • Two giant sequences

  4. Database Search Issues • How can we search a massive space quickly? • How can we evaluate the significance of the result?

  5. Database Search Methods • Hash table based methods • FASTA family • FASTP, FASTA, TFASTA, FASTAX, FASTAY • BLAST family • BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ, MegaBLAST, PsiBLAST, PhiBLAST • Others • FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS • Suffix tree based methods • Mummer, AVID, Reputer, MGA, QUASAR

  6. Hash Table

  7. Hash Table • K-gram = subsequence of length K • A^K entries • A is alphabet size • Linear time construction • Constant lookup time
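The k-gram hash table described above can be sketched as follows. This is a minimal illustration, not the layout any particular tool uses: it relies on a Python dictionary instead of a preallocated table with A^K entries, and the function name `build_kgram_index` is made up for this example.

```python
from collections import defaultdict


def build_kgram_index(sequence, k):
    """Map each k-gram of `sequence` to the list of its 0-based start positions.

    One pass over the sequence, so construction is linear in its length
    (treating k as a small constant).
    """
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


index = build_kgram_index("TGAGTGCGA", 2)
# Expected constant-time lookup of every position where a 2-gram occurs.
print(index.get("TG"))  # [0, 4]
print(index.get("GA"))  # [1, 7]
```

A dictionary only stores the k-grams that actually occur, which is convenient for small examples; a direct-addressed array of size A^K trades memory for a guaranteed constant lookup.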

  8. FASTP Lipman & Pearson, 1985

  9. FASTP • Three phase algorithm • Find short good matches using k-grams • K = 1 or 2 • Find start and end positions for good matches • Use DP to align good matches

  10. FASTP: Phase 1 (1)

      position    1  2  3  4  5  6  7  8  9  10 11
      protein 1   n  c  s  p  t  a  .  .  .  .  .
      protein 2   .  .  .  .  .  a  c  s  p  r  k

      amino acid   position in protein A   position in protein B   offset (pos A - pos B)
      ----------------------------------------------------------------------------------
      a            6                       6                        0
      c            2                       7                       -5
      k            -                       11
      n            1                       -
      p            4                       9                       -5
      r            -                       10
      s            3                       8                       -5
      t            5                       -
      ----------------------------------------------------------------------------------

      Note the common offset for the 3 amino acids c, s and p. A possible alignment can be quickly found:

      protein 1   n c s p t a
                    | | |
      protein 2   a c s p r k

  11. FASTP: Phase 1 (2) • Similar to dot plot • Offsets range from 1-m to n-1 • Each offset is scored as • # matches - # mismatches • Diagonals (offsets) with large score show local similarities
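The offset bookkeeping of Phase 1 can be sketched as below, using the two example proteins with `.` standing for the unspecified residues. This is a simplified illustration: it only counts matches per diagonal and does not subtract mismatches as the slide's score does, and the function name and the `unknown` parameter are assumptions made for this example.

```python
from collections import Counter, defaultdict


def diagonal_match_counts(seq_a, seq_b, k=1, unknown="."):
    """Count k-gram matches on each diagonal (offset = pos A - pos B)."""
    # Hash table on seq_b: k-gram -> list of start positions.
    positions_b = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        word = seq_b[j:j + k]
        if unknown not in word:  # skip placeholder residues
            positions_b[word].append(j)

    counts = Counter()
    for i in range(len(seq_a) - k + 1):
        for j in positions_b.get(seq_a[i:i + k], ()):
            counts[i - j] += 1  # same offset whether positions are 0- or 1-based
    return counts


counts = diagonal_match_counts("ncspta.....", ".....acsprk")
print(counts.most_common(1))  # [(-5, 3)]: c, s and p share offset -5
```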

  12. FASTP: Phase 2 • 5 best diagonal runs are found • Rescore these 5 regions using PAM250. • Initial score • Indels are not considered yet

  13. FASTP: Phase 3 • Sort the aligned regions in descending score • Optimize these alignments using Needleman-Wunsch • Report the results
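The dynamic programming step can be illustrated with a score-only Needleman-Wunsch sketch. The scoring values here (match +1, mismatch -1, linear gap -2) are placeholders chosen for this example; the slides state that FASTP rescoring uses PAM250, which is not reproduced here.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    Keeps only two rows of the DP table, so memory is O(len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "ACGT"))  # 4
print(needleman_wunsch_score("ACGT", "AGT"))   # 1 (three matches, one gap)
```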

  14. FASTP - Discussion • Results are not optimal. Why? • How does performance compare to Smith-Waterman? • What is the impact of k? • How does this idea work for DNA? • K = 4 or 6 for DNA

  15. FASTA – Improvement Over FASTP Pearson 1995

  16. FASTA (1) • Phase 2: Choose 10 best diagonal runs instead of 5

  17. FASTA (2) • Phase 2.5 • Eliminate diagonals that score less than some given threshold. • Combine matches to find longer matches. It incurs join penalty similar to gap penalty

  18. BLAST Altschul, Gish, Miller, Myers, Lipman, 1990

  19. BLAST (or BLASTP) • BLAST – Basic Local Alignment Search Tool • An approximation of Smith-Waterman • Designed for database searches • Short query sequence against long database sequence or a database of many sequences • Sacrifices search sensitivity for speed

  20. BLAST Algorithm (1) • Eliminate low complexity regions from the query sequence. • Replace them with X (protein) or N (DNA) • Hash table on query sequence. • K = 3 for proteins • Example: query MCGPFILGTYC has k-grams MCG, CGP, ...

  21. BLAST Algorithm (2) • For each k-gram find all k-grams that align with score at least cutoff T using BLOSUM62 • 20^k candidates • ~50 on the average per k-gram • ~50n for the entire query • Build hash table • Example: query PQGMCGPFILGTYC has k-grams PQG, QGM, ... • Candidates for PQG and their scores: PQG 18, PEG 15, PRG 14, PSG 13, PQA 12 • With T = 13, PQA (score 12) is excluded
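The neighborhood scores in the PQG example can be reproduced with a small sketch. Only the handful of BLOSUM62 entries needed for these five candidates are included, so this is not a usable substitution matrix, and the helper names are invented for illustration. A full implementation would enumerate all 20^k candidate words rather than a fixed list.

```python
# Subset of BLOSUM62: only the residue pairs needed for this example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "S"): 0, ("G", "A"): 0,
}


def pair_score(a, b):
    """Look up a symmetric substitution score in the subset table."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(query_word, candidate):
    """Ungapped score of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def neighbors(query_word, candidates, cutoff):
    """Keep the candidates whose score against query_word is at least cutoff."""
    scored = [(w, word_score(query_word, w)) for w in candidates]
    return [(w, s) for w, s in scored if s >= cutoff]


print(neighbors("PQG", ["PQG", "PEG", "PRG", "PSG", "PQA"], cutoff=13))
# [('PQG', 18), ('PEG', 15), ('PRG', 14), ('PSG', 13)]
```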

  22. BLAST Algorithm (3) • Sequentially scan the database and locate each k-gram in the hash table • Each match is a seed for an ungapped alignment.

  23. BLAST Algorithm (4) • HSP (High-scoring Segment Pair) = A match between a query word and the database • Find a “hit”: Two non-overlapping HSPs on a diagonal within distance A • Extend the hit until the score falls below a threshold value, X
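The extension step can be sketched as an ungapped extension with a drop-off rule. This sketch makes two assumptions not stated on the slide: X is interpreted as the maximum allowed drop below the best score seen so far, and scoring is a simple +1/-1 match/mismatch scheme instead of a substitution matrix. Only the rightward direction is shown; the leftward extension is symmetric.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment to the right of a seed.

    Returns (best_score, best_length): the highest score reached and the
    number of aligned positions at which it was reached.
    """
    score = best_score = 0
    best_length = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best_score:
            best_score, best_length = score, i
        elif best_score - score > x_drop:
            break  # score dropped more than X below the best: stop extending
    return best_score, best_length


print(extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0, x_drop=2))  # (7, 7)
```

A larger `x_drop` lets the extension survive longer runs of mismatches before giving up, at the cost of more comparisons per seed.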

  24. BLAST Algorithm (5) • Keep only the extended matches that have a score at least S. • Determine the statistical significance of the result

  25. What is Statistical Significance? • Two one-on-one games, two scores: 13 : 15 and 13 : 15 • Which result is more significant? • Expected: maybe a random result. • Unexpected: significant, may have significant meanings.

  26. Statistical Significance • E-value: The expected number of matches with score at least S • $E = K m n e^{-\lambda S}$ • m, n : sequence lengths • S : alignment score • K, $\lambda$ : normalization parameters • P-value: The probability of having at least one match with score at least S • $P = 1 - e^{-E}$ • The smaller these values are, the more significant the result • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
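The two formulas can be evaluated directly, as in the sketch below. The values passed for K and lambda are arbitrary placeholders for illustration, not parameters of any real scoring system; in practice they depend on the scoring matrix and gap penalties.

```python
import math


def e_value(m, n, score, K, lam):
    """Expected number of matches with score >= `score`: E = K*m*n*exp(-lambda*S)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such match: P = 1 - exp(-E)."""
    # -expm1(-e) equals 1 - exp(-e) but stays accurate when e is tiny.
    return -math.expm1(-e)


# Placeholder parameters (hypothetical, for illustration only).
K, lam = 0.1, 0.3
for score in (30, 60, 90):
    e = e_value(m=300, n=1_000_000, score=score, K=K, lam=lam)
    print(score, e, p_value(e))
```

For small E the two values are nearly equal, since 1 - e^{-E} is approximately E when E is close to zero.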

  27. BLAST - Analysis • K (k-gram): Lower: more sensitive. Slower. • T (neighbor cutoff): Lower: finds distant neighbors. Introduces noise. • X (extension cutoff): Higher: lower chance of getting stuck in a local minimum. Slower.

  28. Sample Query • http://www.ncbi.nlm.nih.gov/BLAST/ Dhal_ecoli I D R A M S A A R G V F E R G D W S L S S P A K R K A V L N K L A D L M E A H A E E L A L L E T L D T G K P I R H S L R D D I P G A A R A I R W Y A E A I D K V Y G E V A T T S S H E L A M I V R E P V G V I A A I V P W N F P L L L T C W K L G P A L A A G N S V I L K P S E K S P L S A I R L A G L A K E A G L P D G V L N V V T G F G H E A G Q A L S R H N D I D A I A F T G S T R T G K Q L L K D A G D S N M K R V W L E A G G K S A N I V F A D C P D L Q Q A A S A T A A G I F Y N Q G Q V C I A G T R L L L E E S I A D E F L A L L K Q Q A Q N W Q P G H P L D P A T T M G T L I D C A H A D S V H S F I R E G E S K G Q L L L D G R N A G L A A A I G P T I F V D V D P N A S L S R E E I F G P V L V V T R F T S E E Q A L Q L A N D S Q Y G L G A A V W T R D L S R A H R M S R R L K A G S V F V N N Y N D G D M T V P F G G Y K Q S G N G R D K S L H A L E K F T E L K T I W I

  29. BLASTN • BLAST for nucleic acids • K = 11 • Exact match instead of neighborhood search.

  30. Even More Variations • PsiBLAST (iterative) • BLAT, BLASTZ, MegaBLAST • FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS • Main differences are • Seed choice (k, gapped seeds) • Additional data structures

  31. Suffix Trees

  32. Suffix Tree • Tree structure that contains all suffixes of the input sequence • TGAGTGCGA • GAGTGCGA • AGTGCGA • GTGCGA • TGCGA • GCGA • CGA • GA • A

  33. Suffix Tree Example

  34. Suffix Tree Analysis • O(n) space and construction time • 10n to 70n space usage reported • O(m) search time for m-letter sequence • Good for • Small data • Exact matches

  35. Suffix Array • 5 bytes per letter • O(m log n) search time • Better space usage • Slower search
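A suffix array and its O(m log n) search can be sketched on the sequence TGAGTGCGA from the suffix tree slide. The construction here simply sorts the suffixes, which is an assumption made for brevity and is far slower than the specialized construction algorithms used in practice; the function names are illustrative.

```python
def build_suffix_array(text):
    """Return suffix start positions in lexicographic order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of `pattern`, using two binary searches.

    Each comparison looks at most len(pattern) letters, giving O(m log n).
    """
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # First rank whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First rank whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "TGAGTGCGA"
sa = build_suffix_array(text)
print(sa)                                # [8, 2, 6, 7, 1, 5, 3, 0, 4]
print(find_occurrences(text, sa, "GA"))  # [1, 7]
```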

  36. Mummer

  37. Other Sequence Comparison Tools • Reputer, MGA, AVID • QUASAR (suffix array)
