
Information Retrieval (Recuperació de la informació)




  1. Information Retrieval (Recuperació de la informació)
     References:
     • Modern Information Retrieval (1999), Ricardo Baeza-Yates and Berthier Ribeiro-Neto
     • Flexible Pattern Matching in Strings (2002), Gonzalo Navarro and Mathieu Raffinot
     • http://www-igm.univ-mlv.fr/~lecroq/string/index.html
     Algorithms for:
     • Pattern search, exact and approximate (string matching and pattern matching)
     • Text indexing: suffix trees, suffix arrays

  2. String Matching
     String matching: the definition of the problem (text, pattern) depends on what we have: text or patterns.
     • Exact matching:
       • The patterns ---> data structures for the patterns
         • 1 pattern ---> the algorithm depends on |p| and |Σ|
         • k patterns ---> the algorithm depends on k, |p| and |Σ|
         • Extensions: regular expressions
       • The text ---> data structure for the text (suffix tree, ...)
     • Approximate matching:
       • Dynamic programming
       • Sequence alignment (pairwise and multiple)
       • Sequence assembly: hash algorithm
       • Probabilistic search: Hidden Markov Models

  3. Exact string matching: one pattern (text on-line)
     Experimental efficiency (Navarro & Raffinot)
     BNDM: Backward Nondeterministic Dawg Matching
     BOM: Backward Oracle Matching
     [Figure: map of the fastest algorithm as a function of alphabet size |Σ| (vertical axis, 2 to 64) and pattern length (horizontal axis, 2 to 256, with w marked). Regions: Horspool, BOM, BNDM.]

  4. Multiple string matching
     [Figure: four maps of the fastest algorithm as a function of alphabet size |Σ| (vertical axis, 2 to 8) and minimum pattern length lmin (horizontal axis, 5 to 45), one each for 5, 10, 100 and 1000 strings. Regions: Wu-Manber, SBOM, Ad AC.]

  5-9. Trie
     Construct the trie of GTATGTA, GTAT, TAATA, GTGTA.
     [Figure, built up over five slides: the trie after inserting each pattern in turn.]
     What is the cost?
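The cost question can be explored with a small sketch. It assumes a trie stored as a list of per-node dictionaries, a representation chosen here for illustration and not taken from the slides; the names `build_trie` and `terminal` are hypothetical. Each pattern symbol costs one dictionary lookup, so under this assumption construction takes time proportional to the total length of the patterns.

```python
def build_trie(patterns):
    """Return (trie, terminal): trie[node][symbol] -> child node, root is node 0."""
    trie = [{}]          # node 0 is the root
    terminal = set()     # nodes where some pattern ends
    for p in patterns:
        node = 0
        for c in p:      # one dictionary lookup per pattern symbol
            if c not in trie[node]:
                trie.append({})
                trie[node][c] = len(trie) - 1
            node = trie[node][c]
        terminal.add(node)
    return trie, terminal


trie, terminal = build_trie(["GTATGTA", "GTAT", "TAATA", "GTGTA"])
print(len(trie) - 1, "nodes besides the root;", len(terminal), "terminal nodes")
```

For the four patterns above the sketch creates 15 nodes besides the root from 21 input symbols, because shared prefixes such as GTA are stored once.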

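  10. Set Horspool algorithm
     • How is the comparison made? By suffixes: the text is read backwards from the end of the window and matched against the trie of all reversed patterns.
     • What is the next position of the window? Let a be the text character at the end of the window. We shift until a is aligned with the first a in the trie at a depth no greater than lmin, or by lmin if there is none.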

  11-14. Set Horspool algorithm
     Search for ATGTATG, TATG, ATAAT, ATGTG.
     1. Construct the trie of the reversed patterns GTATGTA, GTAT, TAATA and GTGTA.
     2. Determine lmin = 4.
     3. Determine the shift table:
          A  1
          C  4 (lmin)
          G  2
          T  1
     4. Find the patterns.
     [Figure, built up over four slides: the trie of reversed patterns and the shift table, filled in one entry at a time.]
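Step 3 can be sketched as follows. The rule used here is an assumption inferred from the table on the slides: for each symbol, take its smallest distance to the end of any pattern, ignoring the last position of each pattern, and cap the result at lmin. The function name and the explicit `alphabet` argument are hypothetical.

```python
def set_horspool_shift_table(patterns, alphabet):
    """Return (lmin, shift) for the one-symbol Set Horspool shift."""
    lmin = min(len(p) for p in patterns)
    shift = {c: lmin for c in alphabet}      # symbols that never occur keep lmin
    for p in patterns:
        for i, c in enumerate(p[:-1]):       # the last position is excluded
            shift[c] = min(shift.get(c, lmin), len(p) - 1 - i)
    return lmin, shift


lmin, shift = set_horspool_shift_table(["ATGTATG", "TATG", "ATAAT", "ATGTG"], "ACGT")
print(lmin, shift)   # 4 {'A': 1, 'C': 4, 'G': 2, 'T': 1}
```

Working on the original patterns from right to left is equivalent to measuring depths in the trie of reversed patterns, which is how the slides present it.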

  15-22. Set Horspool algorithm
     Search for ATGTATG, TATG, ATAAT, ATGTG
     text: ACATGCTATGTGACA…
     Shift table: A 1, C 4 (lmin), G 2, T 1
     [Figure, animated over eight slides: the window moves along the text; at each position the text is read backwards through the trie of reversed patterns, then the window is shifted according to the table.]
     The more patterns we search for, the shorter the shifts become!
     Is the expected length of the shifts related to the number of patterns?
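The search walked through on these slides can be sketched as below. This is a minimal illustration under stated assumptions, not the presenter's implementation: the trie is a list of dictionaries, the shift rule is the one inferred from the slide's table, and only the visible part of the text (before the ellipsis) is searched. The name `set_horspool_search` is hypothetical.

```python
def set_horspool_search(text, patterns):
    """Return (start, pattern) pairs for every occurrence found."""
    lmin = min(len(p) for p in patterns)
    trie, ends = [{}], {}                    # trie of reversed patterns
    for p in patterns:
        node = 0
        for c in reversed(p):
            if c not in trie[node]:
                trie.append({})
                trie[node][c] = len(trie) - 1
            node = trie[node][c]
        ends[node] = p
    shift = {}                               # one-symbol shift table
    for p in patterns:
        for i, c in enumerate(p[:-1]):
            shift[c] = min(shift.get(c, lmin), len(p) - 1 - i)
    hits = []
    pos = lmin - 1                           # index of the last window symbol
    while pos < len(text):
        node, i = 0, pos
        while i >= 0 and text[i] in trie[node]:   # read the text backwards
            node = trie[node][text[i]]
            if node in ends:
                hits.append((i, ends[node]))
            i -= 1
        pos += shift.get(text[pos], lmin)    # shift on the last window symbol
    return hits


print(set_horspool_search("ACATGCTATGTGACA", ["ATGTATG", "TATG", "ATAAT", "ATGTG"]))
# [(6, 'TATG'), (7, 'ATGTG')]
```

Positions are 0-based start indices. Every shift is taken from the table, so adding patterns can only keep each entry the same or make it smaller.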

  23-27. Set Horspool algorithm / Wu-Manber algorithm
     Given ATGTATG, TATG, ATAAT, ATGTG.
     How can the length of the shifts be increased? By reading blocks of symbols instead of only one!
     Shift table for 1 symbol:
          A  1
          C  4 (lmin)
          G  2
          T  1
     Shift table for blocks of 2 symbols (L = 2), filled in over five slides:
          AA  1
          AT  1
          GT  1
          TA  2
          TG  2
          every other block (AC, AG, CA, CC, CG, …)  3 (LMIN-L+1)
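The two-symbol table can be reproduced with a short sketch. The rule is an assumption inferred from the values on the slides: it is the same distance-to-the-end rule as for single symbols, applied to blocks, with the default LMIN-L+1 for blocks that never occur. The published Wu-Manber algorithm organises its tables differently, so this should be read as an illustration of the slide, and `block_shift_table` is a hypothetical name.

```python
def block_shift_table(patterns, block_len):
    """Return (default_shift, shift) where shift maps a block to its shift."""
    lmin = min(len(p) for p in patterns)
    default = lmin - block_len + 1           # LMIN-L+1 on the slide
    shift = {}
    for p in patterns:
        # 'end' is the index of the block's last symbol; the block that ends
        # at the last position of the pattern is excluded.
        for end in range(block_len - 1, len(p) - 1):
            block = p[end - block_len + 1:end + 1]
            shift[block] = min(shift.get(block, default), len(p) - 1 - end)
    return default, shift


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(block_shift_table(patterns, 2))
# (3, {'AT': 1, 'TG': 2, 'GT': 1, 'TA': 2, 'AA': 1})
```

Calling the same function with `block_len=1` gives back the one-symbol entries A 1, G 2, T 1 with default 4.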

  28-31. Wu-Manber algorithm
     Search for ATGTATG, TATG, ATAAT, ATGTG
     text: ACATGCTATGTGACATAATA
     Shift table for blocks of 2 symbols: AA 1, AT 1, GT 1, TA 2, TG 2
     [Figure, animated over four slides: the window moves along the text; the text is read backwards through the trie of reversed patterns, then the window is shifted according to the block at its end.]
     But given k patterns, how many symbols should we take? … $\log_{|\Sigma|}(2 \cdot \mathit{lmin} \cdot k)$
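A sketch of the block-based search on this text follows. It makes several assumptions: candidates are verified with the trie of reversed patterns drawn on the slides (the published Wu-Manber algorithm uses hash tables instead), the block shift rule is the one inferred from the slide's table, and `block_len` must not exceed lmin. The names `block_search` and `suggested_block_length` are hypothetical; the latter just evaluates the slide's formula without rounding.

```python
import math


def suggested_block_length(k, lmin, alphabet_size):
    """Evaluate log base |Σ| of (2 * lmin * k), as on the slide."""
    return math.log(2 * lmin * k, alphabet_size)


def block_search(text, patterns, block_len=2):
    """Return (start, pattern) pairs, shifting on blocks of block_len symbols."""
    lmin = min(len(p) for p in patterns)
    default = lmin - block_len + 1
    trie, ends = [{}], {}                    # trie of reversed patterns
    for p in patterns:
        node = 0
        for c in reversed(p):
            if c not in trie[node]:
                trie.append({})
                trie[node][c] = len(trie) - 1
            node = trie[node][c]
        ends[node] = p
    shift = {}                               # block shift table
    for p in patterns:
        for end in range(block_len - 1, len(p) - 1):
            block = p[end - block_len + 1:end + 1]
            shift[block] = min(shift.get(block, default), len(p) - 1 - end)
    hits = []
    pos = lmin - 1                           # index of the last window symbol
    while pos < len(text):
        node, i = 0, pos
        while i >= 0 and text[i] in trie[node]:   # verify backwards
            node = trie[node][text[i]]
            if node in ends:
                hits.append((i, ends[node]))
            i -= 1
        pos += shift.get(text[pos - block_len + 1:pos + 1], default)
    return hits


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(block_search("ACATGCTATGTGACATAATA", patterns))
# [(6, 'TATG'), (7, 'ATGTG'), (14, 'ATAAT')]
print(suggested_block_length(k=4, lmin=4, alphabet_size=4))   # about 2.5
```

For the four DNA patterns of the example the formula gives about 2.5 symbols, which is consistent with the slides moving from one symbol to blocks of two.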

  32. Multiple string matching
     [Figure: four maps of the fastest algorithm as a function of alphabet size |Σ| (vertical axis, 2 to 8) and minimum pattern length lmin (horizontal axis, 5 to 45), one each for 5, 10, 100 and 1000 strings. Regions: Wu-Manber, SBOM, Ad AC.]

  33. BOM algorithm (Backward Oracle Matching)
     • How is the comparison made? The text in the window is read backwards through an automaton, the Factor Oracle, to check whether the suffix read so far is a factor of the pattern.
     • What is the next position of the window? The position determined by the last character of the text that has a transition in the automaton.
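The two bullets above can be sketched for a single pattern. The oracle construction below is the usual on-line one (each state keeps a supply link and external transitions are added while following those links); the list-of-dictionaries layout and the names `factor_oracle` and `bom_search` are assumptions made for this illustration.

```python
def factor_oracle(p):
    """Return the transitions of the factor oracle of p (state 0 is initial)."""
    trans = [{} for _ in range(len(p) + 1)]
    supply = [None] * (len(p) + 1)           # None plays the role of "undefined"
    for i, c in enumerate(p, start=1):
        trans[i - 1][c] = i                  # internal transition
        k = supply[i - 1]
        while k is not None and c not in trans[k]:
            trans[k][c] = i                  # external transition
            k = supply[k]
        supply[i] = 0 if k is None else trans[k][c]
    return trans


def bom_search(text, pattern):
    """Return the 0-based start positions of pattern in text."""
    m = len(pattern)
    trans = factor_oracle(pattern[::-1])     # oracle of the reversed pattern
    hits = []
    pos = 0
    while pos <= len(text) - m:
        state, j = 0, m
        while j > 0 and state is not None:   # read the window backwards
            state = trans[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:                # all m symbols were read
            hits.append(pos)
        pos += j + 1                         # restart just after the failing symbol
    return hits


print(bom_search("ACATGCTATGTGACA", "TATG"))   # [6]
```

The sketch relies on a known property of the factor oracle: the only string of length m it accepts is the pattern itself, so reading the whole window without failing is enough to report an occurrence.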

  34. Factor Oracle of k strings
     How can we build the Factor Oracle of GTATGTA, GTAA, TAATA and GTGTA?
     [Figure: the finished automaton, with terminal states labelled 1,4, 2 and 3.]

  35-39. Factor Oracle of k strings
     Given the Factor Oracle of GTATGTA …
     [Figure, built up over five slides: the Factor Oracle of GTATGTA.]

  40-41. Factor Oracle of k strings
     … we insert GTAA.
     [Figure: the terminal state of GTATGTA is labelled 1 and the terminal state of GTAA is labelled 2.]

  42-43. Factor Oracle of k strings
     Given the AFO (Automata Factor Oracle) of GTATGTA and GTAA, we insert TAATA.
     [Figure: the terminal state of TAATA is labelled 3.]

  44-45. Factor Oracle of k strings
     Given the AFO of GTATGTA, GTAA and TAATA, we insert GTGTA.

  46. Factor Oracle of k strings
     This is the Automata Factor Oracle of GTATGTA, GTAA, TAATA and GTGTA.
     [Figure: the finished automaton; one terminal state carries the label 1,4, the others 2 and 3.]
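One way to build such an automaton in code is the trie-based construction: build the trie of the patterns, visit its states in breadth-first order, and add external transitions while following supply links. This is a hedged sketch of that construction, not a transcription of the slides; in particular it keeps one state per trie node, so its states and numbering may differ from the drawing above. The names `factor_oracle_of_set` and `is_accepted` are hypothetical.

```python
from collections import deque


def factor_oracle_of_set(patterns):
    """Return (trans, terminal) for a trie-based factor oracle of the patterns."""
    trans = [{}]                             # starts as the trie; state 0 is initial
    terminal = {}                            # state -> pattern numbers (1-based)
    for number, p in enumerate(patterns, start=1):
        s = 0
        for c in p:
            if c not in trans[s]:
                trans.append({})
                trans[s][c] = len(trans) - 1
            s = trans[s][c]
        terminal.setdefault(s, []).append(number)
    # Breadth-first order of the trie edges, fixed before any edge is added.
    order, queue = [], deque([0])
    while queue:
        s = queue.popleft()
        for c, t in trans[s].items():
            order.append((s, c, t))
            queue.append(t)
    supply = [None] * len(trans)             # None plays the role of "undefined"
    for parent, c, current in order:
        down = supply[parent]
        while down is not None and c not in trans[down]:
            trans[down][c] = current         # external transition
            down = supply[down]
        supply[current] = 0 if down is None else trans[down][c]
    return trans, terminal


def is_accepted(trans, word):
    """True if the automaton can read word starting from the initial state."""
    s = 0
    for c in word:
        s = trans[s].get(c)
        if s is None:
            return False
    return True


trans, terminal = factor_oracle_of_set(["GTATGTA", "GTAA", "TAATA", "GTGTA"])
print(is_accepted(trans, "ATGT"), is_accepted(trans, "AG"))   # True False
```

Every factor of every pattern is accepted by this automaton. Like any factor oracle it may also accept some strings that are not factors, which is why BOM-style algorithms use it only to decide when to stop reading.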

  47. SBOM algorithm
     • How is the comparison made? The text in the window is read backwards through an automaton, the Factor Oracle of the reversed patterns cut to length lmin, to check whether the suffix read so far is a factor of any pattern.
     • What is the next position of the window? The position determined by the last character of the text that has a transition in the automaton.

  48. SBOM algorithm: example
     We search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG …
     … then we build the Automata Factor Oracle of GTATG, GTAAT, TAATA and GTGTA, of length lmin = 5.
     [Figure: the automaton, with terminal states labelled 1, 4, 2 and 3.]

  49-50. SBOM algorithm: example
     Search for ATGTATG, TAATG, TAATAAT and AATGTG
     text: ACATGCTAGCTATAATAATGTATG
     [Figure, animated over two slides: the automaton and the search window on the text.]
