
Introduction to Stringology




  1. Introduction to Stringology Like Zhang

  2. Outline • What is “Stringology”? • How to perform string matching? • KMP String Searching • Boyer-Moore Algorithm • Trie and Suffix Tree • Approximate Pattern Matching • Interesting Problems

  3. Stringology? • Text algorithms; algorithms on strings • Practical problems: e.g. string matching, text compression • Theoretical problems: e.g. symmetric strings, repetitions, etc.

  4. Why “Stringology”? • Most algorithm books give it only isolated, brief coverage • Rich content is accessible only in academic papers and journals • Applicable to many areas, including web search, intrusion detection, bioinformatics, multimedia, data compression, etc. • Understanding binary strings is fundamental to computer science

  5. At the beginning… Question: Given a string “abcdefghijk” and a pattern P, determine whether P occurs in the given string. Solution: C++: std::string s = "abcdefghijk"; return s.find(P, 0) != std::string::npos; Java: String s = "abcdefghijk"; return s.indexOf(P) >= 0; (indexOf() returns -1 when P is absent and 0 when P matches at the very start, so the test must be >= 0)

  6. Do you care about performance? • What if the given string is 100 GB and the pattern is 100 MB? • What if the indexOf() and find() methods use brute-force searching? Brute-force string searching (pseudocode): for (int i = 0; i + p.Length <= s.Length; ++i) /* O(n) */ { compare(s[i, i+p.Length], p); /* O(m) */ } Total time: O(n*m). For 100 GB of data (about 10^11 bytes) and a 100 MB pattern (about 10^8 bytes), that is about 10^19 comparisons, which takes around 277,777 hours (about 32 years) on a 10 GHz CPU, supposing each comparison takes one clock cycle
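The brute-force pseudocode above can be written out as a small C++ function. This is a minimal sketch, not how any particular library implements find() or indexOf(); the name bruteForceFind is chosen here for illustration only.

```cpp
#include <cstddef>
#include <string>

// Brute-force search: try every alignment of p against s.
// Returns the index of the first match, or std::string::npos if there is none.
// (bruteForceFind is an illustrative name, not a library function.)
std::size_t bruteForceFind(const std::string& s, const std::string& p) {
    if (p.size() > s.size()) return std::string::npos;
    for (std::size_t i = 0; i + p.size() <= s.size(); ++i) {  // O(n) alignments
        std::size_t j = 0;
        while (j < p.size() && s[i + j] == p[j]) ++j;         // up to O(m) each
        if (j == p.size()) return i;                          // full match at i
    }
    return std::string::npos;
}
```

Calling bruteForceFind("abcdefghijk", "def") walks alignments 0, 1, 2 and succeeds at 3; every failed alignment restarts the comparison from the first pattern character, which is where the O(n*m) worst case comes from.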

  7. Why is it slow? Problem with brute-force string searching: • The same characters are compared multiple times e.g. S=“abedabcdfghij”, P=“abedabz”; 1st: abedabcdfghij, start at index 0 2nd: abedabcdfghij, start at index 1 3rd: abedabcdfghij, start at index 2 …

  8. KMP Algorithm • Knuth-Morris-Pratt algorithm, published in 1977 • Preprocesses the search pattern to avoid redundant comparisons e.g. For pattern “abedabz”, if the mismatch happens at “z”, the matched text ends in “ab”, which is also the start of the pattern, so the pattern can jump ahead to line up with that “ab” instead of shifting one position at a time KMP: (i is the current location) … if ( S[i]!=P[j] ) i = i + lookupTable[j]; … Brute Force: (i is the current location) … if ( S[i]!=P[j] ) i = i + 1; …

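  9. Build KMP Table e.g. the pattern is “101101”. Table[i] is the shift applied when the mismatch happens at P[i]: Table[0]=1; Table[1]=1; Table[2]=1; Table[3]=2 (matched “101”, which ends with the prefix “1”); Table[4]=3 (matched “1011”, which ends with the prefix “1”); Table[5]=3 (matched “10110”, which ends with the prefix “10”). Rule: Table[i]=k for the smallest k (0<k<i) such that P[k,i-1]==P[0,i-1-k]; otherwise, Table[i]=1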

  10. Boyer-Moore Algorithm • Published in 1977 • The longer the pattern, the faster it tends to run • Compares from the end of the pattern, while KMP compares from the beginning • Works best on character strings (large alphabets), while KMP works best on binary strings (small alphabets) Live Demo: http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/index.html
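To illustrate the right-to-left comparison, here is a sketch of the Boyer-Moore-Horspool simplification rather than the full Boyer-Moore algorithm: it keeps only a bad-character shift table and omits the good-suffix rule. The name horspoolFind is an illustrative assumption, and the 256-entry table assumes single-byte characters.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Boyer-Moore-Horspool (a simplified variant of Boyer-Moore).
// Returns the index of the first occurrence of p in s, or std::string::npos.
std::size_t horspoolFind(const std::string& s, const std::string& p) {
    const std::size_t n = s.size();
    const std::size_t m = p.size();
    if (m == 0) return 0;
    if (m > n) return std::string::npos;

    // shift[c] = distance from the last occurrence of c in p[0..m-2]
    // to the end of the pattern; characters not in the pattern shift by m.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t k = 0; k + 1 < m; ++k)
        shift[static_cast<unsigned char>(p[k])] = m - 1 - k;

    std::size_t i = 0;  // current alignment of p[0] in s
    while (i + m <= n) {
        std::size_t j = m;
        while (j > 0 && s[i + j - 1] == p[j - 1]) --j;  // compare right to left
        if (j == 0) return i;
        // Shift according to the text character under the pattern's last position.
        i += shift[static_cast<unsigned char>(s[i + m - 1])];
    }
    return std::string::npos;
}
```

A text character that does not occur in the pattern lets the search jump a full pattern length, which is why longer patterns and larger alphabets tend to help.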

  11. Trie and Suffix Tree • KMP and Boyer-Moore - Preprocess the patterns - Search for the patterns in input strings • Trie and Suffix Tree - Preprocess the existing strings (e.g. a dictionary) - Search for input patterns in the built tree

  12. A Simple Non-Compact Trie For strings: BIG, BIGGER, BILL, GOOD, GOSH
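A non-compact trie like the one on this slide can be sketched as follows, with one node per character. The Trie and TrieNode types and their member names are illustrative assumptions; std::make_unique requires C++14.

```cpp
#include <map>
#include <memory>
#include <string>

// One node per character; isWord marks the end of a stored string.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool isWord = false;
};

class Trie {
public:
    // Insert a word, creating nodes along its path as needed.
    void insert(const std::string& word) {
        TrieNode* node = &root_;
        for (char c : word) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->isWord = true;
    }

    // True if the exact word was inserted.
    bool contains(const std::string& word) const {
        const TrieNode* node = walk(word);
        return node != nullptr && node->isWord;
    }

    // True if some inserted word starts with the given prefix.
    bool hasPrefix(const std::string& prefix) const {
        return walk(prefix) != nullptr;
    }

private:
    // Follow s from the root; nullptr if the path does not exist.
    const TrieNode* walk(const std::string& s) const {
        const TrieNode* node = &root_;
        for (char c : s) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
        }
        return node;
    }

    TrieNode root_;
};
```

After inserting BIG, BIGGER, BILL, GOOD and GOSH, a lookup costs time proportional to the length of the query, independent of how many strings are stored. BIG and BIGGER share the path B-I-G, so the node for the second G must be marked as a word end.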

  13. Compact Trie Shrink all chains leading to leaves

  14. Patricia Each edge represents multiple characters

  15. Online Suffix Trie Building For each input character X: • Append X to every existing suffix (extend all suffix leaves) • Add X itself as a new suffix (if X is not already a child of the root, add it there)

  16. Build a Suffix Trie Online Given Text: abaab Step 1 (start from the end): a [Figure: trie containing the suffix “a”]

  17. Build a Suffix Trie Online Step 2: Input character “b” [Figure: trie with new suffixes “ab” and “b”]

  18. Build a Suffix Trie Online Step 3: Input character “a” [Figure: trie with new suffixes “aba” and “ba”; “a” is an existing suffix]

  19. Build a Suffix Trie Online Step 4: Input character “a” [Figure: trie with new suffixes “abaa”, “baa” and “aa”]

  20. Build a Suffix Trie Online Step 5: Input character “b” [Figure: trie with new suffixes “abaab”, “baab” and “aab”]
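The online procedure from the preceding slides can be sketched naively as below: the structure remembers the node where each current suffix ends, and every new character extends all of them and then starts a fresh suffix at the root. This is an uncompressed suffix trie with quadratic worst-case size, intended only to mirror the slides, not an efficient suffix tree construction. SuffixTrie, append and containsSubstring are illustrative names.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct SuffixTrieNode {
    std::map<char, std::unique_ptr<SuffixTrieNode>> children;
};

class SuffixTrie {
public:
    SuffixTrie() : active_{&root_} {}
    SuffixTrie(const SuffixTrie&) = delete;             // active_ points into this object
    SuffixTrie& operator=(const SuffixTrie&) = delete;

    // Feed one more character of the text.
    void append(char x) {
        std::vector<SuffixTrieNode*> next;
        next.reserve(active_.size() + 1);
        for (SuffixTrieNode* node : active_) {           // extend every current suffix by x
            std::unique_ptr<SuffixTrieNode>& child = node->children[x];
            if (!child) child = std::make_unique<SuffixTrieNode>();
            next.push_back(child.get());
        }
        next.push_back(&root_);                          // the empty suffix, for the next step
        active_ = std::move(next);
    }

    // A pattern is a substring of the text iff it is a path from the root.
    bool containsSubstring(const std::string& p) const {
        const SuffixTrieNode* node = &root_;
        for (char c : p) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return false;
            node = it->second.get();
        }
        return true;
    }

private:
    SuffixTrieNode root_;
    std::vector<SuffixTrieNode*> active_;  // end nodes of all current suffixes, plus the root
};
```

Feeding the characters of “abaab” one at a time reproduces the five steps; afterwards any substring query, such as “baa”, is answered by a single walk from the root.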

  21. Suffix Array String Searching U. Manber and G. Myers, “Suffix arrays: a new method for on-line string searches”, SIAM Journal on Computing, 1993 Another source: “Programming Pearls”, Ch. 15 • Sort the suffixes of the string (stored as pointers) • Binary search the sorted suffixes for the pattern

  22. Example of Suffix Array Search Existing string: “google” Its suffixes: google, oogle, ogle, gle, le, e Sorted: e, gle, google, le, ogle, oogle Search pattern “good”: • Compare with “le” (“good” sorts before it) • Compare with “gle” (“good” sorts after it) • Compare with “google”: “good” != “google”, return false
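The example can be reproduced with a short sketch. The suffix array here is built by sorting suffix start positions with std::sort, which is simple but not the construction from the Manber-Myers paper, and the function names buildSuffixArray and suffixArrayContains are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Suffix start positions, sorted by the suffixes they point to.
std::vector<std::size_t> buildSuffixArray(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Binary search: the pattern occurs in text iff it is a prefix of some suffix.
bool suffixArrayContains(const std::string& text,
                         const std::vector<std::size_t>& sa,
                         const std::string& pattern) {
    std::size_t lo = 0;
    std::size_t hi = sa.size();
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        // Compare the first pattern.size() characters of the suffix with the pattern.
        const int cmp = text.compare(sa[mid], pattern.size(), pattern);
        if (cmp == 0) return true;
        if (cmp < 0) lo = mid + 1;
        else hi = mid;
    }
    return false;
}
```

For “google” the sorted positions are 5, 3, 0, 4, 2, 1 (e, gle, google, le, ogle, oogle), and searching for “good” probes “le”, then “gle”, then “google”, the same sequence as on the slide.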

  23. Performance Comparison Previous Question: Find the 100 MB string in 100 GB of content; what’s the worst-case time complexity? Brute Force: O(n*m) is about 32 years Suffix Array (n ≈ 2^37, m ≈ 2^27, lg n = 37): • Quick Sort the 100 GB: O(n lg n) = O(37*2^37) • Binary Search: O(m lg n) = O(37*2^27) Total is about 38*2^37 steps, about 10 minutes

  24. Approximate Pattern Question: “University” is the correct pattern, but we also allow typos, which means “Unversity” “Oniversity” “Univsitty” are also acceptable. Then find all acceptable patterns in the content. How?

  25. Distance Definition Strings s1 and s2 have distance K if K is the minimum number of steps needed to transform s1 into s2. Each step is one of the following actions: • Change a character • Insert a character • Delete a character e.g. String “wojtk” can be transformed to “wjeek” in 3 steps (and no fewer), so Distance(“wojtk”, “wjeek”)=3

  26. Distance Calculation Dynamic Programming (similar to the Longest Common Subsequence calculation) For s1[1,m], s2[1,n], with Distance(i, 0)=i and Distance(0, j)=j, and for 0<i<=m, 0<j<=n: Distance(i, j) = min{ Distance(i-1, j)+1, Distance(i, j-1)+1, Distance(i-1, j-1)+f(s1[i], s2[j]) } where f(a,b) = (a==b) ? 0 : 1
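The recurrence can be filled in as a table, row by row. The sketch below assumes the usual base cases (transforming to or from an empty prefix costs its length) and adds the substitution cost f to the diagonal term; editDistance is an illustrative name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Edit distance with unit-cost change, insert and delete.
std::size_t editDistance(const std::string& s1, const std::string& s2) {
    const std::size_t m = s1.size();
    const std::size_t n = s2.size();
    // d[i][j] = distance between the first i chars of s1 and the first j chars of s2.
    std::vector<std::vector<std::size_t>> d(m + 1, std::vector<std::size_t>(n + 1, 0));
    for (std::size_t i = 0; i <= m; ++i) d[i][0] = i;  // delete i characters
    for (std::size_t j = 0; j <= n; ++j) d[0][j] = j;  // insert j characters
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            const std::size_t cost = (s1[i - 1] == s2[j - 1]) ? 0 : 1;  // f(a, b)
            d[i][j] = std::min({d[i - 1][j] + 1,           // delete from s1
                                d[i][j - 1] + 1,           // insert into s1
                                d[i - 1][j - 1] + cost});  // change, or keep if equal
        }
    }
    return d[m][n];
}
```

The table has (m+1)*(n+1) cells and each is computed in constant time, so the running time is O(m*n). With this function, editDistance("wojtk", "wjeek") gives 3, matching the definition slide.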

  27. Final Thoughts • String searching is critical to most applications • It is a problem you have to deal with, unless you don’t care how indexOf() is implemented • 2D pattern matching is a hot topic in image/video research, e.g. object detection, face recognition, etc. • Many interesting questions remain, e.g. symmetric patterns, shortest common string • Be prepared to answer such questions in Microsoft/Google/etc. interviews
