
Approximate String Matching using Compressed Suffix Arrays

T.N.D. Huynh, W.K. Sung (National University of Singapore); W.K. Hon, T.W. Lam (The University of Hong Kong)



  1. Approximate String Matching using Compressed Suffix Arrays T.N.D. Huynh, W.K. Sung National University of Singapore W.K. Hon, T.W. Lam The University of Hong Kong

  2. String Matching Problem • Given a text T of length n over an alphabet Σ and a pattern P of length m, find all occurrences of P in T • E.g., T = barbara, P = bar ⇒ 2 occurrences, at positions 1 and 4 in T
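
As a quick check of the example on this slide, here is a minimal sketch that scans T directly, with no index. The function name `find_occurrences` is made up for illustration; positions are reported 1-based to match the slides.

```python
def find_occurrences(text: str, pattern: str) -> list[int]:
    """Return the 1-based start positions of all (possibly overlapping) occurrences."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)            # the slides count positions from 1
        start = text.find(pattern, start + 1)  # resume one character further on
    return positions

print(find_occurrences("barbara", "bar"))  # [1, 4]
```

This scan takes time proportional to the text for every query, which is the cost an index is meant to avoid.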

  3. Index for String Matching • Often, T is given in advance and is later matched against many different patterns P • Also, n ≫ m. E.g., T = Human Genome ~ 3 × 10^9, P = Gene ~ 10^3 ⇒ It pays to spend some space on an index for T that will speed up later matching

  4. Index for String Matching [Examples]

  5. k-Approximate String Matching • Find all occurrences of P in T that have at most k “errors” (mismatches or edits) relative to P • E.g., T = barbara, P = rba, k = 1 ⇒ 5 occurrences, at positions 1 (delete r from P), 2 (insert a into P), 3 (match), 4 (delete r from P), 6 (delete b from P)
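
The example can be reproduced with a brute-force sketch that uses no index at all. It assumes "errors" means unit-cost edit distance (insert, delete, substitute) and reports a start position when some substring beginning there is within k errors of P. The names `edit_distance` and `approx_starts` are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # drop ca
                           cur[j - 1] + 1,             # add cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def approx_starts(text: str, pattern: str, k: int) -> list[int]:
    """1-based starts of substrings of text within k errors of pattern."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n):
        # Only substrings whose length differs from m by at most k can qualify.
        lengths = range(max(1, m - k), min(n - start, m + k) + 1)
        if any(edit_distance(text[start:start + ln], pattern) <= k for ln in lengths):
            hits.append(start + 1)
    return hits

print(approx_starts("barbara", "rba", 1))  # [1, 2, 3, 4, 6]
```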

  6. Previous Work & Our Result (k=1)

  7. Our Index • Our index is Suffix Array + Inverse • Definition 1: The suffix array of T is an array SA such that SA[i] stores the starting position of the i-th smallest suffix of T

  8. An Example of Suffix Array • E.g., T = barbara
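
The suffix array for this example, together with its inverse, can be computed with a minimal sketch. It sorts the suffixes directly (slow, but fine at this size) and assumes a suffix that is a prefix of another sorts first. `build_suffix_array` is an illustrative name, not taken from the slides.

```python
def build_suffix_array(text: str) -> tuple[list[int], list[int]]:
    """Return (sa, inv), both 0-based: sa[r] is the start of the r-th smallest
    suffix, and inv[p] is the rank of the suffix that starts at p."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    inv = [0] * len(text)
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    return sa, inv

text = "barbara"
sa, inv = build_suffix_array(text)
for rank, pos in enumerate(sa, start=1):
    print(rank, pos + 1, text[pos:])  # rank i, SA[i] (1-based), suffix
```

In the 1-based numbering of the slides this gives SA = [7, 5, 2, 4, 1, 6, 3] and inverse [5, 3, 7, 4, 2, 6, 1].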

  9. Our Index: Suffix Array + Inverse • Lemma 1: Let P be a pattern that occurs in T. Then all (exact) occurrences of P correspond to a range, say [st, ed], in SA such that SA[st], SA[st+1], …, SA[ed] are the starting positions of all such occurrences.
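
Lemma 1 can be illustrated with two binary searches over SA. This is only a sketch: it compares the first |P| characters of each suffix with P, works with 0-based ranges, and uses the hypothetical name `exact_range`.

```python
def exact_range(text: str, sa: list[int], pattern: str):
    """Return the 0-based inclusive SA range (st, ed) of suffixes that start
    with pattern, or None if pattern does not occur in text."""
    m = len(pattern)

    def first(pred) -> int:
        # Smallest SA index whose length-m prefix satisfies pred. The prefixes
        # are in non-decreasing order, so pred goes from False to True once.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(text[sa[mid]:sa[mid] + m]):
                hi = mid
            else:
                lo = mid + 1
        return lo

    st = first(lambda s: s >= pattern)
    ed = first(lambda s: s > pattern) - 1
    return (st, ed) if st <= ed else None

text = "barbara"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
st, ed = exact_range(text, sa, "bar")
print(sorted(sa[i] + 1 for i in range(st, ed + 1)))  # [1, 4]
```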

  10. Our Index: Suffix Array + Inverse • Lemma 2: Given the range [st1, ed1] for P1 and the range [st2, ed2] for P2, the range [st, ed] for the concatenation P1P2 can be found in O(log n) time using SA and its inverse. • Idea of proof: Similar to Manber & Myers’ algorithm, using binary search.
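
One way to realise Lemma 2 is sketched below; it is a reading of the proof idea, not necessarily the authors' exact procedure. Within the range of P1 the suffixes are ordered by what follows P1, so the rank SA⁻¹[SA[i] + |P1|] increases with i, and a binary search finds the indices whose rank falls inside the range of P2. Ranges are 0-based, P2 is assumed non-empty, and the names `concat_range` and `naive_range` are made up.

```python
def concat_range(sa, inv, range1, len1, range2):
    """Range of P1P2 from the ranges of P1 (of length len1) and P2, or None."""
    if range1 is None or range2 is None:
        return None

    def rank_after(i: int) -> int:
        pos = sa[i] + len1                        # where the part after P1 begins
        return inv[pos] if pos < len(sa) else -1  # an empty remainder sorts first

    def first(pred) -> int:
        lo, hi = range1[0], range1[1] + 1
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(rank_after(mid)):
                hi = mid
            else:
                lo = mid + 1
        return lo

    st = first(lambda r: r >= range2[0])
    ed = first(lambda r: r > range2[1]) - 1
    return (st, ed) if st <= ed else None

def naive_range(text, sa, p):
    """Reference answer by linear scan, used only to check the sketch."""
    hits = [i for i, pos in enumerate(sa) if text.startswith(p, pos)]
    return (hits[0], hits[-1]) if hits else None

text = "barbara"
sa = sorted(range(len(text)), key=lambda i: text[i:])
inv = [0] * len(text)
for rank, pos in enumerate(sa):
    inv[pos] = rank

r = concat_range(sa, inv, naive_range(text, sa, "b"), 1, naive_range(text, sa, "ar"))
print(r, naive_range(text, sa, "bar"))  # both are (3, 4)
```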

  11. Our Index: Suffix Array + Inverse • Corollary 3: Given the range [st, ed] for P, and an array C such that C[c] stores the total number of occurrences in T of characters that are smaller than or equal to c, the range of cP can be found in O(log n) time. • Proof: Follows directly from Lemma 2, since [C[c-1]+1, C[c]] is the range of SA that corresponds to the single character c (here c-1 denotes the character preceding c in Σ).
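
The corollary can be sketched as follows, assuming C[c] counts the characters of T that are no larger than c, which is the reading under which [C[c-1]+1, C[c]] is the range of c. Ranges are 0-based in the code, and `build_c`, `char_range` and `prepend_char` are illustrative names.

```python
from collections import Counter
from itertools import accumulate

def build_c(text: str) -> dict[str, int]:
    """Map each character c of text to the number of characters <= c."""
    counts = Counter(text)
    alphabet = sorted(counts)
    return dict(zip(alphabet, accumulate(counts[ch] for ch in alphabet)))

def char_range(c_table, ch):
    """0-based inclusive SA range of the suffixes that start with ch, or None."""
    if ch not in c_table:
        return None
    below = max((c_table[x] for x in c_table if x < ch), default=0)  # C[c-1]
    return (below, c_table[ch] - 1)

def prepend_char(sa, inv, c_table, ch, range_p):
    """Range of ch + P from the range of P (Lemma 2 with P1 = ch)."""
    rng = char_range(c_table, ch)
    if rng is None or range_p is None:
        return None

    def rank_after(i: int) -> int:
        pos = sa[i] + 1                           # skip the leading ch
        return inv[pos] if pos < len(sa) else -1

    def first(pred) -> int:
        lo, hi = rng[0], rng[1] + 1
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(rank_after(mid)):
                hi = mid
            else:
                lo = mid + 1
        return lo

    st = first(lambda r: r >= range_p[0])
    ed = first(lambda r: r > range_p[1]) - 1
    return (st, ed) if st <= ed else None

text = "barbara"
sa = sorted(range(len(text)), key=lambda i: text[i:])
inv = [0] * len(text)
for rank, pos in enumerate(sa):
    inv[pos] = rank
c_table = build_c(text)
print(c_table)                                      # {'a': 3, 'b': 5, 'r': 7}
print(prepend_char(sa, inv, c_table, "b", (1, 2)))  # (1, 2) is the range of "ar"
```

The last call returns (3, 4), the 0-based range of "bar".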

  12. 1-Approximate Matching Algorithm [The delete case] • Step 1: Find the range [st_i, ed_i] for P[1..i], for every i ∈ [1, m] • Step 2: Find the range [st'_i, ed'_i] for P[i..m], for every i ∈ [1, m] • Step 3: For every i ∈ [1, m], find the range of P[1..i-1] P[i+1..m] and report the occurrences. ⇒ Time complexity: O(m log n + occ)
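
The three steps can be put together in a small end-to-end sketch. It is a simplification under stated assumptions: the suffix array is built by naive sorting, and the prefix and suffix ranges are each found by a fresh binary search with string comparisons rather than incrementally, so it does not achieve the O(m log n + occ) bound. All function names are hypothetical; positions are reported 1-based.

```python
def build_index(text):
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    inv = [0] * len(text)
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    return sa, inv

def bisect_first(lo, hi, pred):
    """Smallest i in [lo, hi) with pred(i) true (pred is monotone), else hi."""
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

def exact_range(text, sa, p):
    """Lemma 1: 0-based SA range of suffixes starting with p, or None."""
    prefix = lambda i: text[sa[i]:sa[i] + len(p)]
    st = bisect_first(0, len(sa), lambda i: prefix(i) >= p)
    ed = bisect_first(0, len(sa), lambda i: prefix(i) > p) - 1
    return (st, ed) if st <= ed else None

def concat_range(sa, inv, r1, len1, r2):
    """Lemma 2: range of P1P2 from the ranges of P1 and (non-empty) P2."""
    if r1 is None or r2 is None:
        return None
    def rank_after(i):
        pos = sa[i] + len1
        return inv[pos] if pos < len(sa) else -1
    st = bisect_first(r1[0], r1[1] + 1, lambda i: rank_after(i) >= r2[0])
    ed = bisect_first(r1[0], r1[1] + 1, lambda i: rank_after(i) > r2[1]) - 1
    return (st, ed) if st <= ed else None

def one_deletion_matches(text, pattern):
    """1-based positions where pattern with one character deleted occurs."""
    sa, inv = build_index(text)
    m = len(pattern)
    pre = [exact_range(text, sa, pattern[:i]) for i in range(m + 1)]  # Step 1
    suf = [exact_range(text, sa, pattern[i:]) for i in range(m + 1)]  # Step 2
    hits = set()
    for i in range(m):                                                # Step 3
        if i == 0:
            rng = suf[1]        # nothing precedes the deleted character
        elif i == m - 1:
            rng = pre[m - 1]    # nothing follows the deleted character
        else:
            rng = concat_range(sa, inv, pre[i], i, suf[i + 1])
        if rng is not None:
            hits.update(sa[j] + 1 for j in range(rng[0], rng[1] + 1))
    return sorted(hits)

print(one_deletion_matches("barbara", "rba"))  # [1, 3, 4, 6]
```

Position 2 from the earlier example is absent because it needs an insertion, and position 3 appears because "rb" (P without its last character) is a prefix of the exact match there.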

  13. 1-Approximate Matching Algorithm • For the mismatch case and the other edit cases, the algorithm is similar, except that in Step 3 we have to find the range for |Σ|m strings (instead of m strings in the delete case). ⇒ Time complexity: O(|Σ| m log n + occ)
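
To see where the |Σ|m factor comes from, the sketch below simply enumerates the candidate strings for the mismatch and insertion cases. It only lists the strings whose ranges Step 3 would look up and does not query any index; `one_error_variants` is an illustrative name.

```python
def one_error_variants(pattern: str, alphabet: str) -> tuple[set[str], set[str]]:
    """Return (substitutions, insertions): all strings obtained from pattern
    by changing one character, or by inserting one character anywhere."""
    m = len(pattern)
    subs = {pattern[:i] + c + pattern[i + 1:]
            for i in range(m) for c in alphabet if c != pattern[i]}
    ins = {pattern[:i] + c + pattern[i:]
           for i in range(m + 1) for c in alphabet}
    return subs, ins

subs, ins = one_error_variants("rba", "abr")
print(len(subs), len(ins))  # 6 9
print("arba" in ins)        # True: this variant gives the match at position 2
```

Both sets have at most about |Σ|m members, compared with m candidates in the delete case.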

  14. The General Case • Our algorithm can be extended to solve the general k-approximate matching problem. The time complexity becomes O(|Σ|^k m^k log n + occ) • Further, if we replace SA + Inverse by the CSA of Grossi & Vitter, the space becomes O(n) bits, and the time increases by an O(log n) factor

  15. Future Work • Can we improve the time to O(m + occ) for the 1-approximate matching problem?
