Counting Suffix Arrays and Strings
Suffix Array Data Structure

Text to be indexed:

  position:  1  2  3  4  5  6  7  8  9 10 11 12 13
  text:      T  C  T  T  C  T  C  T  T  C  T  C  $

Suffix array – lexicographically sorted list of all suffixes:

  13 - $
  12 - C$
  10 - CTC$
   5 - CTCTTCTC$
   7 - CTTCTC$
   2 - CTTCTCTTCTC$
  11 - TC$
   9 - TCTC$
   4 - TCTCTTCTC$
   6 - TCTTCTC$
   1 - TCTTCTCTTCTC$
   8 - TTCTC$
   3 - TTCTCTTCTC$

Dagstuhl, May 2006 - Jens Stoye
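The sorted list above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `suffix_array` is made up for illustration, the construction is the naive one (sort all suffixes directly), and it relies on `$` sorting before the letters in ASCII.

```python
def suffix_array(text):
    """Return the 1-based suffix array of text by sorting all suffixes."""
    # Naive construction: compare whole suffixes. Fine for slide-sized
    # examples, far too slow for real genomes.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


text = "TCTTCTCTTCTC$"
for start in suffix_array(text):
    # Print each entry in the same "position - suffix" layout as the slide.
    print(f"{start:2d} - {text[start - 1:]}")
```
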

Overview

• Classify strings sharing the same suffix array
• Counting strings sharing the same suffix array
• Counting suffix arrays
  (lower bound for suffix array compression)
• Summation identities
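
1. Classify Strings for Suffix Array

t - string of length n, P - permutation of {1,..., n}, R - inverse of P (taking R[n+1] = 0, so that the comparison below is defined when P[i] = n).

Theorem: P is the suffix array of t if and only if for all \( i \in \{1, \dots, n-1\} \)
• \( t[P[i]] \le t[P[i+1]] \), and
• \( t[P[i]] = t[P[i+1]] \Rightarrow R[P[i]+1] < R[P[i+1]+1] \),
where, given the first condition, the second is the same as
• \( R[P[i]+1] > R[P[i+1]+1] \Rightarrow t[P[i]] < t[P[i+1]] \).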
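
1. Classify Strings for Suffix Array

a) \( t[P[i]] \le t[P[i+1]] \), and
b) \( R[P[i]+1] > R[P[i+1]+1] \Rightarrow t[P[i]] < t[P[i+1]] \)

Text to be indexed:

  position:  1 2 3 4 5
  t       =  A A B C B

  i   P[i]   t[P[i]]   suffix
  1   1      A         AABCB
  2   2      A         ABCB
  3   5      B         B
  4   3      B         BCB
  5   4      C         CB

R+-descent: a position i with R[P[i]+1] > R[P[i+1]+1]. In this example the R+-descents are i = 2 and i = 4, and at both the first character strictly increases (A < B and B < C).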
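Conditions a) and b) translate almost literally into code. The sketch below is illustrative only: the function names are hypothetical, and it assumes the convention R[n+1] = 0 (the empty suffix ranks before every other suffix), which the transcript does not spell out.

```python
def r_plus_descents(P):
    """Return the 1-based positions i with R[P[i]+1] > R[P[i+1]+1]."""
    n = len(P)
    R = [0] * (n + 2)              # assumption: R[n+1] = 0 (empty suffix ranks first)
    for rank, start in enumerate(P, start=1):
        R[start] = rank            # R is the inverse permutation of P
    # The Python list is 0-based, so slide position i compares P[i-1] and P[i].
    return [i for i in range(1, n) if R[P[i - 1] + 1] > R[P[i] + 1]]


def satisfies_theorem(P, t):
    """Check conditions a) and b) for a permutation P and a string t."""
    descents = set(r_plus_descents(P))
    for i in range(1, len(P)):
        left, right = t[P[i - 1] - 1], t[P[i] - 1]
        if left > right:                       # violates a)
            return False
        if i in descents and left == right:    # violates b)
            return False
    return True


P = [1, 2, 5, 3, 4]
print(r_plus_descents(P))              # positions of the R+-descents
print(satisfies_theorem(P, "AABCB"))   # the string from the slide
print(satisfies_theorem(P, "AAACB"))   # a string with a different suffix array
```
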

1. Classify Strings for Suffix Array

Equivalences between strings

Text to be indexed:

  position:  1 2 3 4 5
  t       =  A A B C B
  t2      =  A A C D C   (order-equivalent)
  t3      =  A B D E D   (order-distinct)

  i   P[i]   t[P[i]..n]   t2[P[i]..n]   t3[P[i]..n]
  1   1      AABCB        AACDC         ABDED
  2   2      ABCB         ACDC          BDED
  3   5      B            C             D
  4   3      BCB          CDC           DED
  5   4      CB           DC            ED

2. Counting Strings for Suffix Array

Text to be indexed:

  position:  1 2 3 4 5
  t       =  A A B C B   (base string)
  t2      =  A A C D C
  t3      =  A B D E D

Base string plus non-decreasing sequences:

  i   P[i]   t[P[i]]   t2[P[i]]     t3[P[i]]
  1   1      A         A + 0 = A    A + 0 = A
  2   2      A         A + 0 = A    A + 1 = B
  3   5      B         B + 1 = C    B + 2 = D
  4   3      B         B + 1 = C    B + 2 = D
  5   4      C         C + 1 = D    C + 2 = E

2. Counting Strings for Suffix Array

Suffix array P of length n with d R+-descents.

Number of strings over an alphabet of size a for P
= number of non-decreasing sequences of length n over a - d elements
\[ = \binom{n + a - d - 1}{n} \]
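A brute-force check makes the counting argument concrete. The sketch below assumes the closed form C(n + a - d - 1, n), which is the standard count of non-decreasing sequences of length n over a - d elements. The helper names are hypothetical and characters are modelled as integers 0..a-1.

```python
from itertools import product
from math import comb


def suffix_array(t):
    """Naive 1-based suffix array of a sequence t (no sentinel)."""
    return tuple(sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:]))


def count_r_plus_descents(P):
    """Number of positions i with R[P[i]+1] > R[P[i+1]+1], taking R[n+1] = 0."""
    n = len(P)
    R = [0] * (n + 2)
    for rank, start in enumerate(P, start=1):
        R[start] = rank
    return sum(R[P[i - 1] + 1] > R[P[i] + 1] for i in range(1, n))


def strings_with_suffix_array(P, a):
    """Brute force: count strings over {0, ..., a-1} whose suffix array is P."""
    return sum(suffix_array(t) == tuple(P) for t in product(range(a), repeat=len(P)))


P = (1, 2, 5, 3, 4)          # suffix array of AABCB
n, d = len(P), count_r_plus_descents(P)
for a in range(1, 6):
    brute = strings_with_suffix_array(P, a)
    closed = comb(n + a - d - 1, n)
    print(f"a={a}: brute force {brute}, closed form {closed}")
```

For the example permutation, no string exists until the alphabet has d + 1 = 3 characters, at which point the base string AABCB is the only one.
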

2. Counting Strings for Suffix Array

Suffix array P of length n with d R+-descents.

Number of strings composed of exactly k distinct characters (all k characters of a k-letter alphabet appear) for P is
\[ \binom{n - d - 1}{k - d - 1} \]

2. Counting Strings for Suffix Array

[Plot: number of strings over an alphabet of size 20 for suffix arrays of length n with 10 R+-descents]

2. Counting Strings for Suffix Array

Suffix array P of length n with d R+-descents

• Number of order-distinct strings over an alphabet of size a is
  \[ \sum_{k=d+1}^{a} \binom{n - d - 1}{k - d - 1} \]
• Number of order-distinct strings where all k distinct characters must appear is
  \[ \binom{n - d - 1}{k - d - 1} \]
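The order-distinct counts can be explored by enumeration. The sketch below rests on two assumptions that the transcript does not state. First, two strings are taken to be order-equivalent when their characters compare the same way at every pair of positions. Second, the expected count for k distinct characters is taken to be C(n - d - 1, k - d - 1): the d R+-descents force a strict increase, and k - d - 1 further strict increases are chosen among the remaining n - d - 1 positions. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import comb


def suffix_array(t):
    """Naive 1-based suffix array of a sequence t (no sentinel)."""
    return tuple(sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:]))


def order_pattern(t):
    """Canonical representative of t's order class: replace characters by ranks."""
    ranks = {c: r for r, c in enumerate(sorted(set(t)))}
    return tuple(ranks[c] for c in t)


def order_classes_by_k(P, a):
    """Map k -> number of order-distinct strings over {0..a-1} for P with k distinct characters."""
    classes = {order_pattern(t)
               for t in product(range(a), repeat=len(P))
               if suffix_array(t) == tuple(P)}
    return dict(Counter(len(set(p)) for p in classes))


P = (1, 2, 5, 3, 4)      # suffix array of AABCB
n, d = len(P), 2         # d = 2: the R+-descents of P are at i = 2 and i = 4
observed = order_classes_by_k(P, a=5)
for k in sorted(observed):
    print(f"k={k}: observed {observed[k]}, formula {comb(n - d - 1, k - d - 1)}")
```
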

3. Counting Suffix Arrays

Definition: Let P be a permutation of {1,..., n}. Position \( i \in \{1, \dots, n-1\} \) is a permutation descent if P[i] > P[i+1].

Definition: The Eulerian number \( \left\langle{n \atop d}\right\rangle \) gives the number of permutations of {1,...,n} with exactly d permutation descents.
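
3. Counting Suffix Arrays

Well-known fact: Recursive enumeration of Eulerian numbers
• \( \left\langle{n \atop 0}\right\rangle = 1 \),
• \( \left\langle{n \atop d}\right\rangle = 0 \) for \( n \le d \), and
• \( \left\langle{n \atop d}\right\rangle = (d+1) \left\langle{n-1 \atop d}\right\rangle + (n-d) \left\langle{n-1 \atop d-1}\right\rangle \)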

3. Counting Suffix Arrays

Definition: Let A(n,d) be the number of permutations of length n with d R+-descents.

Observation:
• A(n,0) = 1
• A(n,d) = 0 for \( n \le d \)
• remaining case: see next

3. Counting Suffix Arrays

Text to be indexed:

  position:  1 2 3 4 5          position:  1 2 3 4 5 6
  t       =  A A B C B          At      =  A A A B C B

  i   Pt[i]   suffix of t       i   PAt[i]   suffix of At
  1   1       AABCB             1   1        AAABCB
  2   2       ABCB              2   2        AABCB
  3   5       B                 3   3        ABCB
  4   3       BCB               4   6        B
  5   4       CB                5   4        BCB
                                6   5        CB

(d+1) possible positions without additional R+-descent

3. Counting Suffix Arrays

Text to be indexed:

  position:  1 2 3 4 5          position:  1 2 3 4 5 6
  t       =  A A B C B          Bt      =  B A A B C B

  i   Pt[i]   suffix of t       i   PBt[i]   suffix of Bt
  1   1       AABCB             1   2        AABCB
  2   2       ABCB              2   3        ABCB
  3   5       B                 3   6        B
  4   3       BCB               4   1        BAABCB
  5   4       CB                5   4        BCB
                                6   5        CB

(d+1) possible positions without additional R+-descent

3. Counting Suffix Arrays

Together:
• A(n,0) = 1,
• A(n,d) = 0 for \( n \le d \), and
• A(n,d) = (d+1) A(n-1,d) + (n-d) A(n-1,d-1)

Theorem: The number A(n,d) of permutations of length n with d R+-descents is the Eulerian number \( \left\langle{n \atop d}\right\rangle \).
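The recurrence is easy to test against an exhaustive count for small n. In the sketch below, `A` implements the three rules from the slide and `descent_distribution` counts R+-descents over all permutations. Both names are illustrative, and R[n+1] = 0 is again an assumed convention.

```python
from collections import Counter
from functools import lru_cache
from itertools import permutations


@lru_cache(maxsize=None)
def A(n, d):
    """Recurrence from the slide (intended for n >= 1)."""
    if d < 0 or d >= n:
        return 0                   # A(n, d) = 0 for n <= d
    if d == 0:
        return 1                   # A(n, 0) = 1
    return (d + 1) * A(n - 1, d) + (n - d) * A(n - 1, d - 1)


def count_r_plus_descents(P):
    """Number of positions i with R[P[i]+1] > R[P[i+1]+1], taking R[n+1] = 0."""
    n = len(P)
    R = [0] * (n + 2)
    for rank, start in enumerate(P, start=1):
        R[start] = rank
    return sum(R[P[i - 1] + 1] > R[P[i] + 1] for i in range(1, n))


def descent_distribution(n):
    """For d = 0..n-1, the number of permutations of {1..n} with d R+-descents."""
    counts = Counter(count_r_plus_descents(P) for P in permutations(range(1, n + 1)))
    return [counts[d] for d in range(n)]


for n in range(1, 7):
    row = [A(n, d) for d in range(n)]
    assert descent_distribution(n) == row   # exhaustive check for small n
    print(n, row)
```
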
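
3. Counting Suffix Arrays

• The number of distinct suffix arrays of length n for strings over an alphabet of size a:
  \[ \sum_{d=0}^{a-1} \left\langle{n \atop d}\right\rangle \]
• Lower bound for compressibility of suffix arrays in the Kolmogorov sense:
  \[ \log_2 \sum_{d=0}^{a-1} \left\langle{n \atop d}\right\rangle \text{ bits} \]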
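A suffix array with d R+-descents needs at least d + 1 distinct characters, so over an alphabet of size a exactly those with d ≤ a - 1 can occur. The sketch below checks this by enumeration for a small n and prints the base-2 logarithm of the count as an information-theoretic bound. Reading the slide's bound as this logarithm is an assumption, as are the helper names.

```python
from functools import lru_cache
from itertools import product
from math import log2


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number via the standard recurrence (n >= 1)."""
    if d < 0 or d >= n:
        return 0
    if d == 0:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def suffix_array(t):
    """Naive 1-based suffix array of a sequence t (no sentinel)."""
    return tuple(sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:]))


def count_suffix_arrays(n, a):
    """Brute force: distinct suffix arrays over all strings in {0..a-1}^n."""
    return len({suffix_array(t) for t in product(range(a), repeat=n)})


n = 6
for a in range(1, 5):
    brute = count_suffix_arrays(n, a)
    closed = sum(eulerian(n, d) for d in range(a))
    print(f"a={a}: {brute} suffix arrays (formula {closed}), "
          f"at least {log2(closed):.2f} bits")
```
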

3. Counting Suffix Arrays

[Plot: number of distinct suffix arrays of length n for strings over an alphabet of size 20]

3. Counting Suffix Arrays

[Plot: number of distinct suffix arrays of length n for strings over an alphabet of size 4]
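
4. Summation Identities

• Worpitzky's identity, by summing up the number of strings of length n for each suffix array:
  \[ a^n = \sum_{d=0}^{n-1} \left\langle{n \atop d}\right\rangle \binom{n + a - d - 1}{n} \]
• Summation rule for Eulerian numbers to generate the Stirling numbers of the second kind:
  \[ k! \, S(n,k) = \sum_{d=0}^{k-1} \left\langle{n \atop d}\right\rangle \binom{n - d - 1}{k - d - 1} \]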
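Both identities can be checked numerically. The sketch below assumes they take the usual forms: a^n equals the sum over d of the Eulerian number times C(n + a - d - 1, n), and k! S(n,k) equals the sum over d of the Eulerian number times C(n - d - 1, k - d - 1). The function names are illustrative only.

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number via the standard recurrence (n >= 1)."""
    if d < 0 or d >= n:
        return 0
    if d == 0:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: partitions of n items into k blocks."""
    if n == k:
        return 1
    if n == 0 or k == 0:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def worpitzky_rhs(n, a):
    """Sum over suffix arrays, grouped by d, of the number of strings per suffix array."""
    return sum(eulerian(n, d) * comb(n + a - d - 1, n) for d in range(n))


def stirling_rhs(n, k):
    """Same sum, restricted to strings that use all k characters."""
    return sum(eulerian(n, d) * comb(n - d - 1, k - d - 1) for d in range(k))


for n in range(1, 8):
    for a in range(1, 8):
        assert worpitzky_rhs(n, a) == a ** n
    for k in range(1, n + 1):
        assert stirling_rhs(n, k) == factorial(k) * stirling2(n, k)
print("both identities hold for n, a < 8")
```
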

Summary

• Constructive proofs to count strings sharing the same suffix array
• Constructive proof to count distinct suffix arrays, yielding a lower bound for suffix array compression
• Constructive proofs for Worpitzky's identity and the summation rule of Eulerian numbers to count Stirling numbers of the second kind

Outlook

• Efficient enumeration algorithm for suffix arrays
• Compressed suffix arrays for fast querying in bioinformatics applications
• Average case analysis under non-uniform model

Thank you for your attention!