
Alternative Algorithms for Lyndon Factorization

This study explores alternative algorithms for Lyndon factorization, including Duval's algorithm and variations using LF-skip and run-length encoding. The algorithms are analyzed for efficiency and their application in string compression.

nfiorentino

Presentation Transcript


  1. Alternative Algorithms for Lyndon Factorization Sukhpal Singh Ghuman, Emanuele Giaquinta, and Jorma Tarhio, Aalto University, Finland

  2. Lyndon Word • Given two strings w and w′, w′ is a rotation of w if w=uv and w′=vu, for some strings u and v. • A string is a Lyndon word if it is lexicographically (alphabetically) smaller than all its proper rotations.

  3. Lyndon Word • Example: w = ab and w′ = ba, with u = a and v = b. • w is lexicographically smaller than its only proper rotation w′. • Hence w is a Lyndon word.

  4. Examples of Lyndon words • Lyndon words • a • ab • aabab • Non-Lyndon words • ba • abaac • abcaac
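
The definition on the preceding slides can be checked directly by comparing a word with each of its proper rotations. The following is a minimal sketch, not part of the original presentation; the function name `is_lyndon` is hypothetical, and treating the empty string as a non-Lyndon word is an assumption.

```python
def is_lyndon(w: str) -> bool:
    """Return True if w is strictly smaller than every proper rotation of w."""
    if not w:
        return False  # assumption: the empty string is not counted as a Lyndon word
    # The rotation at split point i is w[i:] + w[:i]; i = 0 would give w itself.
    return all(w < w[i:] + w[:i] for i in range(1, len(w)))


# The six example words from the slide above.
for word in ["a", "ab", "aabab", "ba", "abaac", "abcaac"]:
    print(word, is_lyndon(word))
```

This brute-force check takes quadratic time in the word length, so it is only meant to illustrate the definition.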

  5. Lyndon factorization • A word w can be factorized as w = w_0 w_1 w_2 … w_{m-1} such that each factor is a Lyndon word. • Every string has a unique factorization into Lyndon words in which the sequence of factors is non-increasing in lexicographical order. • The Lyndon factorization is used in a recent method for sorting the suffixes of a text.

  6. Examples of Lyndon factorization • abcaabcaaabcaaaabc -> abc aabc aaabc aaaabc • aacaacaacaad -> aacaacaacaad (a single factor, since the string is itself a Lyndon word) • abacabab -> abac ab ab

  7. Duval’s algorithm • To factorize a word w, the algorithm computes the longest prefix w1 of w = w1w′ that is a Lyndon word, outputs it, and restarts the process on w′. • Non-empty prefixes of Lyndon words are all of the form (uv)^k u, where uv is a Lyndon word, v is non-empty and k ≥ 1. • Duval’s algorithm computes the factorization by parsing the word from left to right.

  8. Computing Lyndon factorization for T=aabaabaaac • For the string T = aabaabaaac, the parsed prefix P = T[1…i] is a prefix of a Lyndon word and therefore equals (uv)^k u for some strings u, v and integer k ≥ 1. • There are then two cases, depending on the next symbol to be read: either the extended string is still a prefix of a Lyndon word, or it is not.

  9. Computing Lyndon factorization for T=aabaabaaac • For i = 3, P = aab, with u = the empty string, v = aab and k = 1. • The next symbol to read is 'a', and aaba is still a prefix of a Lyndon word. The next iteration then starts with P = aaba.

  10. Computing Lyndon factorization for T=aabaabaaac • For i = 6, P = aabaab, which is (uv)^k u with u = the empty string, v = aab and k = 2. • The next symbol to read is 'a'; after reading 'aaa', the algorithm finds that aabaabaaa is not a prefix of a Lyndon word. • The output is two copies of aab, and the next iteration starts on the suffix aaac of T with P = a.
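
The walkthrough above can be reproduced with a short implementation of Duval's algorithm. This is a sketch in the commonly used two-pointer formulation, not code from the presentation; the function name `duval` and the 0-based indexing are choices made here for illustration.

```python
from typing import List


def duval(s: str) -> List[str]:
    """Return the Lyndon factorization of s as a list of factors."""
    n = len(s)
    factors: List[str] = []
    i = 0  # start of the part of s that is not yet factorized
    while i < n:
        j = i + 1  # next symbol to read
        k = i      # symbol that s[j] is compared with
        # Extend the parsed prefix s[i:j] while it has the form (uv)^k u.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i   # s[i:j+1] is a Lyndon word, so the period restarts
            else:
                k += 1  # s[j] continues the current period
            j += 1
        # The period length is j - k; output each full copy of the period.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


print(duval("aabaabaaac"))  # the example string T from the slides
```

For T = aabaabaaac the first pass stops when the ninth symbol 'a' is compared with 'b', emits aab twice, and restarts on the suffix aaac, which matches the steps described on the slides.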

  11. Variations of Duval’s algorithm • The first variation, LF-skip, is designed to skip characters of the string. • The second variation is for strings compressed with run-length encoding.

  12. LF skip algorithm • The algorithm can skip a significant portion of the characters of the string if it contains runs of the smallest character. • Let w be a word over an alphabet Σ with Lyndon factorization CFL(w) = w_1, w_2, ..., w_m.

  13. LF skip algorithm • Let c be the smallest symbol in Σ. • Suppose there exist k ≥ 2 and i ≥ 1 such that c^k is a prefix of w_i. • If the last symbol of w is not c, then c^k is a prefix of each of w_i, w_{i+1}, ..., w_m. • This property is used to devise an algorithm for Lyndon factorization that skips symbols.
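
The property behind LF-skip can be tested empirically on small inputs. The sketch below is not the LF-skip algorithm itself; it only checks the stated property, using a compact Duval-style helper. The names `lyndon_factors` and `skip_property_holds` are hypothetical, and taking c to be the smallest symbol that actually occurs in w (rather than in a separately given alphabet Σ) is an assumption.

```python
from typing import List


def lyndon_factors(s: str) -> List[str]:
    """Compact Duval-style factorization, used only as a helper here."""
    out: List[str] = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            out.append(s[i:i + j - k])
            i += j - k
    return out


def skip_property_holds(w: str, k: int = 2) -> bool:
    """Check: once a factor starts with c^k, every later factor does too."""
    c = min(w)  # assumption: smallest symbol occurring in the non-empty word w
    if w[-1] == c:
        return True  # the property makes no claim when w ends with c
    run = c * k
    factors = lyndon_factors(w)
    for i, factor in enumerate(factors):
        if factor.startswith(run):
            return all(f.startswith(run) for f in factors[i:])
    return True  # no factor starts with c^k, so there is nothing to check


print(lyndon_factors("aabaabaaac"), skip_property_holds("aabaabaaac"))
```

Because every later factor must begin with c^k, a factorization algorithm only needs to examine positions where c^k occurs, which is what allows characters to be skipped.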

  14. LF skip algorithm • Consider the alphabet {a,b,c,…} and assume that the last character of the string is not a. • Let w_i start with aaad. Then the 4-symbol prefix of w_{i+1} belongs to the set P = {aaaa,aaab,aaac,aaad}. • To skip characters, we search for occurrences of P with an algorithm that is sublinear on average (e.g. SBNDM). • aaadxxxxxxxxxxxaaac ---^---^--^^+++

  15. Run Length Encoding • Run-length encoding (RLE) is a very simple form of data compression in which runs of symbols are stored as a single data value. • Given string: aaaaaabbbccaaabbbccbbbbbaaa • RLE: a6b3c2a3b3c2b5a3
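
Run-length encoding as described above can be sketched in a few lines. The function names `rle_encode` and `rle_to_text` are hypothetical, and the symbol-followed-by-count text form simply mirrors the slide's notation; it is ambiguous if the string itself contains digits.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(s: str) -> List[Tuple[str, int]]:
    """Encode s as a list of (symbol, run length) pairs."""
    # groupby yields one group per maximal run of equal adjacent symbols.
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


def rle_to_text(runs: List[Tuple[str, int]]) -> str:
    """Render runs in the slide's notation, e.g. [('a', 6), ('b', 3)] -> 'a6b3'."""
    return "".join(f"{symbol}{length}" for symbol, length in runs)


runs = rle_encode("aaaaaabbbccaaabbbccbbbbbaaa")
print(rle_to_text(runs))  # a6b3c2a3b3c2b5a3
```

The number of pairs returned is the length ρ of the run-length encoded string referred to later in the complexity slide.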

  16. Lyndon factorization of RLE string • The second variation is for strings compressed with run-length encoding. • It is preferable when strings are already stored in RLE.

  17. Lyndon factorization of RLE string • The algorithm is based on Duval’s original algorithm and on a combinatorial property relating the Lyndon factorization of a string to its RLE. • A run of length t in the RLE is either contained in one factor of the Lyndon factorization, or it corresponds to t unit-length factors.
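
The combinatorial property relating runs and factors can be checked on uncompressed strings. The sketch below is not the O(ρ) algorithm from the presentation; it factorizes the plain string with a compact Duval-style helper and then tests each run against the factor boundaries. The names `lyndon_factors` and `runs_respect_factors` are hypothetical.

```python
from itertools import groupby
from typing import List


def lyndon_factors(s: str) -> List[str]:
    """Compact Duval-style factorization, used only as a helper here."""
    out: List[str] = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            out.append(s[i:i + j - k])
            i += j - k
    return out


def runs_respect_factors(s: str) -> bool:
    """Each run lies inside one factor, or is exactly t unit-length factors."""
    cuts, pos = {0}, 0  # positions where one factor ends and the next begins
    for factor in lyndon_factors(s):
        pos += len(factor)
        cuts.add(pos)
    start = 0
    for _, group in groupby(s):
        end = start + len(list(group))  # the run occupies s[start:end]
        split = any(p in cuts for p in range(start + 1, end))
        # A split run must have a boundary at every position start..end.
        if split and not all(p in cuts for p in range(start, end + 1)):
            return False
        start = end
    return True


print(runs_respect_factors("aabaabaaac"), runs_respect_factors("baaa"))
```

This is why an algorithm can read whole runs instead of single symbols: a factor boundary never falls strictly inside a run unless the entire run breaks into unit-length factors.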

  18. Computing Lyndon factorization from RLE for T=aabaabaaac • For the string T = aabaabaaac, the parsed prefix P = T[1…i] is a prefix of a Lyndon word and equals (uv)^k u for some strings u, v and integer k ≥ 1. • The RLE algorithm works similarly, except that runs are read instead of single symbols.

  19. Computing Lyndon factorization from RLE for T=aabaabaaac • For i = 3, P = aab. The next run to be read is 'aa', and aabaa is still a prefix of a Lyndon word. The next iteration then starts with P = aabaa. • For i = 6, P = aabaab. The next run to be read is 'aaa', and aabaabaaa is not a prefix of a Lyndon word. • The next iteration starts on the suffix aaac of T with P = aaa.

  20. Complexity • Given a run-length encoded string R of length ρ, the algorithm computes the Lyndon factorization of R in O(ρ) time. • It is preferable to Duval’s algorithm when strings are stored or maintained in run-length encoding.

  21. Experimental results • The LF-skip algorithm and Duval’s algorithm were compared on various texts. • LF-skip gave a significant speed-up over Duval’s algorithm. • The following table shows the speed-ups for random texts of 5 MB with various alphabet sizes.

  22. Speed-up of LF-skip

  23. Conclusion • Two variations of Duval’s algorithm for computing the Lyndon factorization of a string are presented. • The first algorithm is designed to skip a significant portion of the characters of the string. • Experimental results show that it is considerably faster than Duval’s original algorithm. • The second algorithm is for strings compressed with run-length encoding and computes the Lyndon factorization of a run-length encoded string of length ρ in O(ρ) time.

  24. THANK YOU
