Computing longest common substring and all palindromes from compressed strings
Wataru Matsubara (1), Shunsuke Inenaga (2), Akira Ishino (1), Ayumi Shinohara (1), Tomoyuki Nakamura (1), Kazuo Hashimoto (1)
(1) Graduate School of Information Sciences, Tohoku University, Japan
(2) Department of Computer Science and Communication Engineering, Kyushu University, Japan
What is a compressed string algorithm?
Input text: A palindrome is a symmetric string. It is interesting on their own as word puzzles. For example, "I prefer pi", "Borrow or rob?", and "Was it a bar or a bat I saw?" and so on. …
Input text → find palindromes → output
Output: mm, isi, zz, ipreferpi, borroworrob, wasitabarorabatisaw, oo, …
What is a compressed string algorithm?
Compressed text: e)%eARY)(ReJD)OIHOIFEnkkdiwe02kfo)J”LPEPJ9wEOW*# eO…
One solution would be to decompress the compressed text. However, the decompressed size can be exponentially large with respect to the compressed size.
Compressed text → decompress → decompressed text → find palindromes → output
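The "decompress, then find palindromes" baseline from this slide can be sketched in Python as below. This is a minimal illustration on already decompressed text, not the compressed-string algorithm of the talk. The function name `maximal_palindromes`, the `min_len` parameter, and the choice to ignore case, spaces and punctuation are assumptions made for this example.

```python
def maximal_palindromes(text, min_len=2):
    """Return the set of maximal palindromes of length >= min_len.

    Assumption for illustration: only letters and digits count,
    and the comparison is case-insensitive.
    """
    s = "".join(ch.lower() for ch in text if ch.isalnum())
    found = set()
    n = len(s)
    # 2n - 1 centers: even values sit on a character, odd values between two.
    for center in range(2 * n - 1):
        lo = center // 2
        hi = lo + center % 2
        # Expand around the center while the two ends agree.
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
        # The maximal palindrome for this center is s[lo + 1:hi].
        if hi - lo - 1 >= min_len:
            found.add(s[lo + 1:hi])
    return found


print(maximal_palindromes("I prefer pi"))  # {'ipreferpi'}
```

The running time is quadratic in the decompressed length N in the worst case, which is exactly the cost that becomes prohibitive when N is exponential in the compressed size.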
Goal of algorithms for compressed strings
• Process the compressed text without decompression.
• Processing time should be polynomial in n, where n is the size of the compressed text.
• Note that the decompressed size can be exponentially large with respect to n.
Compression schemes
• run-length encoding
• Lempel-Ziv
• grammar-based compression
• …
The resulting archive of most practical compression methods can be transformed into a Straight Line Program (SLP) generating the same original text [Rytter 2003].
Definition of Straight Line Program (SLP)
An SLP T is a sequence of assignments X_1 = expr_1; X_2 = expr_2; …; X_n = expr_n, where each X_k is a variable and each expr_k is either a single character a (a ∈ Σ) or a concatenation X_i X_j (i, j < k).
An SLP T for a string w is a CFG in Chomsky normal form such that L(T) = {w}.
Straight Line Program (SLP): example
SLP T:
X_1 = a
X_2 = b
X_3 = X_1 X_2
X_4 = X_3 X_1
X_5 = X_3 X_4
X_6 = X_5 X_5
X_7 = X_4 X_6
X_8 = X_7 X_5
(Figure: derivation tree of X_8, whose root has the children X_7 and X_5.)
n: the size of the SLP; |T| = N; N = O(2^n)
Efficient algorithms for compressed strings
• substring matching
  • Karpinski et al (1996): O(n^4 log n) time
  • Miyazaki et al (1997): O(n^4) time
  • Lifshits (2006): O(n^3) time
• minimum period
  • Karpinski et al (1996): O(n^4 log n) time
  • Lifshits (2006): O(n^3 log N) time
• all squares
  • Gasieniec et al (1994): O(n^6 log^5 N) time
Hardness results
• Subsequence pattern matching
  • Lifshits and Lohrey (2006): NP-hard
• Longest common subsequence
  • Lifshits and Lohrey (2006): NP-hard
• Hamming distance
  • Lifshits (2007): #P-complete
Is there any reasonable comparison measure for compressed strings?
String comparison measures
Measure | Uncompressed text | Compressed text
Hamming distance | O(N) | #P-complete [Lifshits 07]
Longest common subsequence | O(N^2 / log N) | NP-hard [Lifshits and Lohrey 06]
Longest common substring | O(N) | ?? (we solve this problem)
Our Result 1: Longest Common Substring
Problem: Given two SLPs that describe texts T and S respectively, compute LCStr(T, S), the length of the longest common substring of T and S.
Theorem: LCStr(T, S) can be computed in O(n^4 log n) time using O(n^3) space, where n is the total size of the input SLPs.
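For reference, LCStr on uncompressed strings can be sketched with a simple dynamic program. This is not the compressed algorithm of the theorem, and it is the textbook O(|S| · |T|) method rather than the linear-time suffix-tree approach. The name `lcstr_length` is an assumption made for illustration.

```python
def lcstr_length(s, t):
    """Length of the longest common substring of s and t (quadratic DP)."""
    best = 0
    # prev[j] = length of the longest common suffix of s[:i-1] and t[:j]
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


print(lcstr_length("abaababaababaababa", "babab"))  # 4
```

A baseline like this is useful for checking a compressed implementation on small inputs, where decompression is still affordable.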
Our Result 2: Palindromes
Problem: Given an SLP for a text T, compute (a compressed representation of) the set of all palindromes of T.
Theorem: The problem can be solved in O(n^4) time using O(n^2) space.
Previous best result: O(n^5 log^4 N) time
n: the size of the SLP; N: the length of the original text T (note that N = O(2^n)) [Gasieniec et al 1996]
Details of our algorithm
• Computing longest common substring
• Computing palindromes (omitted in this talk)
Property of common substrings (1/3)
For each common substring Z of strings S and T, there always exist variables X_i = X_l X_r and Y_j = Y_L Y_R such that:
• Z is a common substring of X_i and Y_j
• Z contains an overlap w between X_l and Y_R
(Figure: X_i = X_l X_r and Y_j = Y_L Y_R aligned on the common substring Z, which contains the overlap w.)
Property of common substrings (2/3)
For each common substring Z of strings S and T, there always exists a string w such that:
• w is a substring of Z
• w is an overlap of variables of S and T
(Figure: the overlap w between X_l and Y_R, with X_i = X_l X_r and Y_j = Y_L Y_R.)
Property of common substrings (3/3)
For each common substring Z of strings S and T, there always exists a string w such that:
• Z can be computed by extending w
(Figure: the overlap w is extended in both directions inside X_i and Y_j until it covers the common substring Z.)
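Overlaps (OL)
For any strings X and Y, OL(X, Y) is the set of the lengths of overlaps of X and Y, that is, the lengths k for which the length-k suffix of X equals the length-k prefix of Y.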
Overlaps: example
OL("aabaaba", "abaababb") = {1, 3, 6}
(Figure: X_l = aabaaba with three shifted copies of Y_R below it, one for each overlap length.)
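The example can be checked with a direct, uncompressed computation of the overlap set. This is a naive sketch on plain strings, not the SLP-based method of Karpinski et al; the function name `overlaps` is an assumption made for illustration.

```python
def overlaps(x, y):
    """Return OL(x, y): all k >= 1 such that the length-k suffix of x
    equals the length-k prefix of y."""
    limit = min(len(x), len(y))
    return {k for k in range(1, limit + 1) if x[len(x) - k:] == y[:k]}


print(sorted(overlaps("aabaaba", "abaababb")))  # [1, 3, 6]
```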
Computing overlaps [Karpinski et al 1996]
Lemma: For any variables X_i and X_j of an SLP T, OL(X_i, X_j) can be represented by O(n) arithmetic progressions.
Theorem: For any SLP T, OL(X_i, X_j) for all pairs of variables can be computed in a total of O(n^4 log n) time and O(n^3) space.
How to extend overlaps
(Figure: X_i = X_l X_r is drawn above Y_j = Y_L Y_R, aligned on the overlap aba ∈ OL(X_l, Y_R). Starting from the overlap, the neighbouring characters are compared one by one; each step is marked "match" until a "mismatch" is reached in each direction.)
We are not allowed to process character by character.
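The extension step shown in the figure can be sketched on uncompressed strings as follows. It deliberately does the character-by-character comparison that the compressed algorithm must avoid, so it only illustrates what is being computed. The names `common_prefix_len` and `extend_overlap`, and the sample strings, are assumptions made for this example.

```python
def common_prefix_len(u, v):
    """Number of leading characters on which u and v agree."""
    n = 0
    for a, b in zip(u, v):
        if a != b:
            break
        n += 1
    return n


def extend_overlap(xl, xr, yl, yr, k):
    """Extend an overlap of length k between xl and yr in both directions.

    Requires that the length-k suffix of xl equals the length-k prefix of yr.
    Returns the length of the resulting common substring of xl+xr and yl+yr.
    """
    assert xl[len(xl) - k:] == yr[:k], "k must be an overlap of xl and yr"
    # To the right, X continues with xr and Y continues with yr[k:].
    right = common_prefix_len(xr, yr[k:])
    # To the left, X ends with xl minus the overlap and Y ends with yl.
    left = common_prefix_len(xl[:len(xl) - k][::-1], yl[::-1])
    return left + k + right


# X = "xxab" + "cdyy", Y = "zza" + "bcdw"; the overlap "b" extends to "abcd".
print(extend_overlap("xxab", "cdyy", "zza", "bcdw", 1))  # 4
```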
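First-mismatch function [Karpinski et al 1996]
input: SLP variables X_i and Y_j, integer k
output: FM(X_i, Y_j, k) = min{p | X_i[k + p - 1] ≠ Y_j[p]} - 1, where p is the position of the first mismatch when Y_j is compared with X_i starting at position k
(Figure: Y_j aligned below X_i at position k, with the first mismatch marked.)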
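On uncompressed strings, the quantity returned by FM can be sketched as below. The indexing convention is an assumption: k is a 1-based position in x, the result is the number of characters that match before the first mismatch, and running past the end of either string is treated as a mismatch. The name `first_mismatch` is hypothetical, and the loop is the naive scan, not the O(n log n) SLP procedure.

```python
def first_mismatch(x, y, k):
    """Compare y against x starting at 1-based position k of x and
    return how many characters match before the first mismatch."""
    p = 0
    while k - 1 + p < len(x) and p < len(y) and x[k - 1 + p] == y[p]:
        p += 1
    return p


print(first_mismatch("abcabd", "abd", 1))  # 2
print(first_mismatch("abcabd", "abd", 4))  # 3
```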
First-mismatch function [Karpinski et al 1996]
Lemma: Provided that the sets of overlaps are already computed, FM(X_i, Y_j, k) can be computed in O(n log n) time.
Extending overlaps using the FM function
Lemma: Extending overlaps can be done by O(n) calls of the FM function.
Computing longest common substring: pseudo-code
• O(n^2) items
• O(n) calls of the FM function per item
• O(n log n) time per call
In total, LCStr(S, T) can be computed in O(n^2 × n × n log n) = O(n^4 log n) time.
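The overall structure (loop over pairs of variables, take each overlap, extend it) can be sketched in Python on expanded variables. This is an illustrative, decompressing version under stated assumptions, not the authors' pseudo-code: it uses naive overlap computation and character-wise extension instead of arithmetic progressions and FM, and it assumes every variable is reachable from the last one. The rule encoding and all function names are hypothetical. Length-0 overlaps are included to cover the case where the two variable boundaries coincide, and single characters are handled separately.

```python
def expand_all(slp):
    """Map each 1-based variable index to its derived string."""
    derived = {}
    for k, rule in enumerate(slp, start=1):
        derived[k] = rule if isinstance(rule, str) else derived[rule[0]] + derived[rule[1]]
    return derived


def overlaps(x, y):
    """Lengths k >= 1 where the length-k suffix of x is a prefix of y."""
    return {k for k in range(1, min(len(x), len(y)) + 1) if x[len(x) - k:] == y[:k]}


def common_prefix_len(u, v):
    n = 0
    for a, b in zip(u, v):
        if a != b:
            break
        n += 1
    return n


def extend(xl, xr, yl, yr, k):
    """Extend an overlap of length k between xl and yr to the left and right."""
    right = common_prefix_len(xr, yr[k:])
    left = common_prefix_len(xl[:len(xl) - k][::-1], yl[::-1])
    return left + k + right


def lcstr_via_overlaps(slp_s, slp_t):
    xs, ys = expand_all(slp_s), expand_all(slp_t)
    s, t = xs[len(slp_s)], ys[len(slp_t)]
    # A common substring of length 1 crosses no variable boundary.
    best = 1 if set(s) & set(t) else 0
    for rx in slp_s:
        if isinstance(rx, str):
            continue
        xl, xr = xs[rx[0]], xs[rx[1]]
        for ry in slp_t:
            if isinstance(ry, str):
                continue
            yl, yr = ys[ry[0]], ys[ry[1]]
            # Suffix of X_l meets prefix of Y_R (k = 0: boundaries coincide).
            for k in overlaps(xl, yr) | {0}:
                best = max(best, extend(xl, xr, yl, yr, k))
            # Symmetric case: suffix of Y_L meets prefix of X_r.
            for k in overlaps(yl, xr):
                best = max(best, extend(yl, yr, xl, xr, k))
    return best


S_SLP = ["a", "b", (1, 2), (3, 1), (3, 4), (5, 5), (4, 6), (7, 5)]  # abaababaababaababa
T_SLP = ["a", "b", (2, 1), (3, 3), (4, 2)]                          # babab
print(lcstr_via_overlaps(S_SLP, T_SLP))  # 4
```

The two nested loops over rules correspond to the O(n^2) items in the analysis above; the compressed algorithm replaces the inner string operations with overlap sets and FM calls.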
Conclusions
• Computing longest common substring from compressed strings: O(n^4 log n) time and O(n^3) space
• Computing all palindromes from compressed strings: O(n^4) time and O(n^2) space