Compressed Suffix Arrays based on Run-Length Encoding
Veli Mäkinen, Bielefeld University
Gonzalo Navarro, University of Chile

Abstract
• We introduce a new full-text index that occupies O(H_k|T|) bits and supports counting queries in O(|P|) time.
  - optimal space / search time on a constant-size alphabet
  - works on any alphabet size σ, adding a log σ factor to the space/time bounds

Introduction
• We consider exact string matching on static text.
• The task is to construct an index for the text such that the occurrences of a given pattern can be found efficiently.
• A well-known optimal solution exists: build a suffix tree over the text.

Introduction...
• The suffix-tree-based solution takes O(|T| log |T|) bits of space.
• The text itself can be represented in O(|T| log σ) bits.
  - or even less space if the text is compressible
• In many applications the space usage is the real bottleneck, not the search efficiency.

Introduction...
• During the last 15 years, many practical / theoretical solutions with reduced space complexities have been proposed.
• The work can roughly be divided into three categories:
  (1) Reducing constant factors
  (2) Concrete optimization
  (3) Abstract optimization

Reducing constant factors
• Suffix arrays (Manber & Myers 1990)
• Suffix cactuses (Kärkkäinen 1995)
• Sparse suffix trees (Kärkkäinen & Ukkonen 1996)
• Space-efficient suffix trees (Kurtz 1998)
• Enhanced suffix arrays (Abouelhoda & Ohlebusch & Kurtz 2002)

Concrete optimization
• "Minimizing automata"
• DAWGs (Blumer & Blumer & Haussler & McConnell & Ehrenfeucht 1983)
• Compact DAWGs (Crochemore & Vérin 1997)
• Compact suffix arrays (Mäkinen 2000)

Abstract optimization
• Objective: Use as little space as possible to support the functionality of a given abstract definition of a data structure.
• Space is measured in bits and usually given proportional to the entropy of the text.

Abstract optimization: Example
• A full-text index for a given text T supports the following operations:
  - Exists(P): is P a substring of T?
  - Count(P): how many times does P occur in T?
  - Report(P): list the occurrences of P in T.

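To make the three operations concrete, here is a naive Python reference that scans the text directly instead of using an index. The function names mirror the slide; the 1-based positions and the linear scan are assumptions made for illustration only and are not the index described in this talk.

```python
def report(T, P):
    """1-based start positions of all (possibly overlapping) occurrences of P in T."""
    return [i + 1 for i in range(len(T) - len(P) + 1) if T.startswith(P, i)]


def count(T, P):
    """Number of occurrences of P in T."""
    return len(report(T, P))


def exists(T, P):
    """True if P is a substring of T."""
    return count(T, P) > 0


print(exists("kalevala", "eva"), count("kalevala", "al"), report("kalevala", "a"))
```

A full-text index has to answer the same queries without rescanning T for every pattern.
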
Abstract optimization...
• Seminal work by Jacobson 1989: rank-select queries on bit-vectors.
• Rank-select-type structures for suffix trees (Munro & Raman & Rao & Clark 1996-)
• Lempel-Ziv index (Kärkkäinen & Ukkonen 1996)

Abstract optimization...
• Compressed suffix arrays (Grossi & Vitter 2000, Sadakane 2000, 2002)
• FM-index (Ferragina & Manzini 2000)
• LZ-self-index (Navarro 2002)
• Space-optimal full-text indexes (Grossi & Gupta & Vitter 2003, 2004)
• Alphabet-friendly FM-index (Ferragina & Manzini & Mäkinen & Navarro)
• See also ISAAC'04, SODA'05,...

This talk
• We show that combining the FM-index with the compact suffix array gives a practical full-text index with a good space / search time tradeoff.
• Our structure, the Run-Length FM-index, uses O(min(|T|(H_k log σ + 1), |T| log σ)) bits and supports Count(P) in O(|P| log σ) time.

This talk...
• H_k = H_k(T) is the order-k empirical entropy of T, i.e., "the average number of bits needed to encode a symbol using a fixed codebook for each possible combination of k previous symbols".
• There holds 0 ≤ H_k ≤ H_{k-1} ≤ ... ≤ H_0 ≤ log σ.

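The entropy definition above can be checked numerically. The sketch below assumes the usual formulation H_k(T) = (1/|T|) Σ_w |T_w| H_0(T_w), where T_w collects the symbols that follow each occurrence of the length-k context w in T; the function names `h0` and `hk` are illustrative.

```python
import math
from collections import Counter, defaultdict


def h0(s):
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum(cnt / n * math.log2(n / cnt) for cnt in Counter(s).values())


def hk(t, k):
    """Order-k empirical entropy of t, in bits per symbol."""
    if k == 0:
        return h0(t)
    followers = defaultdict(list)          # context w -> symbols that follow w
    for i in range(k, len(t)):
        followers[t[i - k:i]].append(t[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(t)


for k in range(3):
    print(k, hk("kalevala", k))
```

For "abab" this gives H_0 = 1 bit but H_1 = 0, since each symbol is fully determined by its predecessor.
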
FM-index
• Let us first describe a simple variant of the FM-index that:
  - occupies O(|T| log σ) bits, and
  - supports counting queries in O(|P| log σ) time.

Simple FM-index
• Construct the Burrows-Wheeler-transformed text bwt(T) [BW94].
• From bwt(T) it is possible to construct the suffix array sa(T) of T in linear time.
• Instead of constructing the whole sa(T), one can add small data structures besides bwt(T) to simulate a search from sa(T).

Burrows-Wheeler transformation
• Construct a matrix M that contains as rows all rotations of T.
• Sort the rows in lexicographic order.
• Let L be the last column and F be the first column.
• bwt(T) = L, associated with the row number of T in the sorted M.

Example
pos  123456789
T =  kalevala#

row  sa  M (F = first column, L = last column)
1:   9   #kalevala
2:   8   a#kaleval
3:   6   ala#kalev
4:   2   alevala#k
5:   4   evala#kal
6:   1   kalevala#
7:   7   la#kaleva
8:   3   levala#ka
9:   5   vala#kale

L = alvkl#aae, row 6
Exercise: Given L and the row number, how to compute T and sa(T)?

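The construction in this example can be reproduced with a direct, quadratic-space sketch that sorts all rotations. It assumes the terminator `#` sorts before every letter (true for ASCII); the function name `bwt` and its return shape are illustrative.

```python
def bwt(T):
    """Return (L, row, sa) for T by sorting all rotations.

    L   : last column of the sorted rotation matrix M
    row : 1-based row of M that holds T itself
    sa  : 1-based start position of the rotation in each row
    """
    n = len(T)
    # 0-based start positions of the rotations, in lexicographic order
    order = sorted(range(n), key=lambda s: T[s:] + T[:s])
    # The rotation starting at s ends with T[s-1]; T[-1] wraps around for s = 0
    L = "".join(T[s - 1] for s in order)
    sa = [s + 1 for s in order]
    row = order.index(0) + 1
    return L, row, sa


L, row, sa = bwt("kalevala#")
print(L, row, sa)
```

Sorting full rotations is only meant to mirror the slide; practical constructions build the suffix array first.
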
Sorting L gives F; LF[i] is the row of F that holds the character L[i].
i       1 2 3 4 5 6 7 8 9
L       a l v k l # a a e
F       # a a a e k l l v
LF[i]   2 7 9 6 8 1 3 4 5
sa(T)   9 8 6 2 4 1 7 3 5
T^-1 =  # a l a v e l a k   (read L[6], L[LF[6]], L[LF[LF[6]]], ...)

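One way to answer the exercise is sketched below: compute LF with a stable sort of L, then walk it starting from the row of T. The function names are illustrative, and the plain Python lists are an assumption standing in for the succinct structures discussed later.

```python
def lf_mapping(L):
    """LF as a list: lf[i-1] is the 1-based row of F that holds the character L[i]."""
    order = sorted(range(len(L)), key=lambda i: L[i])   # sorted() is stable
    lf = [0] * len(L)
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row + 1
    return lf


def invert_bwt(L, row):
    """Recover T and sa(T) from L = bwt(T) and the 1-based row of T in M."""
    n = len(L)
    lf = lf_mapping(L)
    reversed_text = []
    sa = [0] * n
    pos = 1                       # the rotation in `row` starts at text position 1
    for _ in range(n):
        sa[row - 1] = pos
        reversed_text.append(L[row - 1])
        row = lf[row - 1]         # row of the rotation starting one position earlier
        pos = n if pos == 1 else pos - 1
    return "".join(reversed(reversed_text)), sa


print(invert_bwt("alvkl#aae", 6))
```

The walk reads T backwards, which is why the slide shows T^-1.
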
Implicit LF[i]
• Ferragina and Manzini (2000) noticed the following connection:
  LF[i] = C_T[L[i]] + rank_{L[i]}(L,i)
• Here
  - C_T[c]: number of letters 0,1,...,c-1 in L = bwt(T)
  - rank_c(L,i): number of letters c in the prefix L[1,i]

Rank/Select
i              1 2 3 4 5 6 7 8 9 10 11 12
L              0 0 1 0 0 1 0 0 1 1  0  1
rank_1(L,i)    0 0 1 1 1 2 2 2 3 4  4  5
select_1(L,j)  3 6 9 10 12   (j = 1,...,5)

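The two queries in this example can be written as naive linear scans. This is only a sketch of their semantics with 1-based positions; the constant-time o(n)-bit dictionaries the talk relies on are not reproduced here, and the function names are illustrative.

```python
def rank(seq, c, i):
    """Number of occurrences of c in the prefix seq[1..i] (1-based, inclusive)."""
    return seq[:i].count(c)


def select(seq, c, j):
    """1-based position of the j-th occurrence of c in seq."""
    seen = 0
    for pos, x in enumerate(seq, start=1):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("seq contains fewer than j occurrences of c")


bits = "001001001101"
print([rank(bits, "1", i) for i in range(1, 13)])
print([select(bits, "1", j) for j in range(1, 6)])
```

Note that rank and select are inverses in the sense that rank(seq, c, select(seq, c, j)) = j.
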
Example with L = alvkl#aae and F = #aaaekllv:
L[7] = a, C_T[a] = 1 and rank_a(L,7) = 2, so
LF[7] = C_T[a] + rank_a(L,7) = 1 + 2 = 3

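The identity can be checked on L = alvkl#aae with a short sketch. Here `build_C` and `lf` are illustrative names, C_T is held in a plain dictionary, and rank is a naive scan rather than a constant-time structure.

```python
def build_C(L):
    """C[c] = number of characters in L that are smaller than c."""
    C, total = {}, 0
    for ch in sorted(set(L)):
        C[ch] = total
        total += L.count(ch)
    return C


def lf(L, C, i):
    """LF[i] = C[L[i]] + rank_{L[i]}(L, i), with 1-based i."""
    c = L[i - 1]
    return C[c] + L[:i].count(c)


L = "alvkl#aae"
C = build_C(L)
print([lf(L, C, i) for i in range(1, len(L) + 1)])
```

No explicit LF array is stored: only C_T and a rank structure over L are needed.
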
Backward search on bwt(T)
• Observation: If [i,j] is the range of rows of M that start with string X, then the range [i',j'] of rows that start with cX can be computed as
  i' := C_T[c] + rank_c(L,i-1) + 1,
  j' := C_T[c] + rank_c(L,j).

Backward search on bwt(T)...
Example with L = alvkl#aae: X = a occupies rows [i,j] = [2,4] of M. Which rows start with vX = va?
  C_T[v] = 8, rank_v(L,i-1) = 0, rank_v(L,j) = 1
  i' := 8 + 0 + 1 = 9
  j' := 8 + 1 = 9
Row 9 (vala#kale) is the only row that starts with va.

Backward search on bwt(T)...
Algorithm Count(P[1,m], L[1,n], C_T[1,σ])
  c = P[m]; k = m;
  i = C_T[c]+1; j = C_T[c+1];
  while (i ≤ j and k > 1) do begin
    c = P[k-1]; k = k-1;
    i = C_T[c]+rank_c(L,i-1)+1;
    j = C_T[c]+rank_c(L,j);
  end;
  if (j < i) then return 0 else return (j-i+1);

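The Count algorithm translates almost line by line into Python. The sketch below assumes a non-empty pattern, uses a naive scan for rank_c, and replaces C_T[c+1] by C_T[c] plus the number of occurrences of c, which is the same value; the early return for characters that do not occur in L is an added guard, not part of the slide.

```python
def count(P, L):
    """Number of occurrences of P in T, given only L = bwt(T). Assumes P is non-empty."""
    C, total = {}, 0
    for ch in sorted(set(L)):
        C[ch] = total                 # C[ch] = number of characters in L smaller than ch
        total += L.count(ch)

    def rank(c, i):
        return L[:i].count(c)         # occurrences of c in L[1..i]

    c = P[-1]
    if c not in C:
        return 0
    i, j = C[c] + 1, C[c] + L.count(c)
    k = len(P)
    while i <= j and k > 1:
        c = P[k - 2]                  # P[k-1] in the 1-based notation of the slide
        k -= 1
        if c not in C:
            return 0
        i = C[c] + rank(c, i - 1) + 1
        j = C[c] + rank(c, j)
    return 0 if j < i else j - i + 1


L = "alvkl#aae"                       # bwt("kalevala#")
print(count("a", L), count("al", L), count("va", L), count("ak", L))
```

The pattern is consumed from its last character to its first, hence the name backward search.
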
Backward search on bwt(T)...
• Array C_T[1,σ] takes O(σ log |T|) bits.
• L = bwt(T) takes O(|T| log σ) bits.
• Assuming rank_c(L,i) can be computed in constant time for each (c,i), the algorithm takes O(|P|) time to count the occurrences of P in T.

Answering rank_c(L,i)
• The wavelet tree (GGV 2003) is a data structure replacing L = bwt(T):
  - supports rank_c(L,i) in O(log σ) time, and
  - occupies |T|H_0(T) + o(|T|) bits.
• The generalized wavelet tree (FMMN 2004) improves the query time to constant when σ = O(polylog(|T|)).

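A pointer-based wavelet tree conveys how rank_c is answered by descending one level per alphabet split. The sketch below is a simplification: it uses a balanced split of the alphabet and stores prefix counts in plain Python lists, so it shows the O(log σ) query path but not the |T|H_0(T) + o(|T|) space bound. The class and attribute names are illustrative.

```python
class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.ones = None                      # set on internal nodes only
        self.left = self.right = None
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            self.low = set(self.alphabet[:mid])
            # ones[i] = number of symbols among the first i that go to the right child
            self.ones = [0]
            for ch in seq:
                self.ones.append(self.ones[-1] + (ch not in self.low))
            self.left = WaveletTree([ch for ch in seq if ch in self.low],
                                    self.alphabet[:mid])
            self.right = WaveletTree([ch for ch in seq if ch not in self.low],
                                     self.alphabet[mid:])

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols."""
        node = self
        while node.ones is not None:          # descend until a leaf is reached
            if c in node.low:
                i -= node.ones[i]             # zeros in the prefix: position in left child
                node = node.left
            else:
                i = node.ones[i]              # ones in the prefix: position in right child
                node = node.right
        return i if node.alphabet == [c] else 0


L = "alvkl#aae"
wt = WaveletTree(L)
print(wt.rank("a", 7), wt.rank("v", 4))
```

Each level maps the prefix length i to the corresponding prefix length in one child, which is exactly a binary rank on that node's bit-vector.
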
Simple FM-index...
• We obtained a structure that
  - occupies O(|T|H_0(T)) bits, and
  - supports counting queries in O(|P| log σ) time.
• The original FM-index takes O(H_k|T|) bits, but only on a constant-size alphabet.
• Compression boosting can be applied to improve the simple FM-index to take only O(|T|H_k(T)) bits (FMMN 2004).

To partition or not...
• All alphabet-friendly solutions obtaining O(|T|H_k(T)) space for compressed suffix arrays use optimal partitioning of the BWT text, and explicitly store the distribution for each piece.
  - always σ^(k+1) overhead
• MTF + zeroth-order coding takes O(|T|H_k(T)) + O(σ^k) bits, but supporting queries on larger alphabets is non-trivial.

Run-Length FM-index
• We make the following changes to the previous FM-index variant:
  - L = bwt(T) is replaced by a sequence S[1,n'] and two bit-vectors B[1,|T|] and B'[1,|T|],
  - the cumulative array C_T[1,σ] is replaced by C_S[1,σ],
  - the wavelet tree is built on S, and
  - some formulas are changed.

Run-Length FM-index...
L    c c c a a g g a t t
B    1 0 0 1 0 1 0 1 1 0
S    c a g a t
F    a a a c c c g g t t
B'   1 0 1 1 0 0 1 0 1 0
B marks the first position of each run of equal letters in L, S holds one letter per run, and B' marks the run starts after the runs have been stably sorted by letter.

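The example structures can be derived from L with a few lines of Python. This is a sketch with plain lists and a dictionary for C_S; the function name `run_length_structures` is illustrative and the succinct encodings of B, B' and S are not modelled.

```python
from itertools import groupby


def run_length_structures(L):
    """Return (S, B, B_prime, C_S) for the run-length encoding of L."""
    runs = [(ch, len(list(group))) for ch, group in groupby(L)]
    S = "".join(ch for ch, _ in runs)

    def mark(run_list):
        # A 1 at the first position of every run, 0 elsewhere
        return [bit for _, length in run_list for bit in [1] + [0] * (length - 1)]

    B = mark(runs)
    B_prime = mark(sorted(runs, key=lambda r: r[0]))    # stable sort by letter
    C_S, total = {}, 0
    for ch in sorted(set(S)):
        C_S[ch] = total                                  # runs with a smaller letter
        total += S.count(ch)
    return S, B, B_prime, C_S


S, B, B_prime, C_S = run_length_structures("cccaaggatt")
print(S, B, B_prime, C_S)
```

C_S plays the role of C_T but counts runs instead of characters.
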
Changes to formulas
• Recall that we need to compute C_T[c] + rank_c(L,i) in the backward search.
• Theorem: C_T[c] + rank_c(L,i) is equivalent to
  select_1(B', C_S[c] + 1 + rank_c(S, rank_1(B,i))) - 1, when L[i] ≠ c,
  and otherwise to
  select_1(B', C_S[c] + rank_c(S, rank_1(B,i))) + i - select_1(B, rank_1(B,i)).

Example, L[i] = c
L    c c c a a g g a t t
B    1 0 0 1 0 1 0 1 1 0
S    c a g a t
F    a a a c c c g g t t
B'   1 0 1 1 0 0 1 0 1 0

LF[8] = select_1(B', C_S[a] + rank_a(S, rank_1(B,8))) + 8 - select_1(B, rank_1(B,8))
      = select_1(B', 0 + rank_a(S,4)) + 8 - select_1(B,4)
      = select_1(B', 0 + 2) + 8 - 8
      = 3

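Both cases of the theorem can be evaluated with naive rank/select and compared against the direct value C_T[c] + rank_c(L,i). The sketch below makes three assumptions that are not on the slides: L[i] is read as S[rank_1(B,i)], i = 0 is treated as the case L[i] ≠ c, and select_1 returns |T|+1 when asked for one more 1-bit than B' contains. All function names are illustrative.

```python
from itertools import groupby


def build(L):
    runs = [(ch, len(list(g))) for ch, g in groupby(L)]
    S = "".join(ch for ch, _ in runs)
    mark = lambda rs: [b for _, n in rs for b in [1] + [0] * (n - 1)]
    C_S, total = {}, 0
    for ch in sorted(set(S)):
        C_S[ch] = total
        total += S.count(ch)
    return S, mark(runs), mark(sorted(runs, key=lambda r: r[0])), C_S


def rank(seq, c, i):
    return seq[:i].count(c)


def select1(bits, j):
    ones = [pos for pos, b in enumerate(bits, start=1) if b == 1]
    return ones[j - 1] if j <= len(ones) else len(bits) + 1   # sentinel (assumption)


def c_plus_rank(c, i, S, B, B_prime, C_S):
    """C_T[c] + rank_c(L, i), computed from S, B, B' and C_S only."""
    r = rank(B, 1, i)                                  # runs that start in L[1..i]
    if r == 0 or S[r - 1] != c:                        # case L[i] != c
        return select1(B_prime, C_S[c] + 1 + rank(S, c, r)) - 1
    return select1(B_prime, C_S[c] + rank(S, c, r)) + i - select1(B, r)


L = "cccaaggatt"
S, B, B_prime, C_S = build(L)
print(c_plus_rank("a", 8, S, B, B_prime, C_S))
```

The function never touches L itself, which is the point of the replacement: only S needs a wavelet tree.
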
Space requirement
• C_S[1,σ] takes O(σ log |T|) bits.
• B and B' with rank/select dictionaries take 2|T| + o(|T|) bits.
• S represented using a wavelet tree occupies |S|H_0(S) + o(|S|) bits.
• In CPM 2004, we have shown that |S| ≤ H_k|T| + σ^k.

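Summing the four items gives a rough total. The derivation below is a sketch that assumes the last bullet reads |S| ≤ H_k|T| + σ^k and uses only H_0(S) ≤ log σ and |S| ≤ |T|; it is not taken from the slides.

```latex
\begin{align*}
\text{space}
  &= \underbrace{O(\sigma \log |T|)}_{C_S}
   + \underbrace{2|T| + o(|T|)}_{B,\ B'}
   + \underbrace{|S|\,H_0(S) + o(|S|)}_{S} \\
|S|\,H_0(S)
  &\le |S| \log \sigma
   \le \bigl(H_k |T| + \sigma^k\bigr) \log \sigma \\
\text{space}
  &\le |T|\,\bigl(H_k \log \sigma + 2\bigr)
   + \sigma^k \log \sigma + O(\sigma \log |T|) + o(|T|)
\end{align*}
```

This matches the O(|T|(H_k log σ + 1)) bound quoted earlier whenever the σ-dependent terms are themselves O(|T|).
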
Comparison
[figure]