Compressed Suffix Arrays based on Run-Length Encoding

Veli Mäkinen (Bielefeld University) and Gonzalo Navarro (University of Chile)

Abstract
• We introduce a new full-text index that occupies O(H_k |T|) bits and supports counting queries in O(|P|) time.
  - optimal space / search time on a constant-size alphabet
  - works on any alphabet size σ, at the cost of a log σ factor in the space/time bounds.

Introduction
• We consider exact string matching on static text.
• The task is to construct an index for the text such that the occurrences of a given pattern can be found efficiently.
• A well-known optimal solution exists: build a suffix tree over the text.

Introduction...
• The suffix-tree-based solution takes O(|T| log |T|) bits of space.
• The text itself can be represented in O(|T| log σ) bits, where σ is the alphabet size.
  - or in even less space if the text is compressible.
• In many applications the space usage is the real bottleneck, not the search efficiency.

Introduction...
• During the last 15 years, many practical / theoretical solutions with reduced space complexities have been proposed.
• The work can roughly be divided into three categories:
  (1) Reducing constant factors
  (2) Concrete optimization
  (3) Abstract optimization

Reducing constant factors
• Suffix arrays (Manber & Myers 1990)
• Suffix cactuses (Kärkkäinen 1995)
• Sparse suffix trees (Kärkkäinen & Ukkonen 1996)
• Space-efficient suffix trees (Kurtz 1998)
• Enhanced suffix arrays (Abouelhoda & Ohlebusch & Kurtz 2002)

Concrete optimization
• “Minimizing automata”
• DAWGs (Blumer & Blumer & Haussler & McConnell & Ehrenfeucht 1983)
• Compact DAWGs (Crochemore & Vérin 1997)
• Compact suffix arrays (Mäkinen 2000)

Abstract optimization
• Objective: use as little space as possible to support the functionality of a given abstract definition of a data structure.
• Space is measured in bits and usually given proportional to the entropy of the text.

Abstract optimization: Example
• A full-text index for a given text T supports the following operations:
  - Exists(P): is P a substring of T?
  - Count(P): how many times does P occur in T?
  - Report(P): list the occurrences of P in T.

Abstract optimization...
• Seminal work by Jacobson 1989: rank-select queries on bit-vectors.
• Rank-select-type structures for suffix trees (Munro & Raman & Rao & Clark 1996-)
• Lempel-Ziv index (Kärkkäinen & Ukkonen 1996)

Abstract optimization...
• Compressed suffix arrays (Grossi & Vitter 2000, Sadakane 2000, 2002)
• FM-index (Ferragina & Manzini 2000)
• LZ-self-index (Navarro 2002)
• Space-optimal full-text indexes (Grossi & Gupta & Vitter 2003, 2004)
• Alphabet-friendly FM-index (Ferragina & Manzini & Mäkinen & Navarro 2004)
• See also ISAAC'04, SODA'05, ...

This talk
• We show that combining the FM-index with the compact suffix array gives a practical full-text index with a good space / search time tradeoff.
• Our structure, the Run-Length FM-index, uses O(min(|T|(H_k log σ + 1), |T| log σ)) bits and supports Count(P) in O(|P| log σ) time.

This talk...
• H_k = H_k(T) is the order-k empirical entropy of T, i.e., “the average number of bits needed to encode a symbol using a fixed codebook for each possible combination of k previous symbols”.
• It holds that 0 ≤ H_k ≤ H_{k-1} ≤ ... ≤ H_0 ≤ log σ.
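The definition of H_k can be made concrete with a small sketch. It assumes the common formulation in which the symbols following each length-k context are coded with their own zeroth-order model; the function names `h0` and `hk` are illustrative and not taken from the talk.

```python
from collections import Counter, defaultdict
from math import log2


def h0(symbols):
    """Zeroth-order empirical entropy of a sequence, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    return sum(cnt / n * log2(n / cnt) for cnt in Counter(symbols).values())


def hk(text, k):
    """Order-k empirical entropy of text, in bits per symbol.

    Each symbol is grouped by the k symbols preceding it; the groups are
    coded independently with H_0 and the total is normalised by |T|.
    """
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])
    return sum(len(group) * h0(group) for group in followers.values()) / len(text)
```

For a text such as "abababab", `h0` gives 1 bit per symbol while `hk(text, 1)` gives 0, because every symbol is fully determined by its predecessor.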

FM-index
• Let us first describe a simple variant of the FM-index that:
  - occupies O(|T| log σ) bits, and
  - supports counting queries in O(|P| log σ) time.

Simple FM-index
• Construct the Burrows-Wheeler-transformed text bwt(T) [BW94].
• From bwt(T) it is possible to construct the suffix array sa(T) of T in linear time.
• Instead of constructing the whole sa(T), one can add small data structures besides bwt(T) to simulate a search from sa(T).

Burrows-Wheeler transformation
• Construct a matrix M that contains as rows all rotations of T.
• Sort the rows in lexicographic order.
• Let L be the last column and F be the first column.
• bwt(T) = L, associated with the row number of T in the sorted M.

Example
pos 123456789
T = kalevala#

row  sa  M (F = first column, L = last column)
1:   9   #kalevala
2:   8   a#kaleval
3:   6   ala#kalev
4:   2   alevala#k
5:   4   evala#kal
6:   1   kalevala#
7:   7   la#kaleva
8:   3   levala#ka
9:   5   vala#kale

L = alvkl#aae, row 6
==> Exercise: Given L and the row number, how to compute T and sa(T)?
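The construction in the example can be reproduced with a direct sketch that sorts all rotations explicitly. This is the textbook definition, not an efficient construction (sorting rotations this way is quadratic or worse in |T|), and the function name `bwt_by_rotations` is only illustrative. It assumes the terminator '#' sorts before all other symbols, which holds for the ASCII letters used here.

```python
def bwt_by_rotations(text):
    """Return (L, sa, row) for text, using 1-based positions as on the slide.

    L   : last column of the sorted rotation matrix M
    sa  : starting position in text of the rotation in each row
    row : row number of text itself in the sorted M
    """
    n = len(text)
    # Sort rotation start offsets (0-based) by the rotation they denote.
    order = sorted(range(n), key=lambda start: text[start:] + text[:start])
    # The last symbol of a rotation is the symbol cyclically preceding its start.
    last = "".join(text[(start - 1) % n] for start in order)
    suffix_array = [start + 1 for start in order]
    row_of_text = order.index(0) + 1
    return last, suffix_array, row_of_text


last, sa, row = bwt_by_rotations("kalevala#")
print(last, sa, row)  # alvkl#aae [9, 8, 6, 2, 4, 1, 7, 3, 5] 6
```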

LF-mapping for L = bwt(T) = alvkl#aae (F is L sorted):

i      1 2 3 4 5 6 7 8 9
L[i]   a l v k l # a a e
F[i]   # a a a e k l l v
LF[i]  2 7 9 6 8 1 3 4 5
sa(T)  9 8 6 2 4 1 7 3 5

T^-1 = # a l a v e l a k (the reversed text, read off by following LF starting from row 6)

Implicit LF[i]
• Ferragina and Manzini (2000) noticed the following connection:
  LF[i] = C_T[L[i]] + rank_{L[i]}(L,i)
• Here
  - C_T[c]: the number of occurrences of the letters 0,1,...,c-1 in L = bwt(T),
  - rank_c(L,i): the number of occurrences of the letter c in the prefix L[1,i].

Rank/Select

i             1 2 3 4 5 6 7 8 9 10 11 12
L             0 0 1 0 0 1 0 0 1 1  0  1
rank_1(L,i)   0 0 1 1 1 2 2 2 3 4  4  5

j             1 2 3 4  5
select_1(L,j) 3 6 9 10 12
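The two queries in this figure can be sketched with plain precomputed tables. This is deliberately not succinct: it stores O(n log n) bits, whereas the rank/select dictionaries referred to in the talk need only o(n) bits on top of the bit-vector. The class name `BitVector` is an assumption made for illustration.

```python
class BitVector:
    """Non-succinct rank/select over a string of '0'/'1'; positions are 1-based."""

    def __init__(self, bits):
        self.prefix_ones = [0]   # prefix_ones[i] = number of 1s in bits[1..i]
        self.one_positions = []  # one_positions[j-1] = position of the j-th 1
        for pos, bit in enumerate(bits, start=1):
            is_one = bit == "1"
            self.prefix_ones.append(self.prefix_ones[-1] + is_one)
            if is_one:
                self.one_positions.append(pos)

    def rank1(self, i):
        """Number of 1s in the prefix bits[1..i]."""
        return self.prefix_ones[i]

    def select1(self, j):
        """Position of the j-th 1."""
        return self.one_positions[j - 1]


bv = BitVector("001001001101")
print([bv.rank1(i) for i in range(1, 13)])   # [0, 0, 1, 1, 1, 2, 2, 2, 3, 4, 4, 5]
print([bv.select1(j) for j in range(1, 6)])  # [3, 6, 9, 10, 12]
```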

Example of the implicit LF-mapping, with L = alvkl#aae:
LF[7] = C_T[a] + rank_a(L,7) = 1 + 2 = 3
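One way to answer the exercise (recover T and sa(T) from L and the row number) is to follow the LF-mapping from the row of T. The sketch below computes LF with a single left-to-right scan instead of rank structures; the names `lf_mapping` and `invert_bwt` are illustrative, and the symbol order is assumed to be the Python character order.

```python
from collections import Counter


def lf_mapping(last):
    """Return [LF[1], ..., LF[n]] with LF[i] = C_T[L[i]] + rank_{L[i]}(L, i)."""
    counts = Counter(last)
    c_table, total = {}, 0
    for ch in sorted(counts):        # C_T[ch] = number of symbols smaller than ch
        c_table[ch] = total
        total += counts[ch]
    seen = Counter()                 # running rank of each symbol
    lf = []
    for ch in last:
        seen[ch] += 1
        lf.append(c_table[ch] + seen[ch])
    return lf


def invert_bwt(last, row_of_text):
    """Recover (T, sa(T)) from L and the row of T in the sorted matrix."""
    n = len(last)
    lf = lf_mapping(last)
    text = [""] * n
    sa = [0] * n
    row, start = row_of_text, 1      # the row of T holds the rotation starting at 1
    for _ in range(n):
        sa[row - 1] = start
        start = start - 1 if start > 1 else n   # step one position back, cyclically
        text[start - 1] = last[row - 1]         # L[row] precedes the row's rotation
        row = lf[row - 1]
    return "".join(text), sa


print(lf_mapping("alvkl#aae"))     # [2, 7, 9, 6, 8, 1, 3, 4, 5]
print(invert_bwt("alvkl#aae", 6))  # ('kalevala#', [9, 8, 6, 2, 4, 1, 7, 3, 5])
```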

Backward search on bwt(T)
• Observation: If [i,j] is the range of rows of M that start with string X, then the range [i',j'] of rows that start with cX can be computed as
  i' := C_T[c] + rank_c(L,i-1) + 1,
  j' := C_T[c] + rank_c(L,j).

Backward search on bwt(T)...
Example with L = alvkl#aae: extend X = a, whose rows form the range [i,j] = [2,4], to vX = va.
  rank_v(L,i-1) = 0, rank_v(L,j) = 1, C_T[v] = 8
  i' := 8 + 0 + 1 = 9
  j' := 8 + 1 = 9

Backward search on bwt(T)...
Algorithm Count(P[1,m], L[1,n], C_T[1,σ])
  c = P[m]; k = m;
  i = C_T[c]+1; j = C_T[c+1];
  while (i ≤ j and k > 1) do begin
    c = P[k-1]; k = k-1;
    i = C_T[c]+rank_c(L,i-1)+1;
    j = C_T[c]+rank_c(L,j);
  end;
  if (j < i) then return 0 else return (j-i+1);
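The Count algorithm can be tried out with a short sketch. Two simplifications are assumptions of this sketch rather than part of the talk: rank is computed by scanning L (linear time per query instead of constant or O(log σ)), and the search starts from the full row range [1,n], which yields the same ranges as initialising with the last pattern symbol. The function name `count_occurrences` is illustrative.

```python
from collections import Counter


def count_occurrences(pattern, last):
    """Count occurrences of pattern in T, given only L = bwt(T)."""
    counts = Counter(last)
    c_table, total = {}, 0
    for ch in sorted(counts):            # C_T[ch] = number of symbols smaller than ch
        c_table[ch] = total
        total += counts[ch]

    def rank(ch, i):
        """Occurrences of ch in L[1..i], by a naive scan."""
        return last[:i].count(ch)

    lo, hi = 1, len(last)                # range of rows prefixed by the empty string
    for ch in reversed(pattern):         # extend the match one symbol to the left
        if ch not in c_table:
            return 0
        lo = c_table[ch] + rank(ch, lo - 1) + 1
        hi = c_table[ch] + rank(ch, hi)
        if hi < lo:
            return 0
    return hi - lo + 1


print(count_occurrences("al", "alvkl#aae"))  # 2
print(count_occurrences("va", "alvkl#aae"))  # 1
```

Both calls search T = kalevala# through its transform; the second one reproduces the step from X = a to vX = va.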

Backward search on bwt(T)...
• Array C_T[1,σ] takes O(σ log |T|) bits.
• L = bwt(T) takes O(|T| log σ) bits.
• Assuming rank_c(L,i) can be computed in constant time for each (c,i), the algorithm takes O(|P|) time to count the occurrences of P in T.

Answering rank_c(L,i)
• The wavelet tree (GGV 2003) is a data structure replacing L = bwt(T):
  - supports rank_c(L,i) in O(log σ) time, and
  - occupies |T| H_0(T) + o(|T|) bits.
• The generalized wavelet tree (FMMN 2004) improves the query time to constant when σ = O(polylog(|T|)).
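How a wavelet tree answers rank_c(L,i) with one bit-vector rank per level can be sketched as follows. This is a pointer-based, balanced-by-alphabet variant that stores plain prefix counts, so it shows the O(log σ) query path but not the |T| H_0(T) + o(|T|) space bound; the class name `WaveletTree` and its layout are assumptions of this sketch.

```python
class WaveletTree:
    """Balanced wavelet tree over the sorted alphabet of seq (not space-optimal)."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            self.pivot = self.alphabet[mid]      # symbols >= pivot go to the right child
            self.ones = [0]                      # ones[i] = symbols of seq[1..i] sent right
            for ch in seq:
                self.ones.append(self.ones[-1] + (ch >= self.pivot))
            self.left = WaveletTree([ch for ch in seq if ch < self.pivot],
                                    self.alphabet[:mid])
            self.right = WaveletTree([ch for ch in seq if ch >= self.pivot],
                                     self.alphabet[mid:])

    def rank(self, ch, i):
        """Number of occurrences of ch among the first i symbols."""
        node = self
        while len(node.alphabet) > 1:            # one bit-vector rank per level
            if ch < node.pivot:
                i -= node.ones[i]                # position among the 0-bits
                node = node.left
            else:
                i = node.ones[i]                 # position among the 1-bits
                node = node.right
        # A leaf holds a single symbol; i is its count unless ch is absent.
        return i if node.alphabet and node.alphabet[0] == ch else 0


wt = WaveletTree("alvkl#aae")
print(wt.rank("a", 7))  # 2
```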

Simple FM-index...
• We obtained a structure that
  - occupies O(|T| H_0(T)) bits, and
  - supports counting queries in O(|P| log σ) time.
• The original FM-index takes O(H_k |T|) bits, but only on a constant-size alphabet.
• Compression boosting can be applied to improve the simple FM-index to take only O(|T| H_k(T)) bits (FMMN 2004).

To partition or not...
• All alphabet-friendly solutions obtaining O(|T| H_k(T)) space for compressed suffix arrays use optimal partitioning of the BWT text, and store explicitly the distribution for each piece.
  - always a σ^(k+1) overhead.
• MTF + zeroth-order coding takes O(|T| H_k(T)) + O(σ^k) bits, but supporting queries on larger alphabets is non-trivial.

Run-Length FM-index
• We make the following changes to the previous FM-index variant:
  - L = bwt(T) is replaced by a sequence S[1,n'] and two bit-vectors B[1,|T|] and B'[1,|T|],
  - the cumulative array C_T[1,σ] is replaced by C_S[1,σ],
  - the wavelet tree is built on S, and
  - some formulas are changed.

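Run-Length FM-index...
Example with L = cccaaggatt:

i    1 2 3 4 5 6 7 8 9 10
L    c c c a a g g a t t
B    1 0 0 1 0 1 0 1 1 0
F    a a a c c c g g t t
B'   1 0 1 1 0 0 1 0 1 0

S = c a g a t

B marks the first position of each run of L, S holds one symbol per run, and B' marks the run starts after the runs have been stably sorted by symbol.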
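The structures in this example can be derived from L with a few lines. The sketch below returns plain Python lists and a dictionary; in the index itself B and B' would carry rank/select dictionaries and S a wavelet tree. The function name `run_length_structures` is illustrative, and C_S[c] is taken to be the number of symbols in S smaller than c, by analogy with C_T.

```python
def run_length_structures(last):
    """Return (S, B, B', C_S) for L = last.

    S   : one symbol per run of equal symbols in L
    B   : B[i] = 1 iff a run starts at position i of L
    B'  : run starts after the runs are stably sorted by symbol
    C_S : C_S[c] = number of symbols in S smaller than c
    """
    heads, lengths, b = [], [], []
    for i, ch in enumerate(last):
        if i == 0 or ch != last[i - 1]:
            heads.append(ch)
            lengths.append(1)
            b.append(1)
        else:
            lengths[-1] += 1
            b.append(0)
    # sorted() is stable, so runs of the same symbol keep their order in L.
    order = sorted(range(len(heads)), key=lambda r: heads[r])
    b_prime = []
    for r in order:
        b_prime.extend([1] + [0] * (lengths[r] - 1))
    c_s = {}
    for sorted_pos, r in enumerate(order):
        c_s.setdefault(heads[r], sorted_pos)   # first sorted position = count of smaller
    return "".join(heads), b, b_prime, c_s


s, b, b_prime, c_s = run_length_structures("cccaaggatt")
print(s)        # cagat
print(b)        # [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]
print(b_prime)  # [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
```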

Changes to formulas
• Recall that we need to compute C_T[c] + rank_c(L,i) in the backward search.
• Theorem: C_T[c] + rank_c(L,i) is equivalent to
  select_1(B', C_S[c] + 1 + rank_c(S, rank_1(B,i))) - 1,
  when L[i] ≠ c, and otherwise to
  select_1(B', C_S[c] + rank_c(S, rank_1(B,i))) + i - select_1(B, rank_1(B,i)).

Example, L[i]=c
With L = cccaaggatt, S = cagat, B = 1001010110, B' = 1011001010 and C_S[a] = 0:
LF[8] = select_1(B', C_S[a] + rank_a(S, rank_1(B,8))) + 8 - select_1(B, rank_1(B,8))
      = select_1(B', 0 + rank_a(S,4)) + 8 - select_1(B,4)
      = select_1(B', 0+2) + 8 - 8
      = 3
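The theorem can be checked mechanically against the direct definition C_T[c] + rank_c(L,i). The sketch below uses naive scans for rank and select, and it makes one assumption that the slides do not spell out: a select_1 query past the last 1 returns |T|+1, as if B' carried a sentinel 1 at its end. All function names are illustrative.

```python
def build(last):
    """Return (S, B, B', C_S) for L = last, as plain lists and a dict."""
    runs = []                                    # [symbol, length] for each run of L
    for ch in last:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    s = [ch for ch, _ in runs]
    b = [bit for _, n in runs for bit in [1] + [0] * (n - 1)]
    by_symbol = sorted(runs, key=lambda run: run[0])       # stable sort by symbol
    b_prime = [bit for _, n in by_symbol for bit in [1] + [0] * (n - 1)]
    c_s = {ch: sum(1 for x in s if x < ch) for ch in set(s)}
    return s, b, b_prime, c_s


def rank1(bits, i):
    """Number of 1s in bits[1..i]."""
    return sum(bits[:i])


def select1(bits, j):
    """Position of the j-th 1; len(bits)+1 if there is none (assumed sentinel)."""
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        seen += bit
        if bit and seen == j:
            return pos
    return len(bits) + 1


def rl_occ(c, i, s, b, b_prime, c_s):
    """C_T[c] + rank_c(L, i), computed from S, B, B' and C_S only."""
    run = rank1(b, i)                            # number of runs starting in L[1..i]
    runs_of_c = s[:run].count(c)                 # rank_c(S, rank_1(B, i))
    if run == 0 or s[run - 1] != c:              # case L[i] != c (or i = 0)
        return select1(b_prime, c_s[c] + 1 + runs_of_c) - 1
    # case L[i] = c: add the offset of i inside its own run
    return select1(b_prime, c_s[c] + runs_of_c) + i - select1(b, run)


s, b, b_prime, c_s = build("cccaaggatt")
print(rl_occ("a", 8, s, b, b_prime, c_s))  # 3
```

The printed value is LF[8] from the example; the symbol c is assumed to occur in L, otherwise C_S[c] is undefined.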

Space requirement
• C_S[1,σ] takes O(σ log |T|) bits.
• B and B' with rank/select dictionaries take 2|T| + o(|T|) bits.
• S represented using a wavelet tree occupies |S| H_0(S) + o(|S|) bits.
• In CPM 2004, we have shown that |S| ≤ H_k |T| + σ^k.

Comparison [figure omitted]
