Succinct Indexes for Strings, Binary Relations and Multi-labeled Trees. Jérémy Barbay, Meng He, J. Ian Munro, University of Waterloo; S. Srinivasa Rao, IT University of Copenhagen.
Background: Succinct Data Structures • What are succinct data structures • Jacobson 1989 • Why succinct data structures • Large data sets in modern applications: textual, genomic, spatial or geometric • An implementation: Delpratt et al. 2006 • Succinct integrated encodings • Main data and auxiliary data structures
Our Problem: Succinct Indexes • Use of the concept in previous work • Compact PAT trees: Clark & Munro 1996 • Lower bounds: Demaine & López-Ortiz 2001; Miltersen 2005 • Upper bounds: Sadakane & Grossi 2006 • Definition of succinct indexes in data structure design • ADT: primitive access operators • Succinct index: more powerful operators
Succinct Integrated Encodings [Diagram: Main Data + Auxiliary Data Structures → Navigational Operations]
Succinct Indexes [Diagram: Main Data + Succinct Index → Navigational Operations]
Succinct Indexes vs. Integrated Encodings • Maximizing the freedom of the encoding of the main data • Allowing incremental design • Supporting implicit data
Strings: Definitions • Notation • Alphabet: [σ]={1, 2, …, σ} • String: S[1..n] • Operations: • string_access(x): S[x] • string_rank(α, x): number of occurrences of α in S[1..x] • string_select(α, r): position of the rth occurrence of α in S
Strings: An Example • S = a a b a c c c d a d d a b b b c • string_access(8) = d • string_rank(a, 8) = 3 • string_select(b, 3) = 14
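The three operations can be checked against the example with a naive reference version. This is a minimal sketch for verification only: it scans the string in linear time and is not one of the succinct structures discussed in the talk. Positions are 1-based, as on the slide.

```python
# Naive reference versions of the three string operations (1-based positions).
# Each call takes O(n) time; the point is only to check the slide's answers.
S = "aabacccdaddabbbc"  # the slide's example string with the spaces removed


def string_access(s, x):
    """Return S[x]."""
    return s[x - 1]


def string_rank(s, alpha, x):
    """Number of occurrences of alpha in S[1..x]."""
    return s[:x].count(alpha)


def string_select(s, alpha, r):
    """Position of the r-th occurrence of alpha in S, or None if there is none."""
    seen = 0
    for pos, ch in enumerate(s, start=1):
        if ch == alpha:
            seen += 1
            if seen == r:
                return pos
    return None


print(string_access(S, 8), string_rank(S, "a", 8), string_select(S, "b", 3))
# prints: d 3 14
```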
Strings: Previous Results • Succinct Integrated Encodings • Wavelet trees: Grossi et al. 2003 • Space: nH_0 + o(n)∙lgσ bits • Time: O(lgσ) time for all three operations • Golynski et al. 2006 • Space: n (lgσ + o(lgσ)) bits • Time: O(lglgσ) time for string_access and string_rank, O(1) time for string_select
Strings: Our Results • Succinct Indexes • ADT • string_access: f(n, σ) time • Space: n∙o(lgσ) bits • Operations • string_rank: O(lglgσ lglglgσ (f(n, σ)+lglgσ)) • string_select: O(lglglgσ (f(n, σ)+lglgσ)) • Other operations: negations
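The separation between the ADT and the index can be illustrated with a toy structure that never looks at how the string is stored: it only calls an access function. This is a sketch of the idea, not the construction from the talk. The class name `SampledRankIndex` and the parameter `block` are hypothetical, the sampling scheme is a simple stand-in, and its space is not tuned to the n∙o(lgσ) bound.

```python
from collections import Counter


class SampledRankIndex:
    """Toy index: keeps symbol counts only at every `block`-th position and
    reads everything else through the access operator it was given."""

    def __init__(self, access, n, block=4):
        self.access = access        # the ADT: access(x) returns S[x], 1-based
        self.n = n
        self.block = block
        self.samples = [Counter()]  # samples[j] = symbol counts in S[1..j*block]
        running = Counter()
        for x in range(1, n + 1):
            running[access(x)] += 1
            if x % block == 0:
                self.samples.append(running.copy())

    def rank(self, alpha, x):
        """Occurrences of alpha in S[1..x]: one sample plus a short scan."""
        j = x // self.block
        count = self.samples[j][alpha]
        for p in range(j * self.block + 1, x + 1):
            if self.access(p) == alpha:
                count += 1
        return count

    def select(self, alpha, r):
        """Position of the r-th alpha, by binary search over rank."""
        if r < 1 or self.rank(alpha, self.n) < r:
            return None
        lo, hi = 1, self.n
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank(alpha, mid) >= r:
                hi = mid
            else:
                lo = mid + 1
        return lo


# The main data can be stored any way at all, as long as access(x) works.
text = "aabacccdaddabbbc"
index = SampledRankIndex(lambda x: text[x - 1], len(text))
print(index.rank("a", 8), index.select("b", 3))
```

Because the index touches the data only through `access`, the same index code would work unchanged over a compressed or implicit representation of the string, which is the design freedom the slides argue for.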
Binary Relations: Definitions • Notation • Binary relation: R ⊆ [n] × [σ] • Number of objects: n; number of labels: σ • Number of object-label pairs: t • Operations • object_access(x, r): r-th label associated with x • label_access(x, α): whether x is associated with α • label_rank(α, x): number of objects labeled α up to object x • label_select(α, r): r-th object labeled α
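Binary Relations: An Example
[Figure: 0/1 matrix of the relation; columns are objects (n = 5), rows are labels (σ = 4)]
object:   1 2 3 4 5
label 1:  0 1 0 1 0
label 2:  0 0 0 1 0
label 3:  1 0 1 1 0
label 4:  1 1 0 0 1
• object_access(1, 2) = 4 • label_access(2, 3) = false • label_rank(3, 4) = 3 • label_select(4, 3) = 5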
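The four operations can be checked against this example with a direct matrix lookup. The sketch below assumes one particular reading of the slide's flattened 0/1 grid: four label rows over five object columns. That reading is an assumption, though it reproduces all four answers on the slide. The functions are linear-time reference versions, not succinct structures.

```python
# Rows are labels 1..4, columns are objects 1..5 (assumed reading of the grid).
MATRIX = [
    [0, 1, 0, 1, 0],  # label 1
    [0, 0, 0, 1, 0],  # label 2
    [1, 0, 1, 1, 0],  # label 3
    [1, 1, 0, 0, 1],  # label 4
]


def label_access(x, alpha):
    """Whether object x is associated with label alpha."""
    return MATRIX[alpha - 1][x - 1] == 1


def object_access(x, r):
    """The r-th label associated with object x, or None if there is none."""
    labels = [a for a in range(1, len(MATRIX) + 1) if label_access(x, a)]
    return labels[r - 1] if 1 <= r <= len(labels) else None


def label_rank(alpha, x):
    """Number of objects labeled alpha among objects 1..x."""
    return sum(MATRIX[alpha - 1][:x])


def label_select(alpha, r):
    """The r-th object labeled alpha, or None if there is none."""
    seen = 0
    for obj, bit in enumerate(MATRIX[alpha - 1], start=1):
        seen += bit
        if bit and seen == r:
            return obj
    return None


print(object_access(1, 2), label_access(2, 3), label_rank(3, 4), label_select(4, 3))
# prints: 4 False 3 5
```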
Binary Relations: Previous Results • Succinct Integrated Encodings • Barbay et al., 2006 • Space: t (lgσ + o(lgσ)) bits • Time: O(lglgσ) time for object_access, label_rank and label_access, O(1) time for label_select
Binary Relations: Our Results • Succinct Indexes • ADT: • object_access: f(n,σ,t) • Space: • t∙o(lgσ) bits • Time: • label_rank and label_access: O(lglgσ lglglgσ (f(n,σ,t) + lglgσ)) • label_select: O(lglglgσ (f(n,σ,t) + lglgσ))
Multi-labeled Trees: Definitions • Notation • Number of nodes: n • Number of labels: σ • Number of node-label pairs: t • Operations • α-descendant • α-child • α-ancestor
Multi-labeled Trees: An Example
[Figure: an 11-node tree; each node is shown with its number and its label set]
1 {a, c, d}
  2 {c, d}
    3 {a}
    4 {a, b}
      5 {b}
      6 {a, b}
    7 {b, d}
  8 {a, c}
    9 {c}
    10 {c, d}
    11 {b, c, d}
• Node 2 is a c-ancestor of node 6 • Node 6 is a b-descendant of node 2 • Node 10 is a d-child of node 8
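The three example statements can be verified with a small parent-pointer model of the tree. The parent links below are an assumed reading of the flattened figure (nodes numbered in preorder, with nodes 5 and 6 under node 4); the label sets are copied from the slide. The function names are hypothetical and the checks are naive, so this shows only what the queries mean, not how the succinct index answers them.

```python
# Parent pointers (assumed from the figure) and label sets (from the slide).
PARENT = {1: None, 2: 1, 3: 2, 4: 2, 5: 4, 6: 4, 7: 2, 8: 1, 9: 8, 10: 8, 11: 8}
LABELS = {
    1: {"a", "c", "d"}, 2: {"c", "d"}, 3: {"a"}, 4: {"a", "b"}, 5: {"b"},
    6: {"a", "b"}, 7: {"b", "d"}, 8: {"a", "c"}, 9: {"c"}, 10: {"c", "d"},
    11: {"b", "c", "d"},
}


def ancestors(x):
    """Proper ancestors of node x, nearest first."""
    out = []
    p = PARENT[x]
    while p is not None:
        out.append(p)
        p = PARENT[p]
    return out


def is_alpha_ancestor(y, alpha, x):
    """True if y is an ancestor of x and y carries label alpha."""
    return alpha in LABELS[y] and y in ancestors(x)


def is_alpha_descendant(y, alpha, x):
    """True if y is a descendant of x and y carries label alpha."""
    return alpha in LABELS[y] and x in ancestors(y)


def is_alpha_child(y, alpha, x):
    """True if y is a child of x and y carries label alpha."""
    return alpha in LABELS[y] and PARENT[y] == x


print(is_alpha_ancestor(2, "c", 6), is_alpha_descendant(6, "b", 2), is_alpha_child(10, "d", 8))
# prints: True True True
```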
Multi-labeled Trees: Previous Results • Labeled trees • Geary et al. 2004 • Ferragina et al. 2005 • Barbay et al. 2006 • Multi-labeled trees • Barbay et al. 2006
Multi-labeled Trees: Our Approach • Traversal Orders • Preorder • DFUDS order • Ordinal Trees: DFUDS • Benoit et al. 1999 & 2005 • Jansson et al. 2007 • Two Binary Relations • Nodes in preorder & labels • Nodes in DFUDS order & labels [Figure: the example tree with its nodes numbered in preorder and in DFUDS order]
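The two traversal orders can be computed side by side on the example tree. The sketch assumes that DFUDS order means: visit the nodes in preorder, and each time a node is visited, give its children the next consecutive numbers. That assumption is consistent with the numbers that survive from the slide's figure (the root's children get 2 and 3, and sibling groups get consecutive numbers). The child lists are an assumed reading of the example tree, keyed by preorder number.

```python
# Child lists keyed by preorder number (assumed shape of the example tree).
CHILDREN = {1: [2, 8], 2: [3, 4, 7], 4: [5, 6], 8: [9, 10, 11]}


def preorder(root):
    """Nodes in preorder: a node first, then its subtrees left to right."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(CHILDREN.get(node, [])))
    return order


def dfuds_order(root):
    """Nodes in DFUDS order: walk in preorder and, at each visited node,
    append all of its children, so siblings get consecutive numbers."""
    order = [root]
    for node in preorder(root):
        order.extend(CHILDREN.get(node, []))
    return order


dfuds_rank = {node: i for i, node in enumerate(dfuds_order(1), start=1)}
print(preorder(1))
print(dfuds_order(1))
# prints: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
# prints: [1, 2, 8, 3, 4, 7, 5, 6, 9, 10, 11]
```

In preorder the descendants of a node occupy a contiguous range, while in DFUDS order the children of a node do. This suggests why one binary relation is kept per order: a range in one order serves descendant queries and a range in the other serves child queries.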
Multi-labeled Trees: Our Results • Succinct Indexes • ADT: node_label(x, r) • Supporting α-child/descendant queries: t∙o(lgσ) bits • Supporting α-child/descendant/ancestor queries: t∙(lgρ + o(lgρ) + o(lgσ)) bits (ρ: recursivity) • Supporting α-child/descendant/ancestor queries of node x after another node y
Applications • Compressed Succinct Encodings • Strings • Space: nH_k + o(nlgσ) bits • Operations: • string_access: O(1) • string_rank: O((lglgσ)^2 lglglgσ) • string_select: O(lglgσ lglglgσ) • First high-order entropy-compressed encoding supporting rank/select efficiently • Other Data Structures
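The space bounds are stated in terms of empirical entropy, which the slides do not define. The sketch below uses the standard definitions as an assumption: H_0 is the entropy of the symbol frequencies, and H_k averages H_0 over the symbols that follow each length-k context. The function names `h0` and `hk` are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum((c / n) * log2(n / c) for c in Counter(s).values())


def hk(s, k):
    """k-th order empirical entropy: group each symbol by the k symbols
    before it, then weight each group's H_0 by the group's size."""
    if not s:
        return 0.0
    if k == 0:
        return h0(s)
    followers = defaultdict(list)
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)


S = "aabacccdaddabbbc"  # the example string from the earlier slide
n = len(S)
print(f"n*H_0 = {n * h0(S):.2f} bits, n*H_1 = {n * hk(S, 1):.2f} bits")
```

On any input H_k is at most H_0, so an nH_k bound is never weaker than an nH_0 bound on the leading term.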
Applications (Continued) • High-order entropy-compressed text indexes for large alphabets • Notation: n: text size, σ: alphabet size, m: pattern length, occ: number of occurrences • Our results • Space: nH_k + o(nlgσ) bits • Pattern searching: O(m lglgσ + occ lg^(1+ε) n lglgσ) • Previous results: a lgσ factor instead of lglgσ, or incompressible
Conclusions • We showed the importance of succinct indexes in the design of succinct data structures by designing: • Succinct representation of multi-labeled trees that supports efficient retrieval of ancestors / children / descendants by label • First high-order entropy compressed representation of strings supporting rank/select • High-order entropy compressed text indexes for large alphabets
Conclusions (Continued) The concept of succinct indexes is useful in designing succinct data structures … it maximizes the freedom of the encoding of the main data and leads to a rich choice of design tradeoffs.