Inferring Strings from Suffix Trees and Links on a Binary Alphabet. Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda. Kyushu University, Japan. Outline: Reverse Problems on String Data Structures; Suffix Tree, Suffix Links; Reverse Problem on Suffix Trees; Efficient Solution.
Inferring Strings from Suffix Trees and Links on a Binary Alphabet. Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda. Kyushu University, Japan
Outline • Reverse Problems on String Data Structures • Suffix Tree, Suffix Links • Reverse Problem on Suffix Trees • Efficient Solution • Inferring a Labeling Function • Suffix Tour Graph • On a Binary Alphabet
Reverse Problems on String Data Structures • Direct Problem • Given a string, compute its data structure. • Reverse Problem • Given a data structure, compute its string. • Solving reverse problems could lead to a deeper understanding of strings and data structures. • Hot topic: border arrays, suffix arrays, DAWGs, etc. [Figure: string → data structure (direct problem); data structure → string (reverse problem).]
Reverse Problems on String Data Structures • KMP Failure Function: [Gawrychowski et al., 2010] • Runs: [Matsubara et al., 2010] • Palindromic Structure: [I et al., 2010] • Prefix Table: [Clement et al., 2009] • Cover Array: [Crochemore et al., 2010] • LPF Table: [He et al., 2011] • Border Array: [Franek et al., 2002], [Duval et al., 2005] • Suffix Array: [Duval and Lefebvre, 2002], [Bannai et al., 2003], [Schürmann et al., 2005] • DAWG: [Bannai et al., 2003] • Parameterized Border Array: [I et al., 2009], [I et al., 2010] We consider the reverse problem on suffix trees.
Suffix Tree, Suffix Links • The suffix tree of w is the compacted trie that represents all suffixes of w. • The suffix link of a node points to the node that represents the substring obtained by deleting the first character of the node's string. [Figure: suffix tree of w = ababaaa$ (positions 1 to 8) with its suffix links; each leaf carries the starting index of its suffix.]
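To make the two definitions concrete, here is a minimal brute-force sketch (not the linear-time construction). It assumes w ends with a unique terminator $ and relies on the standard fact that, in that case, the inner nodes of the suffix tree are exactly the substrings that are followed by two or more distinct characters. The function name `inner_nodes_and_links` is hypothetical.

```python
def inner_nodes_and_links(w):
    """Brute force: inner nodes of the suffix tree of w (as strings)
    and the suffix link of every non-root inner node.
    Assumes w ends with a unique terminator such as '$'."""
    followers = {}
    n = len(w)
    for i in range(n):
        for j in range(i, n):
            # w[i:j] occurs at position i and is followed by w[j]
            followers.setdefault(w[i:j], set()).add(w[j])
    # A substring is an inner node iff it branches (the root is "").
    inner = {x for x, chars in followers.items() if len(chars) >= 2}
    # The suffix link drops the first character.
    links = {x: x[1:] for x in inner if x}
    return inner, links


inner, links = inner_nodes_and_links("ababaaa$")
print(sorted(inner))
print(links)
```

Every link target is itself an inner node, which is why f can be given as a function on inner nodes only.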
Direct Problem on Suffix Trees • Input: a string w. • Output: the suffix tree of w. • It can be solved in linear time [e.g., Ukkonen, 1995]. [Figure: suffix tree of w = ababaaa$ with leaves labeled by suffix index.]
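For small inputs the direct problem can be illustrated with a naive recursive construction. This is a sketch only: it is not Ukkonen's algorithm and does not run in linear time. It assumes w ends with a unique $ and that children are ordered by character code (so $ < a < b); the names `suffix_tree` and `leaves` are hypothetical.

```python
def suffix_tree(w, starts=None, depth=0):
    """Naive suffix tree of w (w must end with a unique '$').
    A leaf is its 1-based suffix index; an inner node is a list of
    (edge_label, child) pairs sorted by the first character of the label."""
    if starts is None:
        starts = list(range(len(w)))
    if len(starts) == 1:
        return starts[0] + 1
    # Group the suffixes by their character at the current depth.
    groups = {}
    for s in starts:
        groups.setdefault(w[s + depth], []).append(s)
    children = []
    for c in sorted(groups):
        grp = groups[c]
        if len(grp) == 1:
            # Leaf edge: labeled with the rest of the suffix.
            children.append((w[grp[0] + depth:], grp[0] + 1))
            continue
        # Extend the edge while all suffixes in the group still agree.
        d = depth + 1
        while len({w[s + d] for s in grp}) == 1:
            d += 1
        label = w[grp[0] + depth:grp[0] + d]
        children.append((label, suffix_tree(w, grp, d)))
    return children


def leaves(node):
    """Suffix indices of the leaves, read left to right."""
    if isinstance(node, int):
        return [node]
    return [i for _, child in node for i in leaves(child)]


tree = suffix_tree("ababaaa$")
print(leaves(tree))
```

Reading the leaves left to right gives the suffixes in lexicographic order, which is the information the reverse problem has to recover.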
Reverse Problem on Suffix Trees • Input: an unlabeled ordered rooted tree T. • Output: a string which realizes T (if such exists). A string w is said to realize T if the suffix tree of w is isomorphic to T.
Reverse Problem on Suffix Trees • Input: an unlabeled ordered rooted tree T and links f. • Output: a string which realizes T and f (if such exists). A string w is said to realize (T, f) if the suffix tree of w and its suffix links are isomorphic to T and f. [Figure: tree T with link function f.]
Reverse Problem on Suffix Trees • Input: an unlabeled ordered rooted tree T and links f. • Output: a string which realizes T and f (if such exists). A string w is said to realize (T, f) if the suffix tree of w and its suffix links are isomorphic to T and f. [Figure: tree T with link function f, realized by ababaaa$ (positions 1 to 8), with the suffix indices written at the leaves.]
Reverse Problem on Suffix Trees • Input: an unlabeled ordered rooted tree T and links f for inner nodes. • Output: a string which realizes T and f (if such exists). A string w is said to realize (T, f) if the suffix tree of w and its suffix links for inner nodes are isomorphic to T and f. [Figure: tree T with link function f on inner nodes; strings shown: ababaaa$, aaababa$, aababaa$, abaaaba$.]
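The definition of "realizes" can be checked directly on tiny inputs. The sketch below is a brute force over all binary strings of a given length, which is a different (and exponential) method from the one developed in the talk; it is meant only to pin down what the input and output look like. The representation is an assumption made here: a node is identified by its path of child indices from the root, a leaf is the empty tuple, and children are ordered by character code ($ < a < b). The names `shape_and_links` and `realizers` are hypothetical.

```python
from itertools import product


def shape_and_links(w):
    """Unlabeled ordered suffix tree of w (nested tuples, () is a leaf) and
    the suffix links of non-root inner nodes as a dict path -> path.
    Assumes w ends with a unique '$'."""
    path_of = {}  # string of an inner node -> its path from the root

    def build(starts, depth, path):
        if len(starts) == 1:
            return ()
        path_of[w[starts[0]:starts[0] + depth]] = path
        groups = {}
        for s in starts:
            groups.setdefault(w[s + depth], []).append(s)
        kids = []
        for i, c in enumerate(sorted(groups)):
            grp, d = groups[c], depth + 1
            # Skip over the non-branching part of the edge.
            while len(grp) > 1 and len({w[s + d] for s in grp}) == 1:
                d += 1
            kids.append(build(grp, d, path + (i,)))
        return tuple(kids)

    shape = build(list(range(len(w))), 0, ())
    links = {p: path_of[x[1:]] for x, p in path_of.items() if x}
    return shape, links


def realizers(shape, links, n, alphabet="ab"):
    """All strings of length n over `alphabet` (plus final '$') realizing (shape, links)."""
    found = []
    for body in product(alphabet, repeat=n - 1):
        w = "".join(body) + "$"
        if shape_and_links(w) == (shape, links):
            found.append(w)
    return found


T, f = shape_and_links("ababaaa$")
print(realizers(T, f, 8))
```

For the (T, f) obtained from ababaaa$, the four strings listed on the slide are among the results, so a realizing string is in general not unique.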
How can we solve this problem? • Input: an unlabeled ordered rooted tree T and links f for inner nodes. • Output: a string which realizes T and f (if such exists).
How can we solve this problem? • Input: an unlabeled ordered rooted tree T and links f for inner nodes. • Output: a string which realizes T and f (if such exists). If we can infer a “correct” order of leaves, we can get a string. [Figure: tree for ababaaa$ (positions 1 to 8) with the suffix indices written at the leaves.]
How can we solve this problem? • Input: an unlabeled ordered rooted tree T and links f for inner nodes. • Output: a string which realizes T and f (if such exists). If we can infer a “correct” order of leaves, we can get a string. A naïve solution that considers all permutations of the leaves takes O(n!) time. We need to take into account some “constraints” on the leaves' order, which are implicitly given by the input (T, f). We introduce suffix tour graphs to capture the constraints. [Figure: tree for ababaaa$ (positions 1 to 8) with the suffix indices written at the leaves.]
Outline • Reverse Problems on String Data Structures • Suffix Tree, Suffix Links • Reverse Problem on Suffix Trees • Efficient Solution • Inferring a Labeling Function • Suffix Tour Graph • On a Binary Alphabet
Notations • Input (T, f), where T is an ordered rooted tree and f : Vin \ {root} → Vin • V : the set of nodes of T • E : the set of edges of T • root : the root node of T • Vin : the set of inner nodes of T • Vleaf : the set of leaf nodes of T • For any v ∈ V: • V(v), Vin(v) and Vleaf(v) respectively represent the set of nodes, inner nodes and leaf nodes of the subtree rooted at v. • children(v) : the set of children of v. • ch_i(v) : the i-th child of v. • par(v) : the parent of v.
Preconditions of an Input • The first child of the root node is a leaf: ch_1(root) ∈ Vleaf. • There exists a path of function f from any node v0 ∈ Vin \ {root} to the root node: ∀v0 ∈ Vin \ {root}, ∃v1, v2, …, vk s.t. vk = root and vi = f(vi−1) for any 1 ≤ i ≤ k. [Figure: one input that satisfies the preconditions and one that does not.]
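Both preconditions are easy to test before any further work. The sketch below assumes a representation that is not part of the slides: the tree is a nested tuple of children with () as a leaf, a node is named by its path of child indices from the root (so the root is the empty path), and `links` maps the path of every non-root inner node to the path of its f-image. The function name `check_preconditions` is hypothetical, and the simple walk shown here is not tuned for linear time.

```python
def check_preconditions(shape, links):
    """shape: nested tuples of children, () is a leaf.
    links: dict mapping the path of each non-root inner node to the path of f(node).
    The root has the empty path ()."""
    root = ()
    # Precondition 1: the first child of the root is a leaf.
    if len(shape) == 0 or shape[0] != ():
        return False
    # Precondition 2: following f from any inner node eventually reaches the root.
    for start in links:
        seen, v = set(), start
        while v != root:
            if v in seen or v not in links:
                return False  # a cycle of links, or a link leading nowhere
            seen.add(v)
            v = links[v]
    return True


# (T, f) of ababaaa$ in this representation.
T = ((), ((), ((), ()), ((), ())), ((), ()))
f = {(1,): (), (1, 1): (1,), (1, 2): (2,), (2,): (1,)}
print(check_preconditions(T, f))
```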
In what follows… • Infer the first character of the string of each edge, namely, a labeling function g : E → Σ ∪ {$}. [Figure: the example tree with the first character written on each edge.]
Conditions for g to hold • The edge from root to its first child is labeled with $: g((root, ch_1(root))) = $. Infer the first character of the string of each edge, namely, a labeling function g : E → Σ ∪ {$}.
Conditions for g to hold • The edge from root to its first child is labeled with $: g((root, ch_1(root))) = $. • The labels for the children are sorted in lexicographical order: ∀v ∈ V, 1 ≤ i < |children(v)|, g((v, ch_i(v))) < g((v, ch_{i+1}(v))). Infer the first character of the string of each edge, namely, a labeling function g : E → Σ ∪ {$}. [Figure: a node v whose child edges are labeled a, c, d, e in increasing order.]
Conditions for g to hold • Condition on links of parent-child nodes: ∀v ∈ V \ {root} with vp = par(v) (wherever f(vp) is defined), there exists u ∈ children(f(vp)) s.t. g((vp, v)) = g((f(vp), u)). In addition, if v ∈ Vin then f(v) ∈ V(u). Infer the first character of the string of each edge, namely, a labeling function g : E → Σ ∪ {$}. [Figure: edge (vp, v) labeled c, edge (f(vp), u) labeled c, and f(v) inside the subtree of u.]
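The three conditions on g can be checked mechanically once a candidate labeling is fixed. The sketch below is one possible reading of the conditions under assumptions made here, not the authors' implementation: nodes are paths of child indices from the root, the tree is a nested tuple with () as a leaf, `links` is f on non-root inner nodes, `g` maps the path of a child to the first character of the edge entering it, and characters compare by code point ($ < a < b). Condition 3 is applied only to nodes whose parent is not the root, since f is not defined at the root. The name `check_conditions` is hypothetical.

```python
def check_conditions(shape, links, g):
    """Check Conditions 1 to 3 for a candidate labeling g."""
    def node_at(path):
        node = shape
        for i in path:
            node = node[i]
        return node

    def kids(path):
        return [path + (i,) for i in range(len(node_at(path)))]

    # Condition 1: the edge to the first child of the root is labeled '$'.
    if g.get((0,)) != "$":
        return False
    stack = [()]
    while stack:
        vp = stack.pop()
        labels = [g[v] for v in kids(vp)]
        # Condition 2: child labels strictly increase from left to right.
        if any(a >= b for a, b in zip(labels, labels[1:])):
            return False
        for v in kids(vp):
            stack.append(v)
            if vp == ():
                continue
            # Condition 3: f(vp) has a child u whose edge carries the same label ...
            match = [u for u in kids(links[vp]) if g[u] == g[v]]
            if not match:
                return False
            u = match[0]
            # ... and if v is an inner node, f(v) lies in the subtree of u.
            if node_at(v) and links[v][:len(u)] != u:
                return False
    return True
```

In this path representation "f(v) is in the subtree of u" is simply a prefix test on paths.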
Labels for Inner Edges • By Condition 3, the labels for inner edges (edges from inner nodes to inner nodes) can be uniquely determined. • If the determined labels contradict Condition 2, the input turns out to be invalid. [Figure: two labeled examples, one of them marked invalid.]
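Lg and Dg • When a labeling function g satisfies Conditions 1~3, we define the following values for any node v: Lg(v) = |{u ∈ Vleaf | f(par(u)) = par(v), g((par(u), u)) = g((par(v), v))}|, Dg(v) = Σ_{y ∈ V(v)} Lg(y). • Lg(v) is the # of leaves u in the following situation: the edges (par(u), u) and (par(v), v) carry the same label c, and f maps par(u) to par(v). [Figure: the example tree annotated with the values of Lg at each node.]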
Lg and Dg • When a labeling function g satisfies Conditions 1~3, we define the following values for any node v: Lg(v) = |{u ∈ Vleaf | f(par(u)) = par(v), g((par(u), u)) = g((par(v), v))}|, Dg(v) = Σ_{y ∈ V(v)} Lg(y). • Lg(v) is the # of leaves u in the following situation: the edges (par(u), u) and (par(v), v) carry the same label c, and f maps par(u) to par(v). • Constraints on the leaves' order: the next leaf of such a u is in Vleaf(v). Dg(v) leaves in Vleaf(v) have constraints from such u's. [Figure: the example tree annotated with the values of Lg and Dg at each node.]
Conditions for g to hold • The # of leaves of the subtree rooted at v must be at least Dg(v): |Vleaf(v)| ≥ Dg(v). [Figure: the example tree annotated with Lg, Dg and |Vleaf(v)| − Dg(v) at each node.]
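The values Lg and Dg and the leaf-count condition can be computed with two passes over the tree. The following is a sketch under assumptions that are not on the slides: nodes are paths of child indices from the root, the tree is a nested tuple with () as a leaf, and `g` maps the path of a child to the label of its incoming edge. One more assumption, made here so that the root is covered, is that a leaf hanging directly off the root counts toward Lg(root), because f is not defined at the root. The name `leaf_constraints` is hypothetical.

```python
def leaf_constraints(shape, links, g):
    """Return (L, D, ok): L_g and D_g per node, and whether
    |Vleaf(v)| >= D_g(v) holds for every node v."""
    def node_at(path):
        node = shape
        for i in path:
            node = node[i]
        return node

    def kids(path):
        return [path + (i,) for i in range(len(node_at(path)))]

    nodes, stack = [], [()]
    while stack:
        v = stack.pop()
        nodes.append(v)
        stack.extend(kids(v))

    L = dict.fromkeys(nodes, 0)
    for u in nodes:
        if node_at(u):          # inner node: only leaves are counted
            continue
        pu = u[:-1]
        if pu == ():
            L[()] += 1          # assumption: leaves under the root point at the root
            continue
        for v in kids(links[pu]):
            if g[v] == g[u]:
                L[v] += 1

    D, n_leaves = {}, {}
    # Longest paths first, so children are processed before their parents.
    for v in sorted(nodes, key=len, reverse=True):
        D[v] = L[v] + sum(D[c] for c in kids(v))
        n_leaves[v] = sum(n_leaves[c] for c in kids(v)) if node_at(v) else 1
    ok = all(n_leaves[v] >= D[v] for v in nodes)
    return L, D, ok


# (T, f, g) of ababaaa$ in this representation.
T = ((), ((), ((), ()), ((), ())), ((), ()))
f = {(1,): (), (1, 1): (1,), (1, 2): (2,), (2,): (1,)}
g = {(0,): "$", (1,): "a", (2,): "b",
     (1, 0): "$", (1, 1): "a", (1, 2): "b",
     (1, 1, 0): "$", (1, 1, 1): "a",
     (1, 2, 0): "a", (1, 2, 1): "b",
     (2, 0): "a", (2, 1): "b"}
L, D, ok = leaf_constraints(T, f, g)
print(L[(1, 1)], D[(1,)], D[()], ok)
```

The difference |Vleaf(v)| − Dg(v) is the number of leaves below v whose predecessor is not yet pinned down.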
Suffix Tour Graph • When a labeling function g satisfies Conditions 1~4, we define the suffix tour graph STGg = (VG, EG) w.r.t. g: VG = V, EG = {(u, v) | u ∈ Vleaf, f(par(u)) = par(v), g((par(u), u)) = g((par(v), v))} ∪ {(u, v)^k | (u, v) ∈ E, k = |Vleaf(v)| − Dg(v)}, where (u, v)^k denotes k copies of the edge (u, v). [Figure: STGg for the example tree, with |Vleaf(v)| − Dg(v) written at each node.]
Suffix Tour Graph • When a labeling function g satisfies Conditions 1~4, we define the suffix tour graph STGg = (VG, EG) w.r.t. g: VG = V, EG = {(u, v) | u ∈ Vleaf, f(par(u)) = par(v), g((par(u), u)) = g((par(v), v))} ∪ {(u, v)^k | (u, v) ∈ E, k = |Vleaf(v)| − Dg(v)}, where (u, v)^k denotes k copies of the edge (u, v). • STGg is an Eulerian graph (possibly disconnected). [Figure: STGg for the example tree, with |Vleaf(v)| − Dg(v) written at each node.]
Necessary and Sufficient Condition for (T, f) and g to be valid • STGg has an Eulerian cycle that contains root and all leaves. • When there exists such a cycle, a correct order of leaves that realizes (T, f) and g can be obtained from the order of visiting leaves on the cycle. • STGg is an Eulerian graph (possibly disconnected). [Figure: STGg for the example, with the leaves numbered 1 to 8 in the order they are visited.]
Necessary and Sufficient Condition for (T, f) and g to be valid • STGg has an Eulerian cycle that contains root and all leaves. • When there exists such a cycle, a correct order of leaves that realizes (T, f) and g can be obtained from the order of visiting leaves on the cycle. • STGg is an Eulerian graph (possibly disconnected). [Figure: example of an invalid labeling function g and its STGg.]
Computing an Eulerian Cycle • An Eulerian cycle can be computed in time linear in the graph size. • We also showed that the size of STGg is linear in the input size. • Given g, we can check whether g is valid by constructing STGg and then computing an Eulerian cycle, all in time linear in the input size. What remains is to find a valid labeling function g.
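The validity check and the recovery of the string can be sketched with Hierholzer's algorithm, a standard linear-time method for Eulerian cycles (the slides do not say which algorithm was used). The adjacency lists below are an assumption: they encode the suffix tour graph of the running example, worked out by hand from the definitions, with the leaf under the root's $ edge pointing back at the root. Node names are hypothetical: an inner node is named by its string, and a leaf by its parent's string plus the label of its edge, so the first character of a leaf's name is the first character of its suffix.

```python
def eulerian_cycle(adj, start):
    """Hierholzer's algorithm on a directed multigraph given as adjacency lists.
    Returns the closed walk from `start` as a list of vertices."""
    adj = {v: list(ws) for v, ws in adj.items()}  # work on a copy
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if adj.get(v):
            stack.append(adj[v].pop())   # follow an unused edge
        else:
            cycle.append(stack.pop())    # dead end: emit the vertex
    cycle.reverse()
    return cycle


# Suffix tour graph of the running example (hand-derived, see lead-in).
stg = {
    "root": ["a"], "a": ["aba"],
    "aba": ["aba/a", "aba/b"], "aa": ["aa/$", "aa/a"],
    "aba/b": ["ba/b"], "ba/b": ["aba"],
    "aba/a": ["ba/a"], "ba/a": ["aa"],
    "aa/a": ["aa"], "aa/$": ["a/$"],
    "a/$": ["$"], "$": ["root"],
}
leaf_names = {"$", "a/$", "aa/$", "aa/a", "aba/a", "aba/b", "ba/a", "ba/b"}

cycle = eulerian_cycle(stg, "root")
n_edges = sum(len(ws) for ws in stg.values())
# Valid iff the walk is closed, uses every edge, and visits every leaf.
valid = (cycle[0] == cycle[-1] == "root"
         and len(cycle) == n_edges + 1
         and leaf_names <= set(cycle))
# The i-th visited leaf is suffix i; its first character is w[i].
word = "".join(v[0] for v in cycle if v in leaf_names)
print(valid, word)
```

The isolated inner node ba has no edges in this graph, which is why the slides describe STGg as possibly disconnected while only requiring the cycle to cover the root and all leaves.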
On a Binary Alphabet • In the case of binary alphabets, due to Conditions 1~4 it suffices to consider at most five labeling functions. [Figure: STGg for one candidate labeling function; the following slides step through the other candidates.]
On a Binary Alphabet • In the case of binary alphabets, due to Conditions 1~4 it suffices to consider at most five labeling functions. • Theorem: On a binary alphabet, the reverse problem on suffix trees can be solved in linear time. [Figure: STGg for a candidate labeling function.]
Summary • We introduced suffix tour graphs, which lead to an efficient solution of the reverse problem on suffix trees. (Note that they can also be applied to non-binary cases.) • On a binary alphabet, we showed that the problem can be solved in time linear in the input size. Open Problems • What about non-binary cases? ⇒ It seems to be difficult, since the # of labeling functions g increases combinatorially. • What about the problem in which suffix links are not given? ⇒ I do not have any idea.
Exercise? • Compute a string which realizes this tree and links. [Figure: the exercise tree with its links.]
Hints • These labels are determined uniquely. [Figure: the exercise tree with the uniquely determined edge labels ($, a, b) filled in.]
Exercise? • Compute a string which realizes this tree and links. • Strings shown on the slide: babaabaaababaa$, babaababaaabaa$, babaaababaabaa$