On Embedding Edit Distance into L_1
Robert Krauthgamer (Weizmann Institute and IBM Almaden)
Based on joint work with Moses Charikar, with Yuval Rabani, with Parikshit Gopalan and T.S. Jayram, and with Alex Andoni.
Edit Distance
x ∈ Σ^n, y ∈ Σ^m
ED(x,y) = minimum number of character insertions, deletions, and substitutions that transform x into y [aka Levenshtein distance].
Examples:
• ED(00000, 1111) = 5
• ED(01010, 10101) = 2
Applications:
• Genomics
• Text processing
• Web search
For simplicity: m = n.
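The definition above can be checked with the standard dynamic program (a minimal sketch, not part of the talk itself):

```python
def edit_distance(x, y):
    """Levenshtein distance: min # of insertions, deletions, substitutions."""
    m, n = len(x), len(y)
    prev = list(range(n + 1))  # DP row for the empty prefix of x
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                          # delete x[i-1]
                         cur[j - 1] + 1,                       # insert y[j-1]
                         prev[j - 1] + (x[i - 1] != y[j - 1])) # substitute
        prev = cur
    return prev[n]

print(edit_distance("00000", "1111"))   # 5, as in the first example
print(edit_distance("01010", "10101"))  # 2: delete the leading 0, append a 1
```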
Embedding into L_1
An embedding of (X,d) into l_1 is a map f : X → l_1. It has distortion K ≥ 1 if
d(x,y) ≤ ||f(x) − f(y)||_1 ≤ K·d(x,y) for all x,y ∈ X.
A very powerful concept (when the distortion is small).
Goal: embed edit distance into l_1 with small distortion.
Motivation:
• Reduce algorithmic problems to l_1, e.g. Nearest-Neighbor Search.
• Study a simple metric space without a norm, e.g. the Hamming cube with cyclic shifts.
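As a sanity check of the distortion definition, here is a toy computation (hypothetical helper functions, not from the talk): the shortest-path metric of the 4-cycle embeds into l_1 in the plane with distortion 1, by mapping its vertices to the corners of the unit square.

```python
from itertools import combinations

def l1_dist(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def distortion(points, d, f):
    """Smallest K such that (after rescaling f) d(x,y) <= ||f(x)-f(y)||_1 <= K*d(x,y)."""
    expansion   = max(l1_dist(f[x], f[y]) / d(x, y) for x, y in combinations(points, 2))
    contraction = max(d(x, y) / l1_dist(f[x], f[y]) for x, y in combinations(points, 2))
    return expansion * contraction

cycle = [0, 1, 2, 3]
d_c4 = lambda x, y: min((x - y) % 4, (y - x) % 4)  # shortest-path metric on C4
f = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}  # corners of the unit square
print(distortion(cycle, d_c4, f))                  # 1.0: an isometric embedding
```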
Known Results for Edit Distance
Embed ({0,1}^n, ED) into L_1:
• Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani'05]; previously O(n^{2/3}) [Bar-Yossef-Jayram-K.-Kumar'04].
• Lower bound: Ω(log n) [K.-Rabani'06]; previously (log n)^{1/2−o(1)} [Khot-Naor'05] and 3/2 [Andoni-Deza-Gupta-Indyk-Raskhodnikova'03].
A large gap remains, despite significant effort!
Submetrics (Restricted Strings)
Why focus on submetrics of edit distance?
• May admit smaller distortion.
• Partial progress towards the general case.
• A framework for analyzing non-worst-case instances.
• Example (à la computational biology): handle only "typical" strings.
Class 1:
• A string is k-non-repetitive if all its length-k substrings are distinct.
• A random 0-1 string is WHP (2 log n)-non-repetitive.
• This yields a submetric containing a 1−o(1) fraction of the strings.
Class 2:
• Ulam metric = edit distance on all permutations (here Σ = {1,…,n}).
• Every permutation is 1-non-repetitive.
• Note: k-non-repetitive strings embed into the Ulam metric with distortion k.
Theory of Computation Seminar, Computer Science Department
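The k-non-repetitive condition from Class 1 is straightforward to test (a small sketch):

```python
def is_k_non_repetitive(s, k):
    """True iff all length-k substrings of s are distinct."""
    subs = [s[i:i + k] for i in range(len(s) - k + 1)]
    return len(subs) == len(set(subs))

print(is_k_non_repetitive("3141", 1))   # False: the symbol '1' repeats
print(is_k_non_repetitive("31415", 2))  # True: 31, 14, 41, 15 are distinct
print(is_k_non_repetitive("2431", 1))   # True: a permutation is 1-non-repetitive
```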
Known Results for the Ulam Metric
Embed ({0,1}^n, ED) into L_1:
• Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani'05]
• Lower bound: Ω(log n) [K.-Rabani'06]
• Large gap …
Embed the Ulam metric into L_1:
• Upper bound: O(log n) [Charikar-K.'06] (new proof by [Gopalan-Jayram-K.])
• Lower bound: Ω(log n / loglog n) [Andoni-K.'07] (actually qualitatively stronger)
• Near-tight!
Embedding of Permutations
Theorem [Charikar-K.'06]: The Ulam metric of dimension n embeds into l_1 with distortion O(log n).
Proof. Define f with one coordinate per pair of symbols a < b:
f_{a,b}(P) = 1 / (P^{-1}(b) − P^{-1}(a)),
where P^{-1}(a) denotes the position of symbol a in P.
Intuition:
• sign(f_{a,b}(P)) is an indicator for "a appears before b" in P.
• Thus |f_{a,b}(P) − f_{a,b}(Q)| "measures" whether {a,b} is an inversion in P vs. Q.
Claim 1: ||f(P) − f(Q)||_1 ≤ O(log n)·ED(P,Q).
• Suppose Q is obtained from P by moving one symbol, say s.
• The general case then follows by applying the triangle inequality along P, P', P'', …, Q.
• Total contribution of:
  • coordinates with s ∈ {a,b}: 2·Σ_k (1/k) ≤ O(log n);
  • other coordinates: Σ_k k·(1/k − 1/(k+1)) ≤ O(log n).
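The embedding can be computed directly for a toy permutation, assuming the coordinates are f_{a,b}(P) = 1/(P^{-1}(b) − P^{-1}(a)) per symbol pair (a reconstruction of the formula lost in the scrape, matching the sign intuition above):

```python
from itertools import combinations

def ulam_embedding(P):
    """One coordinate per symbol pair a < b: f_{a,b}(P) = 1/(pos(b) - pos(a))."""
    pos = {v: i for i, v in enumerate(P)}
    return {(a, b): 1.0 / (pos[b] - pos[a])
            for a, b in combinations(sorted(P), 2)}

def l1(f, g):
    return sum(abs(f[c] - g[c]) for c in f)

P = (1, 2, 3, 4)
Q = (2, 3, 4, 1)  # move symbol 1 to the end: ED(P,Q) = 2 (one deletion + one insertion)
fP, fQ = ulam_embedding(P), ulam_embedding(Q)
assert fP[(1, 2)] > 0 > fQ[(1, 2)]  # sign records whether 1 appears before 2
print(l1(fP, fQ))  # 11/3, which indeed lies between (1/2)*ED and O(log n)*ED
```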
Embedding of Permutations (cont.)
Claim 2: ||f(P) − f(Q)||_1 ≥ ½·ED(P,Q).
• Assume WLOG that P = identity.
• Edit Q into an increasing sequence (and thus into P) using quicksort:
  • choose a random pivot;
  • delete all characters inverted with respect to the pivot;
  • repeat recursively on the left and right portions.
• The surviving subsequence is increasing, hence ED(P,Q) ≤ 2·(#deletions).
• Now argue ||f(P) − f(Q)||_1 ≥ E[#quicksort deletions] ≥ ½·ED(P,Q):
  for every inversion (a,b) in Q,
  Pr[a deleted "by" pivot b] ≤ 1/(|Q^{-1}(a) − Q^{-1}(b)| + 1) ≤ 2·|f_{a,b}(P) − f_{a,b}(Q)|.
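The quicksort-deletion process in Claim 2 can be sketched directly (a hypothetical helper, not the talk's notation): it returns the number of deletions and the surviving subsequence, which is always increasing, so 2·(#deletions) edit operations sort the permutation.

```python
import random

def quicksort_delete(Q, rng):
    """Delete symbols inverted w.r.t. a random pivot; recurse on both sides.
    Returns (#deletions, surviving increasing subsequence)."""
    if len(Q) <= 1:
        return 0, list(Q)
    i = rng.randrange(len(Q))
    p = Q[i]
    dl, sl = quicksort_delete([x for x in Q[:i] if x < p], rng)      # left survivors
    dr, sr = quicksort_delete([x for x in Q[i + 1:] if x > p], rng)  # right survivors
    deleted_here = sum(x > p for x in Q[:i]) + sum(x < p for x in Q[i + 1:])
    return dl + dr + deleted_here, sl + [p] + sr

rng = random.Random(0)
Q = [2, 5, 1, 4, 3]
dels, surv = quicksort_delete(Q, rng)
print(dels, surv)  # survivors form an increasing subsequence of Q
```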
Lower Bound for 0-1 Strings
Theorem [K.-Rabani'06]: Embedding ({0,1}^n, ED) into L_1 requires distortion Ω(log n).
Proof sketch:
• Suppose ED embeds with distortion D ≥ 1, and let V = {0,1}^n.
• By the cut-cone characterization of L_1, the embedding f into L_1 can be written as a nonnegative combination of cut metrics:
  ||f(x) − f(y)||_1 = Σ_{A⊆V} λ_A·|1_A(x) − 1_A(y)|, with all λ_A ≥ 0.
• Hence, for every pair of symmetric probability distributions μ and ν over V×V, there is a cut A ⊆ V with
  (*)  E_ν[|1_A(x) − 1_A(y)|] / E_μ[|1_A(x) − 1_A(y)|] ≤ D · E_ν[ED(x,y)] / E_μ[ED(x,y)].
Lower Bound for 0-1 Strings (cont.)
• We choose:
  • μ = uniform over V×V;
  • ν = ½(H+S), where
    • H = random point + random bit flip (uniform over E_H = {(x,y) : ||x−y||_1 = 1});
    • S = random point + a cyclic shift (uniform over E_S = {(x, S(x))}).
• The RHS of (*) evaluates to O(D/n) by a counting argument.
• Main Lemma: for every A ⊆ V, the LHS of (*) is Ω(log n)/n.
  (Via analysis of Boolean functions on the hypercube.)
• Combining the two: Ω(log n)/n ≤ O(D/n), i.e. D = Ω(log n).
Lower Bound for 0-1 Strings (cont.)
• Recall ν = ½(H+S), where
  • H = random point + random bit flip;
  • S = random point + a cyclic shift.
• Lemma: for every A ⊆ V, the LHS of (*) is Ω(log n)/n.
• Proof sketch: assume the contrary, and define f = 1_A.
Lower Bound for 0-1 Strings (cont.)
• Claim: I_j ≥ 1/n^{1/8}  ⇒  I_{j+1} ≥ 1/(2n^{1/8})
  (here I_j denotes the influence of coordinate j on f = 1_A).
• Proof idea: the cyclic shift carries a flip of bit j to a flip of bit j+1, i.e. the square
  x → (flip bit j) → x + e_j, composed with the cyclic shift S,
  commutes: S(x + e_j) = S(x) + e_{j+1}.
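The commuting-square identity S(x + e_j) = S(x) + e_{j+1} can be checked mechanically (a sketch; the shift direction — bit j moves to position j+1 mod n — is an assumption):

```python
import random

def shift(x):
    """Cyclic shift: the bit at position j moves to position (j+1) mod n."""
    return x[-1:] + x[:-1]

def flip(x, j):
    """Flip bit j, i.e. add the unit vector e_j over GF(2)."""
    return x[:j] + ("1" if x[j] == "0" else "0") + x[j + 1:]

rng = random.Random(1)
n = 16
for _ in range(100):
    x = "".join(rng.choice("01") for _ in range(n))
    j = rng.randrange(n)
    # The square commutes: S(x + e_j) = S(x) + e_{j+1}
    assert shift(flip(x, j)) == flip(shift(x), (j + 1) % n)
print("commutation verified")
```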
Communication Complexity Approach
[Figure: Alice holds x ∈ Σ^n, Bob holds y ∈ Σ^n; they use shared randomness and exchange CC_A bits.]
Communication complexity model:
• Two-party protocol
• Shared randomness
• Promise (gap) version
• A = approximation factor
• CC_A = minimum number of bits needed to decide w.h.p.
Distance Estimation Problem: decide whether d(x,y) ≥ R or d(x,y) ≤ R/A.
Previous communication lower bounds:
• l_∞ [Saks-Sun'02, Bar-Yossef-Jayram-Kumar-Shivakumar'04]
• l_1 [Woodruff'04]
• Earthmover [Andoni-Indyk-K.'07]
Communication Bounds for Edit Distance
Theorem [Andoni-K.'07]: a trade-off between the approximation factor A and the communication CC_A. In particular:
• Corollary 1: approximation A = O(1) requires CC_A ≥ Ω(loglog n).
• Corollary 2: communication CC_A = O(1) requires A ≥ Ω̃(log n).
For Hamming distance: CC_{1+ε} = O(1/ε²) [Kushilevitz-Ostrovsky-Rabani'98], [Woodruff'04].
This is the first computational model in which edit distance is provably harder than Hamming!
Implications for embeddings:
• Embedding ED into L_1 (or squared-L_2) requires distortion Ω̃(log n).
• Furthermore, this holds both for 0-1 strings and for permutations (Ulam).
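The Hamming upper bound can be illustrated by a parity-sampling sketch in the spirit of [Kushilevitz-Ostrovsky-Rabani'98] (an illustrative Monte-Carlo sketch, not their actual protocol; all parameter choices below are assumptions): each shared-randomness subset keeps a coordinate with probability ≈ 1/R, and the two parties' parity bits disagree more often the larger the Hamming distance.

```python
import random

def make_subsets(n, rate, t, rng):
    """t random subsets of [n], each coordinate included with probability `rate`."""
    return [[i for i in range(n) if rng.random() < rate] for _ in range(t)]

def sketch(x, subsets):
    """One parity bit per subset (the subsets play the role of shared randomness)."""
    return [sum(x[i] for i in S) % 2 for S in subsets]

def disagreement(x, y, subsets):
    sx, sy = sketch(x, subsets), sketch(y, subsets)
    return sum(a != b for a, b in zip(sx, sy)) / len(subsets)

rng = random.Random(42)
n, R, t = 1000, 100, 2000
subsets = make_subsets(n, 1.0 / R, t, rng)
x = [rng.randrange(2) for _ in range(n)]
far, close = x[:], x[:]
for i in rng.sample(range(n), R):      far[i]   ^= 1  # Hamming distance R
for i in rng.sample(range(n), R // 2): close[i] ^= 1  # Hamming distance R/2
# Far pairs disagree on noticeably more parity bits than close pairs:
print(disagreement(x, far, subsets), disagreement(x, close, subsets))
```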
Proof Outline
Step 1 [Yao's minimax theorem]: reduce to distributional complexity.
• If CC_A ≤ k, then for every two distributions μ_far, μ_close there is a k-bit deterministic protocol with success probability ≥ 2/3.
Step 2 [Andoni-Indyk-K.'07]: reduce to 1-bit protocols.
• Further to the above, there are Boolean functions s_A, s_B : Σ^n → {0,1} with advantage
  Pr_{(x,y)∼μ_far}[s_A(x) ≠ s_B(y)] − Pr_{(x,y)∼μ_close}[s_A(x) ≠ s_B(y)] ≥ Ω(2^{−k}).
Step 3 [Fourier expansion]: reduce to one Fourier level.
• Furthermore, s_A, s_B depend only on ℓ fixed positions j_1, …, j_ℓ.
Step 4 [Choose distributions]: analyze (x,y) projected on these positions.
• Let μ_close, μ_far include ε-noise → handles a high Fourier level.
• Let μ_close, μ_far include (few/more) block rotations → handles a low Fourier level.
• Key property: the distribution of (x_{j_1},…,x_{j_ℓ}, y_{j_1},…,y_{j_ℓ}) is "statistically close" under μ_far vs. under μ_close.
Step 5: reduce Ulam to {0,1}^n.
• A random mapping Σ → {0,1} works.
(Compare this additive analysis to our previous analysis.)
Summary of Known Results
Embed ({0,1}^n, ED) into L_1:
• Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani'05]
• Lower bound: Ω(log n) [K.-Rabani'06]
Embed the Ulam metric into L_1:
• Upper bound: O(log n) [Charikar-K.'06] (new proof by [Gopalan-Jayram-K.])
• Lower bound: Ω(log n / loglog n) [Andoni-K.'07] (qualitatively much stronger)
Concluding Remarks
• The computational lens:
  • Study distance-estimation problems rather than embeddings.
• Open problems:
  • Still a large gap for 0-1 strings.
  • Variants of edit distance (e.g. edit distance with block moves).
  • Rule out other algorithms (e.g. a "CC model" capturing Indyk's NNS for l_∞).
• Recent progress:
  • Bypass L_1 embeddings by devising new techniques,
    e.g. using the max-product (l_∞ over l_1) for NNS under the Ulam metric [Andoni-Indyk-K.].
  • Analyze/design "good" heuristics,
    e.g. smoothed analysis [Andoni-K.].
Thank you!