
Approximating Edit Distance in Near-Linear Time


Presentation Transcript


  1. Approximating Edit Distance in Near-Linear Time Alexandr Andoni (MIT) Joint work with Krzysztof Onak (MIT)

  2. Edit Distance • For two strings x, y ∈ Σ^n • ed(x,y) = minimum number of edit operations to transform x into y • Edit operations = insertion/deletion/substitution • Important in: computational biology, text processing, etc. • Example: ed(0101010, 1010101) = 2
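
A quick illustration of the definition (an editorial addition, not part of the talk): the classic dynamic program for ed(x,y), which reproduces the example ed(0101010, 1010101) = 2.

```python
def edit_distance(x: str, y: str) -> int:
    """Classic O(|x|*|y|) dynamic program for edit distance
    (insertions, deletions, substitutions)."""
    n, m = len(x), len(y)
    prev = list(range(m + 1))        # distances between x[:0] and y[:j]
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete x[i-1]
                          curr[j - 1] + 1,     # insert y[j-1]
                          prev[j - 1] + cost)  # substitute (or match)
        prev = curr
    return prev[m]

print(edit_distance("0101010", "1010101"))  # 2
```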

  3. Computing Edit Distance • Problem: compute ed(x,y) for given x, y ∈ {0,1}^n • Exactly: • O(n^2) [Levenshtein'65] • O(n^2/log^2 n) for |Σ| = O(1) [Masek-Paterson'80] • Approximately in n^{1+o(1)} time: • n^{1/3+o(1)} approximation [Batu-Ergun-Sahinalp'06], improving over [Myers'86, Bar-Yossef-Jayram-Krauthgamer-Kumar'04] • Sublinear time: • distinguish ed ≤ n^{1-ε} vs ed ≥ n/100 in n^{1-2ε} time [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami'03]

  4. Computing via embedding into ℓ_1 • Embedding: f : {0,1}^n → ℓ_1 • such that ed(x,y) ≈ ||f(x) - f(y)||_1 • up to some distortion (= approximation) • Can compute ed(x,y) in the time it takes to compute f(x) • Best embedding by [Ostrovsky-Rabani'05]: • distortion = 2^{Õ(√log n)} • Computation time: ~n^2, randomized (and similar dimension) • Helps for nearest neighbor search, sketching, but not computation…
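
A minimal illustration of the reduction just described (editorial addition): once an embedding f with distortion D is available, ed(x,y) is approximated by a single ℓ_1 distance, so the cost is dominated by evaluating f. The embedding `f` here is a hypothetical placeholder, not the [Ostrovsky-Rabani'05] construction.

```python
import numpy as np

def estimate_ed_via_embedding(x: str, y: str, f) -> float:
    """If f satisfies ed(x,y) <= ||f(x) - f(y)||_1 <= D * ed(x,y),
    then this single l_1 distance is a D-approximation to ed(x,y).
    `f` is a placeholder for an actual embedding into R^d."""
    return float(np.sum(np.abs(f(x) - f(y))))
```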

  5. Our result • Theorem: Can compute ed(x,y) in • n·2^{Õ(√log n)} time with • 2^{Õ(√log n)} approximation • While it uses some ideas of the [OR'05] embedding, it is not an algorithm for computing the [OR'05] embedding

  6. Sketcher’s hat • 2 examples of “sketches” from embeddings… • [Johnson-Lindenstrauss]: pick a random k-subspace of R^n; then for any q_1,…,q_n ∈ R^n, if q̃_i is the projection of q_i, then, w.h.p., • ||q_i - q_j||_2 ≈ ||q̃_i - q̃_j||_2 up to O(1) distortion • for k = O(log n) • [Bourgain]: given n vectors q_i, can construct n vectors q̃_i of dimension k = O(log^2 n) such that • ||q_i - q_j||_1 ≈ ||q̃_i - q̃_j||_1 up to O(log n) distortion.
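
A minimal sketch of the first example (editorial addition): a random Gaussian map into k = O(log n) dimensions is used here in place of an exact random-subspace projection; it preserves pairwise ℓ_2 distances up to O(1) distortion w.h.p., which is the property the slide states.

```python
import numpy as np

def jl_sketch(points: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Johnson-Lindenstrauss-style sketch: random Gaussian projection
    R^n -> R^k, scaled by 1/sqrt(k), preserves all pairwise l_2
    distances up to O(1) distortion w.h.p. when k = O(log n)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((points.shape[1], k)) / np.sqrt(k)
    return points @ A

# toy usage: 100 points in R^1000 sketched into 64 dimensions
pts = np.random.default_rng(1).standard_normal((100, 1000))
sk = jl_sketch(pts, k=64)
i, j = 3, 17
print(np.linalg.norm(pts[i] - pts[j]), np.linalg.norm(sk[i] - sk[j]))
```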

  7. Our Algorithm • [Figure: the string z = xy, with a substring z[i:i+m] starting at position i] • For each length m in some fixed set L ⊆ [n], compute vectors v_i^m ∈ ℓ_1 such that • ||v_i^m – v_j^m||_1 ≈ ed( z[i:i+m], z[j:j+m] ) • Dimension of v_i^m is only O(log^2 n) • Vectors {v_i^m} are computed recursively from {v_i^k} corresponding to shorter substrings (smaller k ∈ L) • Output: ed(x,y) ≈ ||v_1^{n/2} – v_{n/2+1}^{n/2}||_1 (i.e., for m = n/2 = |x| = |y|)
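
A high-level skeleton of the recursion on this slide (editorial addition). The helper callables `base_level` and `combine_level` are hypothetical placeholders; the real work of building {v_i^m} from {v_i^k} is the Main Lemma machinery on the following slides.

```python
def approximate_ed(x: str, y: str, lengths, base_level, combine_level):
    """Skeleton of the recursion only; `lengths` is the fixed set L of
    substring lengths in increasing order, ending with n/2 = |x| = |y|.
    `base_level(z, k)` builds vectors v_i^k for the smallest length;
    `combine_level(z, vectors, m)` builds {v_i^m} from shorter ones."""
    z = x + y                                   # work over z = xy
    n = len(z)
    vectors = base_level(z, lengths[0])         # {v_i^k} for smallest k in L
    for m in lengths[1:]:
        vectors = combine_level(z, vectors, m)  # {v_i^m} from {v_i^k}, k < m
    m = n // 2
    v_x, v_y = vectors[0], vectors[m]           # v_1^{n/2} and v_{n/2+1}^{n/2}
    return sum(abs(a - b) for a, b in zip(v_x, v_y))  # l_1 distance
```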

  8. Idea: intuition • ||v_i^m – v_j^m||_1 ≈ ed( z[i:i+m], z[j:j+m] ) • How to compute {v_i^m} from {v_i^k} for k << m ? • [OR] show how to compute some {w_i^m} with the same property, but of very high dimension (~m) • Can apply [Bourgain] to the vectors {w_i^m} • Obtain vectors {v_i^m} of polylogarithmic dimension • Incurs “only” O(log n) distortion at this step of the recursion (which turns out to be ok) • Challenge: how to do this in Õ(n) time?!

  9. Key step: embeddings of shorter substrings • Main Lemma: fix n vectors v_i ∈ ℓ_1^k, of dimension k = O(log^2 n). • Let s < n. Define A_i = {v_i, v_{i+1}, …, v_{i+s-1}}. • Then we can compute vectors q_i ∈ ℓ_1^k such that • ||q_i – q_j||_1 ≈ EMD(A_i, A_j) up to distortion log^{O(1)} n • Computing the q_i’s takes Õ(n) time. • (The q_i play the role of embeddings of longer substrings*; EMD(A,B) = min-cost bipartite matching*. * cheating…)
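
To make the EMD footnote concrete (editorial addition): a brute-force computation of EMD(A,B) as a minimum-cost perfect bipartite matching with ℓ_1 ground distance, via the Hungarian algorithm in SciPy. This only illustrates the definition; it is not the near-linear-time machinery of the lemma.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd(A: np.ndarray, B: np.ndarray) -> float:
    """EMD(A, B) between two equal-size sets of vectors: minimum-cost
    perfect bipartite matching under l_1 ground distance (brute force)."""
    # cost[i, j] = ||A[i] - B[j]||_1
    cost = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].sum())

A = np.array([[0, 0], [2, 1]])
B = np.array([[0, 1], [3, 1]])
print(emd(A, B))  # matches (0,0)-(0,1) and (2,1)-(3,1): cost 1 + 1 = 2
```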

  10. Proof of Main Lemma • Notation: “low” = log^{O(1)} n; graph-metric = shortest path on a weighted graph; sparse = Õ(n) edges; min_k M is the semi-metric on M^k with “distance” d_{min,M}(x,y) = min_{i=1..k} d_M(x_i, y_i) • Chain of embeddings:
  EMD over n sets A_i
  → [O(log^2 n)] min_low ℓ_1^high
  → [O(1)] min_low ℓ_1^low
  → [O(log n)] min_low tree-metric
  → [O(log^3 n)] sparse graph-metric
  → [O(log n), [Bourgain], efficient] ℓ_1^low

  11. Step 1 [O(log^2 n)]: EMD over n sets A_i → min_low ℓ_1^high • q.e.d.

  12. Step 2 [O(1)]: min_low ℓ_1^high → min_low ℓ_1^low • Lemma 2: can embed an n-point set from ℓ_1^H into min_{O(log n)} ℓ_1^k, for k = log^3 n, with O(1) distortion. • Use weak dimensionality reduction in ℓ_1 • Thm [Indyk’06]: Let A be a random* matrix of size H by k = log^3 n. Then for any x, y, letting x̃ = Ax, ỹ = Ay: • no contraction: ||x̃ - ỹ||_1 ≥ ||x - y||_1 (w.h.p.) • 5-expansion: ||x̃ - ỹ||_1 ≤ 5·||x - y||_1 (with 0.01 probability) • Just use O(log n) of such embeddings • Their min is an O(1) approximation to ||x - y||_1, w.h.p.
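
A hedged sketch of the min-of-sketches estimator described above (editorial addition). The slide only says “random* matrix”; Cauchy (1-stable) entries with a crude 1/k rescaling are assumed here purely for illustration, and the constants are not claimed to match [Indyk’06]. The point is the structure: sketch with O(log n) independent maps and take the minimum of the sketched ℓ_1 distances.

```python
import numpy as np

def make_sketches(H: int, k: int, t: int, seed: int = 0):
    """t independent random H-by-k maps for weak l_1 dimension reduction.
    Cauchy (1-stable) entries are an assumption for illustration only."""
    rng = np.random.default_rng(seed)
    return [rng.standard_cauchy((H, k)) / k for _ in range(t)]

def min_sketch_distance(x, y, sketches):
    """The min-estimator from the slide: sketch x and y with every map
    and return the minimum of the sketched l_1 distances."""
    return min(float(np.abs((x - y) @ A).sum()) for A in sketches)

# toy usage (dimensions illustrative only): t = O(log n) independent maps
H, k, t = 1000, 30, 20
x = np.random.default_rng(1).random(H)
y = np.random.default_rng(2).random(H)
print(np.abs(x - y).sum(), min_sketch_distance(x, y, make_sketches(H, k, t)))
```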

  13. Efficiency of Steps 1+2 • From steps 1+2, we get some embedding f() of the sets A_i = {v_i, v_{i+1}, …, v_{i+s-1}} into min_low ℓ_1^low • Naively this would take Ω(n·s) = Ω(n^2) time to compute all f(A_i) • Save using linearity of sketches: • f() is linear: f(A) = Σ_{a∈A} f(a) • Then f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1}) • Compute the f(A_i) in order, for a total of Õ(n) time
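
A minimal sketch of the sliding-window update just described (editorial addition): because f is linear over sets, each f(A_i) is obtained from f(A_{i-1}) with one subtraction and one addition instead of summing s terms.

```python
import numpy as np

def all_window_sketches(f_values: np.ndarray, s: int) -> np.ndarray:
    """Given f(v_1), ..., f(v_n) as rows, compute f(A_i) for every window
    A_i = {v_i, ..., v_{i+s-1}} via
        f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1}),
    in O(n * dim) total time instead of O(n * s * dim)."""
    n, dim = f_values.shape
    out = np.empty((n - s + 1, dim))
    out[0] = f_values[:s].sum(axis=0)           # f(A_1) computed directly
    for i in range(1, n - s + 1):
        out[i] = out[i - 1] - f_values[i - 1] + f_values[i + s - 1]
    return out

# toy usage: n = 8 one-dimensional sketches, windows of length s = 3
vals = np.arange(8, dtype=float).reshape(-1, 1)
print(all_window_sketches(vals, 3).ravel())  # [ 3.  6.  9. 12. 15. 18.]
```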

  14. Step 3 [O(log n)]: min_low ℓ_1^low → min_low tree-metric • Lemma 3: can embed ℓ_1 over {0..M}^p into min_low tree-metric, with O(log n) distortion. • For each Δ = a power of 2, take O(log n) random grids. Each grid gives a min-coordinate. • [Figure: randomly shifted grid of side Δ]
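
A hedged sketch of one randomly shifted grid at scale Δ (editorial addition): each point of {0..M}^p is snapped to the grid cell containing it, so nearby points tend to share a cell; doing this for every power-of-two Δ yields the hierarchical (tree-metric) coordinates the slide refers to. This illustrates the idea only, not the exact construction.

```python
import numpy as np

def random_grid_cells(points: np.ndarray, delta: int, rng) -> np.ndarray:
    """One randomly shifted grid of side `delta` over {0..M}^p: returns,
    for each point, the integer coordinates of its grid cell. Points in
    the same cell are merged at this scale; nesting grids over all
    powers-of-two delta forms a random hierarchy (a tree metric)."""
    p = points.shape[1]
    shift = rng.integers(0, delta, size=p)   # random translation of the grid
    return (points + shift) // delta         # cell coordinates per point

rng = np.random.default_rng(0)
pts = np.array([[1, 2], [2, 2], [9, 9]])
for delta in (2, 4, 8):                      # delta ranges over powers of 2
    print(delta, random_grid_cells(pts, delta, rng).tolist())
```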

  15. Step 4 [O(log^3 n)]: min_low tree-metric → sparse graph-metric • Lemma 4: suppose we have n points in min_low tree-metric which approximates a metric up to distortion D. Then we can embed them into a graph-metric of size Õ(n) with distortion D.

  16. Step 5 [O(log n)]: sparse graph-metric → ℓ_1^low • Lemma 5: Given a graph with m edges, can embed the graph-metric into ℓ_1^low with O(log n) distortion in Õ(m) time. • Just implement [Bourgain]’s embedding: • Choose O(log^2 n) sets B_i • Need to compute the distance from each node to each B_i • For each B_i, can compute its distance to each node using Dijkstra’s algorithm in Õ(m) time
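
A hedged sketch of this step (editorial addition): sample random sets B_i of geometrically varying sizes, run one multi-source Dijkstra per set to get d(v, B_i) for every node v, and use those distances as ℓ_1 coordinates. Constants, set sizes, and normalization are not tuned to the lemma; this only shows the structure named on the slide.

```python
import heapq, random
from collections import defaultdict
from math import inf, log2

def multisource_dijkstra(adj, sources):
    """Distance from every node to the nearest node of `sources`, via one
    Dijkstra run with all sources seeded at distance 0. `adj` maps each
    node to a list of (neighbor, edge_weight) pairs."""
    dist = defaultdict(lambda: inf)
    heap = [(0.0, s) for s in sources]
    for _, s in heap:
        dist[s] = 0.0
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def bourgain_embedding(adj, nodes, seed=0):
    """Bourgain-style embedding of a (connected, sparse) graph metric into
    l_1: coordinates are distances to O(log^2 n) random sets B_i; each
    coordinate costs one Dijkstra run, i.e. O~(m) per set."""
    rng = random.Random(seed)
    n, L = len(nodes), max(1, int(log2(len(nodes))))
    coords = {v: [] for v in nodes}
    for scale in range(1, L + 1):            # set sizes roughly n / 2^scale
        for _ in range(L):                   # O(log n) sets per scale
            B = rng.sample(nodes, max(1, n >> scale))
            dist = multisource_dijkstra(adj, B)
            for v in nodes:
                coords[v].append(dist[v])
    return coords
```

The ℓ_1 distance between coords[u] and coords[v] (suitably scaled) then approximates the shortest-path distance between u and v up to the O(log n) distortion stated in the lemma.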

  17. Summary of Main Lemma • Min-product helps to get low dimension (~ small-size sketch) • bypasses the impossibility of dimension reduction in ℓ_1 • Ok that it is not a metric, as long as it is close to a metric • Chain again:
  EMD over n sets A_i
  → [O(log^2 n)] min_low ℓ_1^high
  → [O(1), oblivious] min_low ℓ_1^low
  → [O(log n)] min_low tree-metric
  → [O(log^3 n)] sparse graph-metric
  → [O(log n), non-oblivious] ℓ_1^low

  18. Conclusion • Theorem: can compute ed(x,y) in n·2^{Õ(√log n)} time with 2^{Õ(√log n)} approximation
