Proximity algorithms for nearly-doubling spaces

Lee-Ad Gottlieb, Robert Krauthgamer, Weizmann Institute

Proximity problems
• In an arbitrary metric space, some proximity problems are hard
• For example, nearest neighbor search requires Θ(n) time in the worst case
• The doubling dimension parameterizes the “bad” case…
[Figure: a query point q at distance ~1 from each of the other points]

Doubling Dimension
• Definition: Ball B(x,r) = all points within distance r from x.
• The doubling constant (of a metric M) is the minimum value λ > 0 such that every ball can be covered by λ balls of half the radius
  • First used by [Ass-83], algorithmically by [Cla-97].
• The doubling dimension is dim(M) = log_2 λ(M) [GKL-03]
• A metric is doubling if its doubling dimension is constant
• Packing property of doubling spaces
  • A set with diameter D and minimum inter-point distance a contains at most (D/a)^O(log λ) points
[Figure: example with λ ≤ 7]
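The definition above can be explored on small point sets. The following is a minimal sketch, not the authors' method: it greedily covers every ball with half-radius balls and reports the largest cover it needed, which gives an upper bound on the doubling constant (computing the exact value is hard in general). The function names and the example metric are illustrative assumptions.

```python
import math

def greedy_cover_centers(ball, dist, radius):
    """Greedily pick centers from `ball` so that every point of `ball`
    lies within `radius` of some chosen center."""
    centers = []
    for q in ball:
        # q becomes a new center only if no existing center already covers it
        if all(dist(q, c) > radius for c in centers):
            centers.append(q)
    return centers

def doubling_constant_upper_bound(points, dist):
    """Upper bound on the doubling constant of a finite point set.
    Only radii equal to a pairwise distance are tried: between two such
    radii the ball contents do not change, while the half-radius balls
    only grow, so covering cannot get harder."""
    bound = 1
    for x in points:
        radii = sorted({dist(x, y) for y in points if dist(x, y) > 0})
        for r in radii:
            ball = [y for y in points if dist(x, y) <= r]
            bound = max(bound, len(greedy_cover_centers(ball, dist, r / 2)))
    return bound

# Hypothetical example: five equally spaced points on a line.
line = [0, 1, 2, 3, 4]
line_dist = lambda a, b: abs(a - b)
lam = doubling_constant_upper_bound(line, line_dist)
dim_bound = math.log2(lam)  # corresponding bound on the doubling dimension
```

On this example the ball B(2,1) = {1,2,3} needs three balls of radius ½, so the bound of 3 happens to be tight here.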

Applications
• In the past few years, many algorithmic tasks have been analyzed via the doubling dimension
  • For example, approximate nearest neighbor search can be executed in time λ^O(1) log n
• Some other algorithms analyzed via the doubling dimension
  • Nearest neighbor search [KL-04, BKL-06, CG-06]
  • Clustering [Tal-04, ABS-08, FM-10]
  • Spanner construction [GGN-06, CG-06, DPP-06, GR-08]
  • Routing [KSW-04, Sil-05, AGGM-06, KRXY-07, KRX-08]
  • Travelling Salesperson [Tal-04]
  • Machine learning [BLL-09, GKK-10]
• Message: This is an active line of research…

Problem
• Most algorithms developed for doubling spaces are not robust
  • Algorithmic guarantees don’t hold for nearly-doubling spaces
  • If a small fraction of the working set possesses high doubling dimension, algorithmic performance degrades.
• This problem motivates the following key task
  • Given an n-point set S and target dimension d*
  • Remove from S the fewest points so that the remaining set has doubling dimension at most d*

Two paradigms
• How can removing a few “bad” points help? Two models:
• 1. Ignore the bad points
  • Outlier detection.
    • [GHPT-05] cluster based on similarity, seek a large subset with low intrinsic dimension.
  • Algorithms with slack. Throw bad points into the slack
    • [KRXY-07] gave a routing algorithm with guarantees for most of the input points.
    • [FM-10] gave a kinetic clustering algorithm for most of the input points.
    • [GKK-10] gave a machine learning algorithm: a small subset doesn’t interfere with learning

Two paradigms
• How can removing a few “bad” points help? Two models:
• 2. Tailor a different algorithm for the bad points
  • Example: Spanner construction. A spanner is an edge subset of the full graph
  • Good points: low doubling dimension, so a sparse spanner with nice properties (low stretch and degree)
  • Bad points: Take the full graph
  • If the number of bad points is O(n^0.5), we have a spanner with O(n) edges
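The edge count in the last bullet can be checked with a line of arithmetic. This sketch assumes that “take the full graph” means the complete graph on the bad points only; the function name is hypothetical.

```python
import math

def bad_point_clique_edges(n):
    """Edges in a complete graph on floor(sqrt(n)) bad points."""
    b = math.isqrt(n)          # number of bad points, assumed to be about n^0.5
    return b * (b - 1) // 2    # complete graph on b vertices

# For n = 10,000 points there are 100 bad points and 4,950 clique edges,
# which is below n, so the total stays O(n).
edges = bad_point_clique_edges(10_000)
```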

Results
• Recall our key problem
  • Given an n-point set S and target dimension d*
  • Remove from S the fewest points so that the remaining set has doubling dimension at most d*
• This problem is NP-hard
  • Even determining the doubling dimension of a point set exactly is NP-hard!
  • Proof on the next slide
  • But the doubling dimension can be approximated within a constant factor…
• Our contribution: bicriteria approximation algorithm
  • In time 2^O(d*) n^3, we remove a number of points arbitrarily close to optimal, while achieving doubling dimension 4d* + O(1)
  • We can also achieve near-linear runtime, at the cost of slightly higher dimension

Warm up
• Lemma: It is NP-hard to determine the doubling dimension of a set S
• Reduction: from vertex cover with bounded degree Δ = n^(1/2).
  • The size of any vertex cover is at least n^(1/2).
• Construction: A set S of n points corresponding to the vertex set V.
  • Let d(u,v) = ½ if the corresponding vertices are connected by an edge
  • Let d(u,v) = 1 if the corresponding vertices aren’t connected
• Analysis:
  • Any subset of S found in a ball of radius ½ has at most n^(1/2) points, by the degree bound of the original graph
  • S is a ball of radius 1. The minimum covering of all of S with balls of radius ½ is equal to the minimum vertex cover of V.
• Note: the reduction preserves hardness of approximation
• Corollary: It is NP-hard to determine if removing k points from S can leave a set with doubling dimension d*.
  • So our problem is hard as well.
[Figure: example with distances ½, ½, 1]
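The construction step can be sketched in a few lines. This is an illustration of the distance function only, not of the hardness proof; vertex labels 0..n-1 and all function names are assumptions made for the example.

```python
from itertools import permutations

def reduction_metric(edges):
    """Distance function of the construction: ½ across an edge, 1 otherwise."""
    edge_set = {frozenset(e) for e in edges}
    def d(u, v):
        if u == v:
            return 0.0
        return 0.5 if frozenset((u, v)) in edge_set else 1.0
    return d

def satisfies_triangle_inequality(n, d):
    """Check d(u,w) <= d(u,v) + d(v,w) for all ordered triples of distinct points."""
    return all(d(u, w) <= d(u, v) + d(v, w)
               for u, v, w in permutations(range(n), 3))

def half_ball(n, d, v):
    """Points within distance ½ of v: v itself and its graph neighbors."""
    return [u for u in range(n) if d(u, v) <= 0.5]

# Hypothetical example: the path graph 0 - 1 - 2.
d = reduction_metric([(0, 1), (1, 2)])
```

Since every nonzero distance is ½ or 1, the triangle inequality always holds (½ + ½ ≥ 1), so any graph yields a valid metric.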

Bicriteria algorithm
• Recall that the doubling constant (of a metric M) is
  • the minimum value λ > 0 such that every r-radius ball can be covered by λ balls of half the radius
• Define the related notion of density constant as
  • the minimum value m > 0 such that every r-radius ball contains at most m points at mutual interpoint distance at least r/2
• Nice property: The density constant can only decrease under the removal of points, unlike the doubling constant.
• We can show that
  • √m(S) ≤ λ(S) ≤ m(S)
  • it’s NP-hard to compute the density constant (ratio-preserving reduction from independent set)
[Figure: example with λ = 2, m = 3]
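For intuition, the density constant of a very small point set can be computed by exhaustive search. The sketch below is exponential-time (consistent with the problem being NP-hard) and is meant only as an illustration; the function names are assumptions, and points are assumed to be distinct.

```python
from itertools import combinations

def max_separated_subset_size(points, dist, min_sep):
    """Brute force: size of the largest subset whose points are pairwise
    at distance at least min_sep."""
    for k in range(len(points), 0, -1):
        for subset in combinations(points, k):
            if all(dist(a, b) >= min_sep for a, b in combinations(subset, 2)):
                return k
    return 0

def density_constant(points, dist):
    """Exhaustive density constant of a tiny point set.
    Only radii equal to a pairwise distance are tried: for a fixed set of
    ball contents, the smallest such radius gives the weakest separation
    requirement r/2."""
    best = 1 if points else 0
    for x in points:
        for r in sorted({dist(x, y) for y in points if dist(x, y) > 0}):
            ball = [y for y in points if dist(x, y) <= r]
            best = max(best, max_separated_subset_size(ball, dist, r / 2))
    return best

# Hypothetical example: five equally spaced points on a line.
line_dist = lambda a, b: abs(a - b)
m_full = density_constant([0, 1, 2, 3, 4], line_dist)
m_sub = density_constant([0, 1, 2], line_dist)   # after removing two points
```

The second call illustrates the “nice property”: removing points cannot increase the value.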

Bicriteria algorithm
• We will give a bicriteria algorithm for the density constant. Problem statement:
  • Given an n-point set S and target density constant m*
  • Remove from S the fewest points so that the remaining set has density constant at most m*
• A bicriteria algorithm for the density constant is itself a bicriteria algorithm for the doubling constant
  • within a quadratic factor

Witness set
• Given a set S, a subset S' is a witness set for the density constant if, for some radius r,
  • S' lies within a ball of radius r
  • All points of S' are at interpoint distance at least r/2
• Note that S' is a concise proof that the density constant of S is at least |S'|
• Theorem: Fix a value m' < m(S). A witness set of S of size at least √m' can be found in time 2^O(m*) n^3
• Proof outline:
  • For each point p and radius r define the r-ball of p.
  • Greedily cover all points in the r-ball with disjoint balls of radius r/2.
  • Then cover all points in each r/2 ball with disjoint balls of radius r/4.
  • Since there exists in S a witness set of size m(S), there exists a p and r so that
    • either there are √m(S) balls of radius r/2, and these form a witness set, or
    • one r/2 ball covers √m(S) balls of radius r/4, and these form a witness set.
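The proof outline suggests a two-level greedy search. The following is a simplified sketch under stated assumptions, not the authors' subroutine: it tries only radii equal to pairwise distances, uses a plain greedy selection at each level, and makes no claim about the running time or the √m' guarantee of the theorem. All names are hypothetical.

```python
def greedy_separated(points, dist, sep):
    """Greedily pick points that are pairwise at distance at least sep."""
    chosen = []
    for q in points:
        if all(dist(q, c) >= sep for c in chosen):
            chosen.append(q)
    return chosen

def find_witness_set(points, dist):
    """Return (witness, center, radius): the witness lies in the ball of the
    given radius around center, with pairwise distances >= radius / 2."""
    best = ([], None, None)
    for p in points:
        for r in sorted({dist(p, q) for q in points if dist(p, q) > 0}):
            ball = [q for q in points if dist(p, q) <= r]
            # Level 1: r/2-separated points inside the r-ball of p.
            level1 = greedy_separated(ball, dist, r / 2)
            if len(level1) > len(best[0]):
                best = (level1, p, r)
            # Level 2: r/4-separated points inside each r/2-ball.
            for c in level1:
                half_ball = [q for q in ball if dist(c, q) <= r / 2]
                level2 = greedy_separated(half_ball, dist, r / 4)
                if len(level2) > len(best[0]):
                    best = (level2, c, r / 2)
    return best

# Hypothetical example: five equally spaced points on a line.
line_dist = lambda a, b: abs(a - b)
witness, center, radius = find_witness_set([0, 1, 2, 3, 4], line_dist)
```

Returning the center and radius alongside the witness makes the result easy to verify independently of how it was found.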

Bicriteria algorithm
• Recall our problem
  • Given an n-point set S and target density constant m*
  • Remove from S the fewest points so that the remaining set has density constant at most m*
• Our bicriteria solution:
  • Let k be the true answer (the minimum number of points that must be removed).
  • We remove at most kc/(c-1) points and the remaining set has density constant at most c^2 (m*)^2

Bicriteria algorithm
• Algorithm
  • Run the subroutine to identify a witness set of size at least cm*
  • Remove it
  • Repeat
• Analysis
  • The density constant of the resulting set is not greater than c^2 (m*)^2
    • since we terminated without finding a witness set of size at least cm*
  • Every time a witness set of size w ≥ cm* is removed by our algorithm, the optimal algorithm must remove at least w-m* points
    • or else the true solution would have density constant greater than m*
  • It follows that our algorithm removes at most k·w/(w-m*) ≤ kc/(c-1) points
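The remove-and-repeat loop can be sketched as follows. The witness finder here is a simplified greedy stand-in (an assumption, not the subroutine from the theorem), so the sketch shows the control flow only and does not carry the stated approximation guarantees. The names `m_star`, `c`, and all function names are hypothetical.

```python
def greedy_separated(points, dist, sep):
    """Greedily pick points that are pairwise at distance at least sep."""
    chosen = []
    for q in points:
        if all(dist(q, x) >= sep for x in chosen):
            chosen.append(q)
    return chosen

def largest_greedy_witness(points, dist):
    """Stand-in witness finder: two-level greedy search over all balls."""
    best = []
    for p in points:
        for r in sorted({dist(p, q) for q in points if dist(p, q) > 0}):
            ball = [q for q in points if dist(p, q) <= r]
            level1 = greedy_separated(ball, dist, r / 2)
            candidates = [level1]
            for center in level1:
                half_ball = [q for q in ball if dist(center, q) <= r / 2]
                candidates.append(greedy_separated(half_ball, dist, r / 4))
            best = max([best] + candidates, key=len)
    return best

def remove_witness_sets(points, dist, m_star, c):
    """Repeatedly remove any witness set of size at least c * m_star."""
    remaining, removed = list(points), []
    threshold = c * m_star
    while True:
        witness = largest_greedy_witness(remaining, dist)
        if not witness or len(witness) < threshold:
            break  # no sufficiently large witness found: terminate
        removed.extend(witness)
        remaining = [q for q in remaining if q not in witness]
    return remaining, removed

# Hypothetical example: ten equally spaced points, target m* = 2, c = 2.
line_dist = lambda a, b: abs(a - b)
remaining, removed = remove_witness_sets(list(range(10)), line_dist, 2, 2)
```

Each iteration removes at least one point, so the loop terminates after at most n rounds.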

Conclusion
• We conclude that there exists a bicriteria algorithm for the density constant
  • We remove at most kc/(c-1) points and the remaining set has density constant at most c^2 (m*)^2
• It follows that there exists a bicriteria algorithm for the doubling constant
  • We remove at most kc/(c-1) points and the remaining set has doubling constant at most c^4 (λ*)^4
