

  1. Incremental Clustering And Dynamic Information Retrieval By: MOSES CHARIKAR, CHANDRA CHEKURI, TOMAS FEDER, AND RAJEEV MOTWANI Presented By: Sarah Hegab

  2. Outline: • Motivation • Main Problem • Hierarchical Agglomerative Clustering • A Model for Incremental Clustering • Different Incremental Algorithms • Lower Bounds for Incremental Algorithms • Dual Problem

  3. I. Main Problem The clustering problem is as follows: given n points in a metric space M, partition the points into k clusters so as to minimize the maximum cluster diameter.
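To make the objective concrete, here is a minimal Python sketch of the cost being minimized; the helper names and the `dist` callable are illustrative assumptions, not from the paper.

```python
from itertools import combinations

def diameter(cluster, dist):
    """Largest pairwise distance within one cluster (0 for a singleton)."""
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)

def clustering_cost(clusters, dist):
    """The k-clustering objective: the maximum cluster diameter."""
    return max(diameter(c, dist) for c in clusters)
```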

  4. 1. Greedy Incremental Clustering • Center-Greedy • Diameter-Greedy

  5. a) Center-Greedy The center-greedy algorithm associates a center with each cluster and merges the two clusters whose centers are closest. The center of the old cluster with the larger radius becomes the new center. Theorem: The center-greedy algorithm’s performance ratio has a lower bound of 2k − 1.
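One literal reading of this merge rule is sketched below; the dict-based cluster records and the `dist` metric are illustrative assumptions (the incremental loop would add each arriving point as a singleton and call this step whenever more than k clusters exist).

```python
def center_greedy_merge(clusters, dist):
    """One merge step; clusters are dicts with 'center', 'points', 'radius'."""
    # Find the pair of clusters whose centers are closest.
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = dist(clusters[i]["center"], clusters[j]["center"])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    a, b = clusters[i], clusters[j]
    # The center of the old cluster with the larger radius becomes
    # the center of the merged cluster.
    keeper = a if a["radius"] >= b["radius"] else b
    pts = a["points"] + b["points"]
    merged = {"center": keeper["center"],
              "points": pts,
              "radius": max(dist(keeper["center"], p) for p in pts)}
    return [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
```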

  6. a) Center-Greedy cont. Proof: • Step 1: tree construction. (Figure: the tree for k = 2, on vertices v1, ..., v5 with vertex sets S0, S1, S2, S3.)

  7. a) Center-Greedy cont. • Step 2: tree → graph. • Define the sets Ai (in our example Ai = {{v1}, {v2}, {v3}, {v4}}). (Figure: the derived graph on v1, ..., v5 and S0, ..., S3, with edge weights 1, 1−ε1, 1−ε2, 1−ε3.)

  8. Post-Order Traversal (figure)

  9. a) Center-Greedy cont. Claims: • For 1 ≤ i ≤ 2k − 1, Ai is the set of clusters of center-greedy which contain more than one vertex after the k + i vertices v1, ..., v(k+i) are given. • There is a k-clustering of G of diameter 1; the clustering which achieves this diameter is {S0 ∪ S1, ..., S(2k−2) ∪ S(2k−1)}.

  10. (Figure: the construction for k = 4.)

  11. Competitiveness of Center-Greedy Theorem: The center-greedy algorithm has a performance ratio of 2k − 1 in any metric space.

  12. b) Diameter-Greedy The diameter-greedy algorithm always merges the two clusters which minimize the diameter of the resulting merged cluster. Theorem: The diameter-greedy algorithm’s performance ratio is Ω(log k), even on the line.
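A matching sketch of the diameter-greedy rule, under the same illustrative assumptions (clusters here are plain lists of points):

```python
from itertools import combinations

def merged_diameter(a, b, dist):
    """Diameter of the union of two clusters."""
    pts = a + b
    return max((dist(p, q) for p, q in combinations(pts, 2)), default=0.0)

def diameter_greedy_merge(clusters, dist):
    # Merge the two clusters whose union has the smallest diameter.
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda ij: merged_diameter(clusters[ij[0]],
                                              clusters[ij[1]], dist))
    merged = clusters[i] + clusters[j]
    return [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
```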

  13. b) Diameter-Greedy cont. • Proof: 1) Assumptions: Ui = ∪_{j=1..Fi} {{pij, qij}, {rij, sij}}, Vi = ∪_{j=1..Fi} {{qij}, {rij}}, Wi = ∪_{j=1..Fi} {{pij}, {qij, rij}}, Xi = ∪_{j=1..Fi} {{pij}, {qij, rij}, {sij}}, Yi = ∪_{j=1..Fi} {{pij, qij, rij}, {sij}}, Zi = ∪_{j=1..Fi} {{pij, qij, rij, sij}}.

  14. b) Diameter-Greedy cont. • Proof: 2) Invariant: When the last element of Kt is received, diameter-greedy’s k+1 clusters are (∪_{i=1..t−2} Zi) ∪ Y(t−1) ∪ Xt ∪ (∪_{i=t+1..r} Vi). Since there are k+1 clusters, two of the clusters have to be merged, and the algorithm merges two clusters in V(t+1) to form a cluster of diameter (t+1). Without loss of generality, we may assume that the clusters merged are {q(t+1)1} and {r(t+1)1}.

  15. Competitiveness of Diameter-Greedy Theorem: For k = 2, the diameter-greedy algorithm has a performance ratio of 3 in any metric space.

  16. 2. Doubling Algorithm • Deterministic • Randomized • Oblivious • Randomized Oblivious

  17. a) Deterministic doubling algorithm • The algorithm works in phases. • At the start of phase i it has k+1 clusters. • It uses parameters α and β with α/(α−1) ≤ β. • At the start of phase i the following invariants hold: 1. for each cluster Cj, the radius of Cj, defined as max_{p ∈ Cj} d(cj, p), is at most α·di; 2. for each pair of clusters Cj and Cl, the inter-center distance d(cj, cl) is at least di; 3. di ≤ OPT.

  18. a) Deterministic doubling algorithm • Each phase has two stages: 1. a merging stage, in which the algorithm reduces the number of clusters by merging certain pairs; 2. an update stage, in which the algorithm accepts new points and tries to maintain at most k clusters without increasing the radius of the clusters or violating the invariants. A phase ends when the number of clusters exceeds k.

  19. a) Deterministic doubling algorithm • Definition: The t-threshold graph on a set of points P = {p1, p2, ..., pn} is the graph G = (P, E) such that (pi, pj) ∈ E if and only if d(pi, pj) ≤ t. • The merging stage sets d(i+1) = β·di and builds the d(i+1)-threshold graph on the centers c1, ..., c(k+1). • Merging produces new clusters C'1, ..., C'm; if m = k+1, this ends phase i.
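Below is a sketch of the threshold graph straight from the definition, plus one plausible reading of the merging pass (greedily keep a center and absorb all centers within the threshold); the list-based representation is an assumption for illustration.

```python
def threshold_graph(points, t, dist):
    # Edges of the t-threshold graph: (i, j) iff d(p_i, p_j) <= t.
    n = len(points)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if dist(points[i], points[j]) <= t]

def merge_by_threshold(centers, t, dist):
    # Greedy pass over the t-threshold graph: pick a surviving center,
    # absorb all remaining centers within distance t, repeat.
    remaining = list(range(len(centers)))
    groups = []
    while remaining:
        c = remaining.pop(0)
        near = [o for o in remaining if dist(centers[c], centers[o]) <= t]
        groups.append([c] + near)          # c absorbs its neighbours
        remaining = [o for o in remaining if o not in near]
    return groups                          # index groups of merged clusters
```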

  20. a) Deterministic doubling algorithm • Lemma: The pairwise distance between cluster centers after the merging stage of phase i is at least d(i+1). • Lemma: The radius of the clusters after the merging stage of phase i is at most d(i+1) + α·di ≤ α·d(i+1). • The update stage continues while the number of clusters is at most k; it is restricted by the radius bound α·d(i+1). Then phase i ends.

  21. a) Deterministic doubling algorithm • Initialization: the algorithm waits until k+1 points have arrived, then enters phase 1 with each point as the center of a cluster containing just itself, and d1 set to the distance between the closest pair of points.

  22. a) Deterministic doubling algorithm • Lemma: The k+1 clusters at the end of the ith phase satisfy the following conditions: 1. the radius of the clusters is at most α·d(i+1); 2. the pairwise distance between the cluster centers is at least d(i+1); 3. d(i+1) ≤ OPT, where OPT is the diameter of the optimal clustering for the current set of points. Theorem: The doubling algorithm has performance ratio 8 in any metric space.
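Pulling slides 17–22 together, here is a compact sketch of the phase structure under stated assumptions (the stream holds at least k+1 points; bookkeeping of cluster membership is omitted). This is a reading of the slides, not the paper’s reference code.

```python
import math

def doubling_clustering(stream, k, dist,
                        beta=math.e, alpha=math.e / (math.e - 1)):
    it = iter(stream)
    # Initialization: wait for k+1 points; each is its own center.
    centers = [next(it) for _ in range(k + 1)]
    d = min(dist(p, q) for i, p in enumerate(centers) for q in centers[i + 1:])
    for p in it:
        if min(dist(p, c) for c in centers) <= alpha * d:
            continue                 # update stage: p joins an existing cluster
        centers.append(p)            # p opens a new cluster: k+1 clusters again
        while len(centers) > k:
            d = beta * d             # new phase: d_{i+1} = beta * d_i
            kept = []                # merging stage via the d-threshold graph:
            for c in centers:        # keep centers pairwise more than d apart
                if all(dist(c, c2) > d for c2 in kept):
                    kept.append(c)
            centers = kept
    return centers, d
```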

  23. a) Deterministic doubling algorithm • Example showing the analysis is tight, for k ≥ 3: the input consists of k+3 points p1, ..., p(k+3); the points p1, ..., p(k+1) are pairwise at distance 1, and p(k+2), p(k+3) are at distance 4 from the others and 8 from each other.

  24. b) Randomized doubling algorithm • Choose a random value r from [1/e, 1] according to the probability density function 1/r. • Let x be the minimum pairwise distance among the first k+1 points, and set d1 = r·x. • β = e, α = e/(e−1).
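Since the density 1/r on [1/e, 1] has CDF F(r) = 1 + ln r, inverse-transform sampling gives a one-line sampler (an illustrative helper, not from the paper):

```python
import math
import random

def sample_r():
    # Inverse-transform sampling for density 1/r on [1/e, 1]:
    # F(r) = 1 + ln r, so r = exp(u - 1) for u uniform in [0, 1].
    return math.exp(random.random() - 1.0)
```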

  25. b) Randomized doubling algorithm Theorem: The randomized doubling algorithm has expected performance ratio 2e in any metric space. The same bound is also achieved for the radius measure.

  26. c) Oblivious clustering algorithm • Does not need to know k. • Assume we have an upper bound of 1 on the maximum distance between points. • Points are maintained in a tree.

  27. c) Oblivious clustering algorithm cont. • The root is at depth 0. • A vertex at depth i ≥ 1 is within distance 1/2^(i−1) of its parent. • Vertices at the same depth i are at distance greater than 1/2^i from each other.

  28. Illustration of the oblivious clustering algorithm (figure)

  29. c) Oblivious clustering algorithm cont. • How do we obtain the k clusters from the tree? • Given k, let i be the greatest depth containing at most k points; these points are the k cluster centers, and the subtrees rooted at the depth-i vertices are the clusters. • As points are added, the number of vertices at depth i increases; if it goes beyond k, then we change i to i − 1, collapsing certain clusters; otherwise, the new point is inserted into one of the existing clusters.
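A hedged sketch of the tree maintenance from slides 27–29; the `levels` list-of-lists layout is an illustrative assumption, and parent pointers (which define the subtree clusters) are omitted for brevity.

```python
def insert(p, levels, dist):
    # Descend while p is within 1/2^i of some existing depth-i node; the
    # last such node would be p's parent in the full tree (pointer omitted).
    i = 0
    while i < len(levels) and any(dist(p, q) <= 0.5 ** i for q in levels[i]):
        i += 1
    if i == len(levels):
        levels.append([])
    levels[i].append(p)   # p is > 1/2^i from every other depth-i node

def centers_at(levels, k):
    # Slide 29: the greatest depth with at most k nodes gives the k cluster
    # centers; the clusters are the subtrees hanging below those nodes.
    i = max(d for d in range(len(levels)) if len(levels[d]) <= k)
    return levels[i]
```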

  30. c) Oblivious clustering algorithm cont. Theorem: The algorithm that outputs the k clusters obtained from the tree construction has performance ratio 8 for the diameter measure and the radius measure. • Suppose the optimal diameter d satisfies 1/2^(i+1) < d ≤ 1/2^i. • Then points at depth i are in different optimal clusters, so there are at most k of them. • Let j ≥ i be the greatest depth containing at most k points. • Every point in a subtree is within distance 1/2^j + 1/2^(j+1) + 1/2^(j+2) + ··· = 1/2^(j−1) of its depth-j root; since j ≥ i, this is at most 1/2^(i−1) = 4 · 1/2^(i+1) < 4d, so each output cluster has radius < 4d and hence diameter < 8d.

  31. d) Randomized Oblivious • The distance threshold for depth i is r/e^i. • r is chosen once at random from [1, e], according to the PDF 1/r. • The expected diameter is at most 2e · OPT.

  32. Lower Bounds Theorem 1: For k ≥ 2, there is a lower bound of 2 and of 2 − 1/2^(k/2) on the performance ratio of deterministic and randomized algorithms, respectively, for incremental clustering on the line.

  33. Lower Bounds cont. Theorem 2: There is a lower bound of 1 + √2 on the performance ratio of any deterministic incremental clustering algorithm for arbitrary metric spaces.

  34. Lower Bounds cont. (figure)

  35. Lower Bounds cont. Theorem 3: For any ε > 0 and k ≥ 2, there is a lower bound of 2 − ε on the performance ratio of any randomized incremental algorithm.

  36. Lower Bounds cont. Theorem 4: For the radius measure, no deterministic incremental clustering algorithm has a performance ratio better than 3, and no randomized algorithm has a ratio better than 3 − ε for any fixed ε > 0.

  37. II. Dual Problem For a sequence of points p1, p2, ..., pn ∈ R^d, cover each point with a unit ball in R^d as it arrives, so as to minimize the total number of balls used.
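For contrast, the simplest incremental strategy just opens a unit ball on every uncovered arrival; this naive baseline is not the paper’s algorithm (which uses a careful covering-based placement, per the next slide), but it fixes ideas.

```python
import math

def greedy_cover(stream):
    # Naive incremental cover: if an arriving point is not already inside
    # some placed unit ball, center a new unit ball on it.
    centers = []
    for p in stream:                    # p is a tuple of d coordinates
        if all(math.dist(p, c) > 1.0 for c in centers):
            centers.append(p)
        # else: p is already covered; do nothing
    return centers
```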

  38. II. Dual Problem Rogers’ Theorem: R^d can be covered by any convex shape with covering density O(d log d). Theorem: For the dual clustering problem in R^d, there is an incremental algorithm with performance ratio O(2^d · d log d). Theorem: For the dual clustering problem in R^d, any incremental algorithm must have performance ratio Ω(log d / log log log d).

  39. Thank You
