Stability Yields a PTAS for k-Median and k-Means Clustering

Pranjal Awasthi, Avrim Blum, Or Sheffet
Carnegie Mellon University
November 3rd, 2010

Stability Yields a PTAS for k-Median and k-Means Clustering
• Introduce the k-Median / k-Means problems.
• Define stability
  • Previous notion [ORSS06]
  • Weak Deletion Stability
  • β-distributed instances
• The algorithm for k-Median
• Conclusion + open problems.

Clustering in Real Life
Clustering: come up with a desired partition

Clustering in a Metric Space
Clustering: come up with a desired partition
Input:
• n points
• A distance function d: n × n → ℝ≥0 satisfying:
  • Reflexivity: ∀p, d(p,p) = 0
  • Symmetry: ∀p,q, d(p,q) = d(q,p)
  • Triangle inequality: ∀p,q,r, d(p,q) ≤ d(p,r) + d(r,q)
• k-partition
k is large, e.g. k = polylog(n)

k-Median
• Input:
  1. n points in a finite metric space
  2. k
• Goal:
  • Partition into k disjoint subsets: C*1, C*2, …, C*k
  • Choose a center per subset
  • Cost: cost(C*i) = Σ_{x∈C*i} d(x, c*i)
  • Cost of partition: Σ_i cost(C*i)
• Given centers ⇒ easy to get the best partition
• Given partition ⇒ easy to get the best centers
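The last two bullets of the k-Median slide can be sketched in a few lines. This is a minimal illustration, not part of the talk: the helper names (`assign_to_centers`, `best_center`, `kmedian_cost`) and the toy points on a line are assumptions made for the example, and candidate centers are restricted to input points.

```python
def assign_to_centers(points, centers, d):
    """Given centers, the best partition sends every point to its nearest center."""
    clusters = {c: [] for c in centers}
    for x in points:
        nearest = min(centers, key=lambda c: d(x, c))
        clusters[nearest].append(x)
    return clusters


def best_center(cluster, candidates, d):
    """Given a cluster, the best center minimizes the sum of distances to its points."""
    return min(candidates, key=lambda c: sum(d(x, c) for x in cluster))


def kmedian_cost(clusters, d):
    """Cost of a partition: sum over clusters of the distances to the cluster's center."""
    return sum(d(x, c) for c, members in clusters.items() for x in members)


# Toy metric (hypothetical data): points on a line, distance = absolute difference.
points = [0, 1, 2, 10, 11, 12]
dist = lambda p, q: abs(p - q)

clusters = assign_to_centers(points, [1, 11], dist)
print(clusters)                              # {1: [0, 1, 2], 11: [10, 11, 12]}
print(kmedian_cost(clusters, dist))          # 4
print(best_center([0, 1, 2], points, dist))  # 1
```

Both directions are cheap; the hard part of the problem is choosing the centers and the partition at the same time.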

k-Means
• Input:
  1. n points in Euclidean space
  2. k
• Goal:
  • Partition into k disjoint subsets: C*1, C*2, …, C*k
  • Choose a center per subset
  • Cost: cost(C*i) = Σ_{x∈C*i} d²(x, c*i)
  • Cost of partition: Σ_i cost(C*i)
• Given centers ⇒ easy to get the best partition
• Given partition ⇒ easy to get the best centers
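For k-Means, "given partition ⇒ best centers" has a closed form: the centroid (coordinate-wise mean) of a cluster minimizes the sum of squared Euclidean distances to its points. The sketch below evaluates the k-Means cost of a fixed partition that way; the function names and the sample points are assumptions chosen for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty cluster; the optimal k-Means center."""
    dim = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))


def kmeans_cost(clusters):
    """Sum over clusters of squared distances from each point to the cluster centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(sq_dist(x, c) for x in cluster)
    return total


# Hypothetical partition of four points in the plane.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (10.0, 2.0)]]
print(centroid(clusters[0]))  # (1.0, 0.0)
print(kmeans_cost(clusters))  # 4.0
```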

We Would Like To…
• Solve k-median / k-means problems.
  • NP-hard to get OPT (= cost of optimal partition)
• Find a c-approximation algorithm
  • A poly-time algorithm guaranteed to output a clustering whose cost ≤ c·OPT
• Ideally, find a PTAS (Polynomial Time Approximation Scheme)
  • Get a c-approximation algorithm where c = (1+ε), for any ε > 0.
  • Runtime can be exp(1/ε)
[Figure: cost axis starting at 0, marking OPT and algorithms with cost at most 1.1·OPT, 1.5·OPT, 2·OPT and c·OPT]

Related Work
Small k:
• k-Median: easy (try all centers) in time n^k
• k-Means: PTAS, exponential in (k/ε) [KSS04]
General k:
• k-Median: (3+ε)-apx [GK98, CGTS99, AGKMMP01, JMS02, dlVKKR03]; (1.367...)-apx hardness [GK98, JMS02]
• k-Means: 9-apx [OR00, BHPI02, dlVKKR03, ES04, HPM04, KMNPSW02]
• No PTAS!
Special case:
• Euclidean k-Median [ARR98], PTAS if dimension is small (loglog(n)^c)
• [ORSS06]
• We focus on large k (e.g. k = polylog(n))
• Runtime goal: poly(n,k)

World
All possible instances

ORSS Result (k-Means)

ORSS Result (k-Means)
Why use 5 sites?

ORSS Result (k-Means)
• Instance is stable if OPT(k-1) > (1/α)² OPT(k) (require 1/α > 10)
• Give a (1+O(α))-approximation.
Our Result (k-Means)
• Instance is stable if OPT(k-1) > (1+α) OPT(k) (require α > 0)
• Give a PTAS ((1+ε)-approximation).
• Runtime: poly(n,k) exp(1/α, 1/ε)

Philosophical Note
• Stable instances: ∃α > 0 s.t. OPT(k-1) > (1+α) OPT(k)
• Not stable instances: ∀α > 0, OPT(k-1) ≤ (1+α) OPT(k)
  • A (1+α)-approximation can return a (k-1)-clustering.
  • Any PTAS can return a (k-1)-clustering.
  • It is not a k-clustering problem,
  • It is a (k-1)-clustering problem!
• If we believe our instance inherently has k clusters: stability is a “necessary condition” to guarantee that a PTAS will return a “meaningful” clustering.
• Our result: it’s a sufficient condition to get a PTAS.
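On a tiny instance the stability condition OPT(k-1) > (1+α) OPT(k) can be checked directly. The sketch below is only an illustration under stated assumptions: it uses the k-median objective with centers restricted to input points, computes OPT by brute force (exponential in k, so toy sizes only), and the names `opt_kmedian` and `is_stable` are hypothetical.

```python
from itertools import combinations


def opt_kmedian(points, k, d):
    """Brute-force OPT(k): try every set of k input points as centers (k >= 1)."""
    return min(
        sum(min(d(x, c) for c in centers) for x in points)
        for centers in combinations(points, k)
    )


def is_stable(points, k, alpha, d):
    """Check OPT(k-1) > (1 + alpha) * OPT(k); requires k >= 2."""
    return opt_kmedian(points, k - 1, d) > (1 + alpha) * opt_kmedian(points, k, d)


# Hypothetical instance: two well-separated groups on a line.
points = [0, 1, 2, 10, 11, 12]
dist = lambda p, q: abs(p - q)

print(opt_kmedian(points, 2, dist))     # 4
print(opt_kmedian(points, 1, dist))     # 30
print(is_stable(points, 2, 1.0, dist))  # True:  30 > 2 * 4
print(is_stable(points, 2, 7.0, dist))  # False: 30 <= 8 * 4
```

Here dropping from 2 to 1 center multiplies the cost by 7.5, so the instance is stable for every α < 6.5.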

World
All possible instances
ORSS Stable: any (k-1)-clustering is significantly costlier than OPT(k)

A Weaker Guarantee
Why use 5 sites?

(1+α)-Weak Deletion Stability
• Consider OPT(k).
• Take any cluster C*i and associate its points with another center c*j.
• This increases the cost to at least (1+α)OPT(k).
• An obvious relaxation of ORSS-stability.
• Our result: suffices to get a PTAS.
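The definition translates into a direct check once an optimal clustering is in hand. The sketch below assumes the clustering passed in really is optimal (it does not verify that), represents it as a hypothetical dict from center to members, and follows the slide's "at least" by using ≥.

```python
def cost_after_deletion(clusters, i, j, d):
    """Cost after deleting center i and sending all of cluster i's points to center j."""
    total = 0
    for center, members in clusters.items():
        target = j if center == i else center
        total += sum(d(x, target) for x in members)
    return total


def is_weakly_deletion_stable(clusters, alpha, d):
    """True if every reassignment of one cluster to another center costs >= (1+alpha)*OPT."""
    opt = sum(d(x, c) for c, members in clusters.items() for x in members)
    return all(
        cost_after_deletion(clusters, i, j, d) >= (1 + alpha) * opt
        for i in clusters
        for j in clusters
        if i != j
    )


# Hypothetical optimal 2-median clustering of points on a line (OPT = 4).
optimal = {1: [0, 1, 2], 11: [10, 11, 12]}
dist = lambda p, q: abs(p - q)

print(cost_after_deletion(optimal, 1, 11, dist))      # 32
print(is_weakly_deletion_stable(optimal, 5.0, dist))  # True:  32 >= 6 * 4
print(is_weakly_deletion_stable(optimal, 8.0, dist))  # False: 32 < 9 * 4
```

Only the k(k-1) ordered pairs of optimal clusters are examined, which is why this condition is weaker than requiring every (k-1)-clustering to be expensive.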

World
All possible instances
ORSS Stable
Weak-Deletion Stable: merging any two clusters in OPT(k) increases the cost significantly

β-Distributed Instances
For every cluster C*i, and every p not in C*i, we have: d(p, c*i) ≥ β·(OPT/|C*i|)
We show that:
• k-median: (1+α)-weak deletion stability ⇒ (α/2)-distributed.
• k-means: (1+α)-weak deletion stability ⇒ (α/4)-distributed.

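Claim: (1+α)-Weak Deletion Stability ⇒ (α/2)-Distributed
Fix a cluster C*i and a point p ∉ C*i, and let C*j be the cluster of p, so d(p, c*j) ≤ d(p, c*i).
α·OPT ≤ Σ_{x∈C*i} d(x, c*j) − Σ_{x∈C*i} d(x, c*i)
      ≤ Σ_{x∈C*i} [d(x, c*i) + d(c*i, c*j)] − Σ_{x∈C*i} d(x, c*i)
      = Σ_{x∈C*i} d(c*i, c*j) = |C*i|·d(c*i, c*j)
⇒ α(OPT/|C*i|) ≤ d(c*i, c*j) ≤ d(c*i, p) + d(p, c*j) ≤ 2d(c*i, p)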
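The conclusion of the derivation, d(c*i, p) ≥ (α/2)·OPT/|C*i| for every p outside C*i, suggests a simple numeric check: compute the largest β for which a given clustering satisfies d(p, c*i) ≥ β·OPT/|C*i|. The sketch below does that for a toy k-median instance; `largest_beta` and the data are assumptions for illustration, and the clustering is taken to be optimal with OPT > 0.

```python
def largest_beta(clusters, d):
    """Largest beta with d(p, c_i) >= beta * OPT / |C_i| for all clusters i and all p outside C_i."""
    opt = sum(d(x, c) for c, members in clusters.items() for x in members)
    all_points = [x for members in clusters.values() for x in members]
    return min(
        d(p, center) * len(members) / opt
        for center, members in clusters.items()
        for p in all_points
        if p not in members
    )


# Hypothetical optimal 2-median clustering of points on a line (OPT = 4).
optimal = {1: [0, 1, 2], 11: [10, 11, 12]}
dist = lambda p, q: abs(p - q)

print(largest_beta(optimal, dist))  # 6.75
```

In this instance, reassigning either cluster to the other center raises the cost from 4 to 32 = (1+7)·4, so it is (1+α)-weakly deletion stable for α = 7. The claim then guarantees it is 3.5-distributed, which agrees with the computed value 6.75 ≥ 3.5.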

World
All possible instances
ORSS Stable
Weak-Deletion Stable
β-Distributed: in the optimal solution, there is a large distance between a center and any “outside” point

Main Result
• We give a PTAS for β-distributed k-median and k-means instances.
• Running time:
• There are NP-hard β-distributed instances. (Superpolynomial dependence on 1/ε is unavoidable!)

Stability Yields a PTAS for k-Median and k-Means Clustering
• Introduce the k-Median / k-Means problems.
• Define stability
• PTAS for k-Median
  • High level description
  • Intuition (“had only we known more…”)
  • Description
• Conclusion + open problems.

k-Median Algorithm’s Overview
Input: Metric, k, β, OPT
0. Handle “extreme” clusters (brute-force guessing of some clusters’ centers)
1. Populate L with components
2. Pick best center in each component
3. Try all possible k-centers
L := List of “suspected” clusters’ “cores”

k-Median Algorithm’s Overview
Input: Metric, k, β, OPT
0. Handle “extreme” clusters (brute-force guessing of some clusters’ centers)
1. Populate L with components
2. Pick best center in each component
3. Try all possible k-centers
What is needed:
• Right definition of “core”.
• Get the core of each cluster.
• L can’t get too big.

Intuition: “Mind the Gap”
• We know: every p ∉ C*i has d(p, c*i) ≥ β·(OPT/|C*i|)
• In contrast, an “average” cluster contributes: OPT/k
• So for an “average” point p, in an “average” cluster C*i, d(p, c*i) ≈ OPT/(k·|C*i|)

Intuition: “Mind the Gap”
• We know: every p ∉ C*i has d(p, c*i) ≥ β·(OPT/|C*i|)
• Denote the core of a cluster C*i

Intuition: “Mind the Gap”
• We know: every p ∉ C*i has d(p, c*i) ≥ β·(OPT/|C*i|)
• Denote the core of a cluster C*i
• Formally, call cluster C*i cheap if
• Assume all clusters are cheap.
• In general: we brute-force guess O(1/βε) centers of expensive clusters in Stage 0.

Intuition: “Mind the Gap”
• We know: every p ∉ C*i has d(p, c*i) ≥ β·(OPT/|C*i|)
• Denote the core of a cluster C*i
• Formally, call cluster C*i cheap if
• Markov: at most an (ε/4) fraction of the points of a cheap cluster lie outside the core.

Intuition: “Mind the Gap”
• We know: every p ∉ C*i has d(p, c*i) ≥ β·(OPT/|C*i|)
• Denote the core of a cluster C*i
• Formally, call cluster C*i cheap if
• Markov: at least half of the points of a cheap cluster lie inside the core.

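Magic (r/4) Ball
Denote r = β(OPT/|C*i|).
If p belongs to the core ⇒ B(p, r/4) contains ≥ |C*i|/2 pts.
“Heavy”: mass ≥ |C*i|/2
[Figure: center c*i with the core at distance ≤ r/8, outside points at distance > r, and the ball of radius r/4 around a core point p]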
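The heaviness test can be sketched as below. The value of r, the cluster, and the helper names are all hypothetical, and the core is taken to be the points within r/8 of the center, which is one reading of the figure labels rather than a definition preserved in the transcript.

```python
def ball(points, p, radius, d):
    """All points within the given radius of p (p itself included)."""
    return [q for q in points if d(p, q) <= radius]


def is_heavy(points, p, r, cluster_size, d):
    """B(p, r/4) is heavy if it contains at least half as many points as the cluster has."""
    return len(ball(points, p, r / 4, d)) >= cluster_size / 2


# Hypothetical data: cluster C*i = [0, 1, 2, 7] with center 1, another cluster far away.
points = [0, 1, 2, 7, 100, 101, 102, 107]
cluster = [0, 1, 2, 7]
dist = lambda p, q: abs(p - q)
r = 8  # stands in for beta * OPT / |C*i|; chosen by hand for the example

core = [p for p in cluster if dist(p, 1) <= r / 8]
print(core)                                                           # [0, 1, 2]
print([p for p in cluster if is_heavy(points, p, r, len(cluster), dist)])  # [0, 1, 2]
```

Every core point is within r/8 of the center, so its r/4 ball contains the whole core. That is why core points are heavy once the core holds at least half of the cluster.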

Magic (r/4) Ball
Denote r = β(OPT/|C*i|).
Draw a ball of radius r/4 around all points. Unite “heavy” balls whose centers overlap.
All points in the core are merged into one set!

Magic (r/4) Ball
Denote r = β(OPT/|C*i|).
Draw a ball of radius r/4 around all points. Unite “heavy” balls whose centers overlap.
Could we merge core pts with pts from other clusters?

Magic (r/4) Ball
Denote r = β(OPT/|C*i|).
Draw a ball of radius r/4 around all points. Unite “heavy” balls whose centers overlap.
For a merged ball center p: r/2 ≤ d(p, c*i) ≤ 3r/4
For any x in B(p, r/4): r/4 = r/2 − r/4 ≤ d(x, c*i) ≤ 3r/4 + r/4 = r

Magic (r/4) Ball
Denote r = β(OPT/|C*i|).
Draw a ball of radius r/4 around all points. Unite “heavy” balls whose centers overlap.
r/4 ≤ d(x, c*i) ≤ r
• x falls outside the core
• x belongs to C*i

Magic (r/4) Ball
Denote r = β(OPT/|C*i|).
Draw a ball of radius r/4 around all points. Unite “heavy” balls whose centers overlap.
r/4 ≤ d(x, c*i) ≤ r
More than |C*i|/2 pts fall outside the core! Contradiction (⇒⇐)

Finding the Right Radius
Denote r = β(OPT/|C*i|).
Draw a ball of radius r/4 around all points. Unite “heavy” balls whose centers overlap.
• Problem: we don’t know |C*i|
• Solution: try all sizes, in order!
  • Set s = n, n-1, n-2, …, 1
  • Set r_s = β(OPT/s)
• Complication:
  • When s gets small (s = 4, 3, 2, 1) we collect many “leftovers” of one cluster.
• Solution: once we add a subset to L, we remove close-by points.

Population Stage
• Set s = n, n-1, n-2, …, 1
  • Set r_s = β(OPT/s)
  • Draw a ball of radius r_s/4 around each point
  • Unite balls containing ≥ s/2 pts whose centers overlap
  • Once a set of size ≥ s/2 is found
    • Put this set in L
    • Remove all points in an (r_s/2)-“buffer zone” around the set just added to L.
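The population stage can be sketched as follows. This is a reading of the slide under stated assumptions, not the authors' implementation: two heavy balls are united when their centers lie within r_s/4 of each other, a united set is the set of ball centers, the buffer zone removes every remaining point within r_s/2 of the set, and points are distinct and hashable. Stage 0 and the later center-selection steps are omitted, and the function names are hypothetical.

```python
def find_heavy_component(remaining, r, s, d):
    """Return one united set of heavy ball centers with size >= s/2, or None."""
    # B(p, r/4) is heavy when it holds at least s/2 of the remaining points.
    heavy = [p for p in remaining
             if sum(1 for q in remaining if d(p, q) <= r / 4) >= s / 2]
    unvisited = set(heavy)
    for start in heavy:
        if start not in unvisited:
            continue
        unvisited.remove(start)
        component, frontier = {start}, [start]
        while frontier:  # graph search over heavy centers linked at distance <= r/4
            p = frontier.pop()
            linked = [q for q in unvisited if d(p, q) <= r / 4]
            unvisited.difference_update(linked)
            component.update(linked)
            frontier.extend(linked)
        if len(component) >= s / 2:
            return [p for p in remaining if p in component]
    return None


def populate(points, beta, opt, d):
    """Build the list L of suspected cluster cores, trying sizes s = n, n-1, ..., 1."""
    remaining, L = list(points), []
    for s in range(len(points), 0, -1):
        r = beta * opt / s
        while True:
            component = find_heavy_component(remaining, r, s, d)
            if component is None:
                break
            L.append(component)
            # Drop the set itself and its (r/2)-buffer zone from further consideration.
            remaining = [p for p in remaining
                         if min(d(p, q) for q in component) > r / 2]
    return L


# Hypothetical instance: two clusters on a line, OPT = 4 for k = 2.
points = [0, 1, 2, 100, 101, 102]
dist = lambda p, q: abs(p - q)
print(populate(points, beta=8, opt=4, d=dist))  # [[0, 1, 2], [100, 101, 102]]
```

In this run nothing qualifies for s = 6 or s = 5; at s = 4 (r_s = 8) both groups are found, and removing their buffer zones leaves no points for smaller s to pick up as leftovers.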
