1 / 70

Different Perspectives at Clustering: The “Number-of-Clusters” Case


Presentation Transcript


  1. Different Perspectives at Clustering: The “Number-of-Clusters” Case. B. Mirkin, School of Computer Science, Birkbeck College, University of London. IFCS 2006

  2. Different Perspectives at Number of Clusters: Talk Outline • Clustering and K-Means: a discussion • Clustering goals and four perspectives • Number of clusters in: - the classical statistics perspective - the machine learning perspective - the data mining perspective (including a simulation study with 8 methods) - the knowledge discovery perspective (including a comparative genomics project)

  3. WHAT IS CLUSTERING; WHAT IS DATA • K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Interpretation Aids • WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering • DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering • DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters • GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability

  4. Example: W. Jevons (1835-1882), updated in Mirkin 1996: Pluto doesn’t fit in the two clusters of planets

  5. Example: A Few Clusters. Clustering interface to Web search engines (Grouper). Query: Israel (after O. Zamir and O. Etzioni 2001)

  6. Clustering: Main Steps • Data collecting • Data pre-processing • Finding clusters (the only step appreciated in conventional clustering) • Interpretation • Drawing conclusions

  7. Conventional Clustering: Cluster Algorithms • Single Linkage: Nearest Neighbour • Ward Agglomeration • Conceptual Clustering • K-means • Kohonen SOM • ………………….

  8. K-Means: a generic clustering method. Entities are represented as multidimensional points (*). 0. Put K hypothetical centroids (seeds); 1. Assign points to the centroids according to the minimum-distance rule; 2. Put centroids at the gravity centres of the clusters thus obtained; 3. Iterate 1 and 2 until convergence. K = 3 hypothetical centroids (@). [Figure: scatter of points (*) with three seed centroids (@)]

  9.–10. (Same scheme, shown as an animation: points are assigned to the nearest centroid, then centroids move to the gravity centres of their clusters.) [Figures: successive K-Means iterations over the same scatter]

  11. Same scheme, with the final step added: 4. Output final centroids and clusters. [Figure: converged centroids (@) inside their clusters]
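The scheme above maps almost line for line onto code. A minimal NumPy sketch follows (the function name k_means and all variable names are mine, not from the talk); it seeds with K random entities, whereas later slides argue for anomalous-pattern seeding instead:

    import numpy as np

    def k_means(X, K, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # 0. Put K hypothetical centroids (seeds): here, K random entities
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(max_iter):
            # 1. Assign points to centroids by the minimum-distance rule
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # 2. Put centroids at the gravity centres of the clusters;
            #    an emptied cluster keeps its old centroid (a case the slide leaves open)
            new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                            else centroids[k] for k in range(K)])
            # 3. Iterate 1 and 2 until convergence
            if np.allclose(new, centroids):
                break
            centroids = new
        # 4. Output final centroids and clusters
        return centroids, labels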

  12. Advantages of K-Means • Conventional: models typology building; computationally efficient; can be incremental, ‘on-line’ • Unconventional: associates feature salience with feature scales and correlation/association; applicable to mixed-scale data

  13. Drawbacks of K-Means • No advice on: data pre-processing, number of clusters, initial setting • Instability of results • Criterion can be inadequate • Insufficient interpretation aids

  14. Initial Centroids: Correct. Two-cluster case. [Figure: two clusters with correctly placed initial centroids]

  15. Initial Centroids: Correct. [Figure: initial and final centroid positions]

  16. Different Initial Centroids. [Figure: another choice of initial centroids]

  17. Different Initial Centroids: Wrong, even though they lie in different clusters. [Figure: initial and final centroid positions]

  18. Two types of goals (with no clear-cut borderline) • Engineering goals • Data analysis goals

  19. Engineering goals (examples) • Devising a market segmentation to minimise promotion and advertising expenses • Dividing a large scheme into modules to minimise the cost • Organisation structure design

  20. Data analysis goals (examples) • Recovery of the distribution function • Prediction • Revealing patterns in data • Enhancing knowledge with additional concepts and regularities. Each of these is realised in a different perspective on clustering

  21. Clustering Perspectives • Classical statistics: recovery of a multimodal distribution function • Machine learning: prediction • Data mining: revealing patterns in data • Knowledge discovery: additional concepts and regularities

  22. Clustering Perspectives at # Clusters • Classical statistics: As many as meaningful modes (mixture items) • Machine learning: As many as needed for acceptable prediction • Data mining: As many as meaningful patterns in data (including incomplete clustering) • Knowledge discovery: As many as needed to produce concepts and regularities adequate to the domain

  23. Main Sources for Deriving # Clusters • Classical statistics: Model of the world • Machine learning: Cost & Accuracy Trade Off • Data mining: Data • Knowledge discovery: Domain knowledge

  24. Classical Statistics Perspective • There must be a model of data generation • E.g., Mixture of Gaussians • The task: identify all parameters of the model by using observed data • E.g., The number of Gaussians and their probabilities, means and covariances

  25. Mixture of 3 Gaussian densities

  26. Classical statistics perspective on K-Means. K-Means assumes no explicit model of data generation, but it can be viewed as a maximum likelihood method with spherical Gaussians of the same variance: within a cluster, all variables are independent and Gaussian with the same cluster-independent variance (z-scoring is a must then); the issue of the number of clusters can then be approached with conventional approaches to hypothesis testing
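One conventional way to act on this perspective is to fit a mixture of spherical Gaussians for each candidate number of clusters and compare the fits with a model-selection criterion such as BIC. A sketch using scikit-learn's GaussianMixture (my choice of tool; the talk does not prescribe one):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def pick_k_by_bic(X, k_range=range(1, 11)):
        # 'spherical' fits one variance per component; the slide's stricter
        # cluster-independent variance has no direct scikit-learn option
        bics = [GaussianMixture(n_components=k, covariance_type='spherical',
                                random_state=0).fit(X).bic(X)
                for k in k_range]
        return list(k_range)[int(np.argmin(bics))]  # lowest BIC wins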

  27. Machine learning perspective • Clusters should be of help in learning data incrementally generated • The number should be specified by the trade-off between accuracy and cost • A criterion should guarantee partitioning of the feature space with clearly separated high density areas; • A method should be proven to be consistent with the criterion on the population

  28. Machine learning on K-Means • The number of clusters: to be specified according to prediction goals • Pre-processing: no advice • An incremental version of K-Means converges to a minimum of the summary within-cluster variance, under conventional assumptions of data generation (MacQueen 1967 is the major reference, though the method can be traced back a decade or two earlier)

  29. Data mining perspective

  30. Data recovery framework for data mining methods • Types of data: similarity, temporal, entity-to-feature, co-occurrence • Types of model: regression, principal components, clusters • Model: Data = Model_Derived_Data + Residual • Pythagoras: Data² = Model_Derived_Data² + Residual² (the squared terms taken as data scatters) • The better the fit, the better the model

  31. K-Means as a data recovery method

  32. Representing a partition. Cluster k: centroid c_kv (v – feature); binary 1/0 membership z_ik (i – entity)

  33. Basic equations (analogous to PCA): y_iv = Σ_k c_kv·z_ik + e_iv, so that the data scatter decomposes as Σ_iv y_iv² = Σ_k N_k·Σ_v c_kv² + Σ_iv e_iv², where y is a data entry, z a 1/0 membership, c a cluster centroid, N_k a cluster cardinality; i indexes entities, v features/categories, k clusters
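Because centroids are the within-cluster means, this decomposition is exact and easy to verify numerically. A small NumPy check (function and variable names are mine); it assumes the data matrix has already been pre-processed as the model requires:

    import numpy as np

    def scatter_decomposition(X, labels):
        T = (X ** 2).sum()                 # total data scatter
        B = 0.0                            # explained part: sum_k N_k * ||c_k||^2
        W = 0.0                            # residual part: within-cluster scatter
        for k in np.unique(labels):
            Xk = X[labels == k]
            ck = Xk.mean(axis=0)           # centroid = gravity centre
            B += len(Xk) * (ck ** 2).sum()
            W += ((Xk - ck) ** 2).sum()
        assert np.isclose(T, B + W)        # the Pythagorean identity
        return T, B, W

The ratio B/T then serves as a natural "explained scatter" score for the interpretation aids on the following slides.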

  34. Meaning of Data scatter • The sum of contributions of features – the basis for feature pre-processing (dividing by range, not std) • Proportional to the summary variance

  35. Contribution of a feature F to a partition, Contrib(F): proportional to • the correlation ratio η² if F is quantitative • a contingency coefficient between the cluster partition and F if F is nominal: Pearson chi-square (Poisson-normalised) or Goodman-Kruskal tau-b (range-normalised)

  36. Contribution of a quantitative feature F to a partition: proportional to the correlation ratio η²

  37. Contribution of a nominal feature to a partition: proportional to a contingency coefficient • Pearson chi-square (Poisson-normalised) • Goodman-Kruskal tau-b (range-normalised), with category scaling B_j = 1

  38. Pythagorean Decomposition of data scatter for interpretation

  39. Contribution-based description of clusters • C. Dickens: FCon = 0 • M. Twain: LenD < 28 • L. Tolstoy: NumCh > 3 or Direct = 1

  40. Principal Cluster Analysis (Anomalous Pattern) Method: y_iv = c_v·z_i + e_iv, where z_i = 1 if i ∈ S and z_i = 0 if i ∉ S. With Euclidean distance squared, c_S must be anomalous, that is, interesting
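A minimal sketch of the Anomalous Pattern idea as I read this slide; the details filled in, such as the reference point being the grand mean and the seed being the entity farthest from it, are assumptions rather than part of the slide:

    import numpy as np

    def anomalous_pattern(X):
        ref = X.mean(axis=0)                     # assumed reference point
        d_ref = ((X - ref) ** 2).sum(axis=1)     # squared distances to reference
        c = X[d_ref.argmax()]                    # farthest entity seeds S
        while True:
            # z_i = 1 iff entity i is closer to c than to the reference
            in_S = ((X - c) ** 2).sum(axis=1) < d_ref
            new_c = X[in_S].mean(axis=0)         # update the centroid c_S
            if np.allclose(new_c, c):
                return in_S, c                   # anomalous cluster and centroid
            c = new_c

Repeating this extraction on the not-yet-clustered entities yields the Anomalous Single Clusters that initialise iK-Means on the following slides.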

  41. Initial setting with Anomalous Single Cluster for iK-Means. [Figure: ‘Tom Sawyer’ scatter with the anomalous-pattern seed]

  42. iK-Means with Anomalous Single Clusters. [Figure: ‘Tom Sawyer’ scatter with extracted anomalous clusters 1, 2, 3 and main body 0]

  43. Anomalous clusters + K-Means, after extracting 2 clusters (how can one know that 2 is right?). [Figure: final partition]

  44. Simulation study of 8 methods (joint work with Mark Chiang). Number-of-clusters methods: • Variance-based: Hartigan (HK), Calinski & Harabasz (CH), Jump Statistic (JS) • Structure-based: Silhouette Width (SW) • Consensus-based: Consensus Distribution area (CD), Consensus Distribution mean (DD) • Sequential extraction of APs: Least Squares (LS), Least Moduli (LM)
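For the variance- and structure-based families, scikit-learn ships ready-made indices; below is a sketch (names are mine) that scores K-Means partitions over a range of K by Calinski & Harabasz and Silhouette Width. The consensus-based and AP-based methods of the study need more machinery than fits here:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score, silhouette_score

    def score_k_range(X, k_range=range(2, 11)):
        scores = {}
        for k in k_range:
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=0).fit_predict(X)
            scores[k] = (calinski_harabasz_score(X, labels),  # CH: higher is better
                         silhouette_score(X, labels))         # SW: higher is better
        return scores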

  45. Data generation for the experiment • Gaussian Mixture (6, 7, 9 clusters) with: • Cluster spatial size: constant (spherical), k-proportional, or k²-proportional • Cluster spread (distance between centroids)
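A minimal sketch of one way to generate such a mixture; the dimensionality, cluster sizes, and spread constant below are illustrative assumptions, not the settings of the study:

    import numpy as np

    def generate_mixture(K=7, dim=15, n_per_cluster=100, spread=2.0,
                         size_mode='constant', seed=0):
        rng = np.random.default_rng(seed)
        X, labels = [], []
        for k in range(1, K + 1):
            centre = rng.normal(scale=spread, size=dim)  # controls centroid spread
            sigma = {'constant': 1.0,                    # spherical, equal sizes
                     'k-proportional': float(k),         # size grows with k
                     'k2-proportional': float(k ** 2)}[size_mode]
            X.append(rng.normal(loc=centre, scale=sigma,
                                size=(n_per_cluster, dim)))
            labels += [k] * n_per_cluster
        return np.vstack(X), np.array(labels)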

  46. Evaluation of results: Estimated clustering versus that generated • Number of clusters • Distance between centroids • Similarity between partitions

  47. Distance between estimated centroids (e) and generated ones (g) • Prime assignment: each generated centroid is matched with one estimated centroid: g1 – e2, g2 – e4, g3 – e5. [Figure: generated centroids G1(p1), G2(p2), G3(p3) and estimated centroids e1(q1) through e5(q5)]

  48. Distance between estimated centroids (e) and generated ones (g) • Final assignment: the remaining estimated centroids are attached as well: g1 – e2, e1; g2 – e4, e3; g3 – e5. [Figure: the same centroids with the completed matching]

  49. Distance between centroids: quadratic and city-block • 1. Assignment: g1(p1) – e1(q1), e2(q2); g2(p2) – e3(q3), e4(q4); g3(p3) – e5(q5) • 2. Distancing: d1 = (q1·d(g1,e1) + q2·d(g1,e2))/(q1+q2); d2 = (q3·d(g2,e3) + q4·d(g2,e4))/(q3+q4); d3 = q5·d(g3,e5)/q5 = d(g3,e5)

  50. Distance between centroids: quadratic and city-block • 1. Assignment • 2. Distancing • 3. Averaging: p1·d1 + p2·d2 + p3·d3
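Putting slides 47-50 together, a sketch of this between-centroid distance (function and argument names are mine). For brevity it collapses the two-step prime/final matching into a single nearest-centroid assignment of each estimated centroid; the slides' two-step version additionally guarantees that every generated centroid receives at least one estimated centroid, whereas this sketch simply skips unmatched ones:

    import numpy as np

    def centroid_distance(G, p, E, q, metric='quadratic'):
        # G, E: generated/estimated centroids (one per row);
        # p, q: the corresponding cluster proportions (NumPy arrays)
        if metric == 'quadratic':
            d = ((G[:, None, :] - E[None, :, :]) ** 2).sum(axis=2)
        else:                                   # 'city-block'
            d = np.abs(G[:, None, :] - E[None, :, :]).sum(axis=2)
        assign = d.argmin(axis=0)               # 1. Assignment: nearest g for each e
        total = 0.0
        for k in range(len(G)):
            mask = assign == k
            if mask.any():
                dk = (q[mask] * d[k, mask]).sum() / q[mask].sum()  # 2. Distancing
                total += p[k] * dk                                 # 3. Averaging
        return total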
