
Clustering (1)




  1. Clustering (1) Clustering Similarity measure Hierarchical clustering Model-based clustering Figures from the book Data Clustering by Gan et al.

  2. Clustering Objects in a cluster should share closely related properties, have small mutual distances, and be clearly distinguishable from objects not in the same cluster. A cluster should be a densely populated region surrounded by relatively empty regions. Compact cluster --- can be represented by a center. Chained cluster --- exhibits higher-order structure.

  3. Clustering

  4. Clustering The process of clustering

  5. Clustering Types of clustering:

  6. Similarity measures A distance function d should satisfy: d(x, y) >= 0; d(x, y) = 0 if and only if x = y; d(x, y) = d(y, x); and the triangle inequality d(x, z) <= d(x, y) + d(y, z).

  7. Similarity measures Similarity function:

  8. Similarity measures From a dataset, we can derive a distance matrix and a similarity matrix:

  9. Similarity measures Euclidean distance Manhattan distance Manhattan segmental distance (using only part of the dimensions)
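As an illustration, a minimal sketch of these three distances (function names are mine; for the segmental distance I assume the common convention of averaging over the chosen subset of dimensions):

```python
import numpy as np

def euclidean(x, y):
    return float(np.linalg.norm(np.asarray(x, float) - np.asarray(y, float)))

def manhattan(x, y):
    return float(np.sum(np.abs(np.asarray(x, float) - np.asarray(y, float))))

def manhattan_segmental(x, y, dims):
    # Uses only the listed dimensions, averaged over their number
    # (the averaging convention is an assumption; some texts use a plain sum).
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(np.abs(x[dims] - y[dims])) / len(dims))
```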

  10. Similarity measures Maximum distance (sup distance) Minkowski distance is the general case: R = 2 gives the Euclidean distance; R = 1 the Manhattan distance; R = ∞ the maximum distance.
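The whole Minkowski family fits in one function (a hedged sketch; R = ∞ is handled as the limiting maximum distance):

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance: r=1 Manhattan, r=2 Euclidean, r=inf maximum."""
    diff = np.abs(np.asarray(x, float) - np.asarray(y, float))
    if np.isinf(r):
        return float(diff.max())            # sup (maximum) distance
    return float(np.sum(diff ** r) ** (1.0 / r))
```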

  11. Similarity measures Mahalanobis distance It is invariant under non-singular transformations. The new covariance matrix is
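The invariance claim can be checked numerically: under a non-singular transformation z = Ax, the covariance becomes A S Aᵀ and the distance is unchanged (a sketch; the data and the matrix A below are arbitrary choices of mine):

```python
import numpy as np

def mahalanobis(x, y, cov):
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
S = np.cov(X.T)                       # sample covariance matrix
A = np.array([[2., 1., 0.],
              [0., 1., 1.],
              [1., 0., 1.]])          # any non-singular matrix (det = 3)
d_before = mahalanobis(X[0], X[1], S)
d_after = mahalanobis(A @ X[0], A @ X[1], A @ S @ A.T)  # same distance
```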

  12. Similarity measures The Mahalanobis distance doesn’t change

  13. Similarity measures Chord distance: the length of the chord joining the two normalized points within a hypersphere of radius one Geodesic distance: the length of the shorter arc connecting the two normalized data points at the surface of the hypersphere of unit radius
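A sketch of both (points are first projected onto the unit hypersphere; the relation geodesic = 2·arcsin(chord/2) follows from elementary circle geometry; function names are mine):

```python
import numpy as np

def chord_distance(x, y):
    # Normalize both points onto the unit hypersphere, then take the
    # straight-line (chord) distance between them.
    u = np.asarray(x, float); u = u / np.linalg.norm(u)
    v = np.asarray(y, float); v = v / np.linalg.norm(v)
    return float(np.linalg.norm(u - v))

def geodesic_distance(x, y):
    # Length of the shorter arc on the unit hypersphere, i.e. the angle
    # between the two normalized points.
    return float(2.0 * np.arcsin(chord_distance(x, y) / 2.0))
```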

  14. Similarity measures

  15. Similarity measures Categorical data: In one dimension: Simple matching distance: Taking category frequency into account:
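For simple matching, the per-dimension distance is 0 on a match and 1 on a mismatch; over a vector it is commonly summed (a minimal sketch; the frequency-weighted variant on the slide is an image and is not reproduced here):

```python
def simple_matching_distance(x, y):
    # Count of attributes whose categories differ.
    return sum(1 for a, b in zip(x, y) if a != b)
```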

  16. Similarity measures For more general definitions of similarity, define: Number of matches: Number of matches to a missing value ('?' denotes missing here): Number of non-matches:

  17. Similarity measures

  18. Similarity measures Binary feature vectors: Define: S as the number of positions at which a given combination of values (1-1, 1-0, 0-1, or 0-0) occurs.
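A sketch of the four counts, plus one common coefficient built from them (Jaccard, which ignores 0-0 agreements; the slide's own coefficient table is an image, so the choice of Jaccard here is my illustration):

```python
def binary_counts(x, y):
    """Counts of the four value combinations over two binary vectors."""
    a = sum(1 for p, q in zip(x, y) if (p, q) == (1, 1))
    b = sum(1 for p, q in zip(x, y) if (p, q) == (1, 0))
    c = sum(1 for p, q in zip(x, y) if (p, q) == (0, 1))
    d = sum(1 for p, q in zip(x, y) if (p, q) == (0, 0))
    return a, b, c, d

def jaccard_similarity(x, y):
    # Fraction of 1-1 agreements among positions where at least one vector is 1.
    a, b, c, _ = binary_counts(x, y)
    return a / (a + b + c)
```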

  19. Similarity measures

  20. Similarity measures Mixed-type data: general similarity coefficient by Gower. For quantitative attributes, the per-attribute score is 1 - |xk - yk| / R (R is the range), if neither value is missing. For binary attributes, the score is 1 if xk = 1 and yk = 1, and 0 if only one of xk, yk is 1. For nominal attributes, the score is 1 if xk = yk, if neither value is missing.
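A sketch of Gower's coefficient under the conventions above (missing values and binary 0-0 pairs are excluded from the average, as in Gower's original definition; the argument names are my own):

```python
def gower_similarity(x, y, kinds, ranges):
    """kinds[k] in {'num', 'bin', 'nom'}; ranges[k] is the attribute range for 'num'."""
    total, weight = 0.0, 0
    for k, (xk, yk) in enumerate(zip(x, y)):
        if xk is None or yk is None:
            continue                                 # skip missing values
        if kinds[k] == 'num':
            total += 1.0 - abs(xk - yk) / ranges[k]  # quantitative: 1 - |x-y|/R
        elif kinds[k] == 'bin':
            if (xk, yk) == (0, 0):
                continue                             # 0-0 pairs carry no weight
            total += 1.0 if (xk, yk) == (1, 1) else 0.0
        else:
            total += 1.0 if xk == yk else 0.0        # nominal: exact match
        weight += 1
    return total / weight
```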

  21. Similarity measures Similarity between clusters: mean-based distance; nearest-neighbor distance; farthest-neighbor distance; average-neighbor distance.
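The four between-cluster distances can be sketched on top of the Euclidean point distance (function names are mine):

```python
import numpy as np

def _pairwise(A, B):
    # Matrix of Euclidean distances between every point of A and every point of B.
    A, B = np.asarray(A, float), np.asarray(B, float)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def nearest_neighbor(A, B):  return float(_pairwise(A, B).min())
def farthest_neighbor(A, B): return float(_pairwise(A, B).max())
def average_neighbor(A, B):  return float(_pairwise(A, B).mean())

def mean_based(A, B):
    # Distance between the cluster centers (means).
    return float(np.linalg.norm(np.mean(A, axis=0) - np.mean(B, axis=0)))
```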

  22. Hierarchical clustering Agglomerative: build tree by joining nodes; Divisive: build tree by dividing groups of objects.

  23. Hierarchical clustering Example data:

  24. Hierarchical clustering Single linkage: find the distance between any two nodes by nearest neighbor distance.

  25. Hierarchical clustering Single linkage:

  26. Hierarchical clustering Complete linkage: find the distance between any two nodes by farthest neighbor distance. Average linkage: find the distance between any two nodes by average distance.

  27. Hierarchical clustering Comments: Hierarchical clustering generates a tree; to find clusters, the tree needs to be cut at a certain height; Complete linkage method favors compact, ball-shaped clusters; single linkage method favors chain-shaped clusters; average linkage is somewhere in between.
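The agglomerative procedure can be sketched naively for single linkage (an O(n³) illustration, not an efficient implementation; the function name is mine):

```python
import numpy as np

def single_linkage_clusters(X, k):
    """Agglomerative clustering with single (nearest-neighbor) linkage,
    merging until k clusters remain."""
    X = np.asarray(X, float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # point distances
    clusters = [[i] for i in range(len(X))]                    # start: singletons
    while len(clusters) > k:
        # Find the pair of clusters with the smallest nearest-neighbor distance...
        best_a, best_b, best_d = 0, 1, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_a, best_b, best_d = a, b, d
        # ...and merge them.
        clusters[best_a].extend(clusters[best_b])
        del clusters[best_b]
    return clusters
```

Replacing the inner `min` with `max` gives complete linkage, and replacing it with the mean of the pairwise distances gives average linkage.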

  28. Model-based clustering Impose certain model assumptions on potential clusters; try to optimize the fit between data and model. The data is viewed as coming from a mixture of probability distributions; each of the distributions represents a cluster.

  29. Model-based clustering For example, if we believe the data come from a mixture of several Gaussian densities, the likelihood that data point i is from cluster j is:
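The formula itself is an image on the original slide; in standard notation (symbol names assumed, with τ_j the mixing proportions), the Gaussian mixture density for data point x_i would read:

```latex
p(x_i \mid \theta) \;=\; \sum_{j=1}^{k} \tau_j \,\phi(x_i \mid \mu_j, \Sigma_j),
\qquad
\phi(x \mid \mu, \Sigma) \;=\;
\frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
\exp\!\Big(-\tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\Big)
```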

  30. Model-based clustering Given the number of clusters, we try to maximize the likelihood, where the mixing proportion of cluster j is the probability that an observation belongs to cluster j. The most commonly used method is the EM algorithm; it iterates between soft cluster assignment and parameter estimation.
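The E-step/M-step iteration can be sketched for a one-dimensional mixture (an illustrative sketch, not the slide's exact algorithm; quantile-based initialization and all names are my choices):

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=100):
    """Minimal EM for a one-dimensional Gaussian mixture."""
    x = np.asarray(x, float)
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))  # spread-out initial means
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each cluster for each point (soft assignment)
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
                  / (sigma * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing proportions, means, and standard deviations
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma
```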

  31. Model-based clustering

  32. Model-based clustering Gaussian cluster models. Common assumptions: moving from 1 to 4, the model becomes more flexible, but more parameters need to be estimated and the fit may become less stable.
