Clustering

Clustering • Paolo Ferragina, Dipartimento di Informatica, Università di Pisa • This is a mix of slides taken from several presentations, plus my touch!

Objectives of Cluster Analysis • Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups • Competing objectives: intra-cluster distances are minimized, inter-cluster distances are maximized • The commonest form of unsupervised learning

Yahoo! Hierarchy isn’t clustering but is the kind of output you want from clustering [Figure: the www.yahoo.com/Science hierarchy, with top-level categories agriculture, biology, physics, CS, space, ... and subcategories such as dairy, crops, agronomy, forestry, botany, cell, evolution, magnetism, relativity, AI, HCI, courses, craft, missions]

Google News: automatic clustering gives an effective news presentation metaphor

Sec. 16.1 For improving search recall • Cluster hypothesis - Documents in the same cluster behave similarly with respect to relevance to information needs • Therefore, to improve search recall: • Cluster docs in corpus a priori • When a query matches a doc D, also return other docs in the cluster containing D • Hope if we do this: The query “car” will also return docs containing automobile
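The recall idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `expand_with_clusters` and the `doc_to_cluster` mapping are hypothetical, and the clustering is assumed to have been computed beforehand.

```python
from collections import defaultdict

def expand_with_clusters(matched_docs, doc_to_cluster):
    """Return the matched docs plus every doc sharing a cluster with one of them."""
    # Invert the (hypothetical) doc -> cluster map into cluster -> docs.
    cluster_to_docs = defaultdict(set)
    for doc, cluster in doc_to_cluster.items():
        cluster_to_docs[cluster].add(doc)

    expanded = set(matched_docs)
    for doc in matched_docs:
        expanded |= cluster_to_docs[doc_to_cluster[doc]]
    return expanded

# Toy corpus: d1 contains "car", d2 contains "automobile"; both sit in cluster 0.
doc_to_cluster = {"d1": 0, "d2": 0, "d3": 1}
results = expand_with_clusters({"d1"}, doc_to_cluster)
```

Here the query matches only d1, but d2 is returned as well because it was clustered with d1 a priori; d3 is left out.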

Sec. 16.1 For better visualization/navigation of search results

Sec. 16.2 Issues for clustering • Representation for clustering • Document representation • Vector space? Normalization? • Need a notion of similarity/distance • How many clusters? • Fixed a priori? • Completely data driven?

Notion of similarity/distance • Ideal: semantic similarity • Practical: term-statistical similarity • Docs as vectors • We will use cosine similarity. • For many algorithms, easier to think in terms of a distance (rather than similarity) between docs.
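A minimal sketch of cosine similarity between two docs represented as term-weight vectors, plus one common way to turn it into a distance. The function names and the toy vectors are assumptions for illustration; `1 - cosine` is a convenient dissimilarity, not necessarily the distance used later in these slides.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty docs
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Dissimilarity derived from cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

doc_a = [1.0, 1.0, 0.0]
doc_b = [2.0, 2.0, 0.0]   # same direction as doc_a, different length
doc_c = [0.0, 0.0, 3.0]   # shares no terms with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

For length-normalized vectors the squared Euclidean distance equals 2(1 - cosine), which is why algorithms stated in terms of Euclidean distance can still be used with cosine similarity after normalization.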

Clustering Algorithms • Flat algorithms • Create a set of clusters • Usually start with a random (partial) partitioning • Refine it iteratively • K-means clustering • Hierarchical algorithms • Create a hierarchy of clusters (dendrogram) • Bottom-up, agglomerative • Top-down, divisive

Hard vs. soft clustering • Hard clustering: Each document belongs to exactly one cluster • More common and easier to do • Soft clustering: Each document can belong to more than one cluster. • Makes more sense for applications like creating browsable hierarchies • News is a proper example • Search results is another example

Flat & Partitioning Algorithms • Given: a set of n documents and the number K • Find: a partition into K clusters that optimizes the chosen partitioning criterion • Globally optimal • Intractable for many objective functions • It would require exhaustively enumerating all partitions • Locally optimal • Effective heuristic methods: K-means and K-medoids algorithms

Sec. 16.4 K-Means • Assumes documents are real-valued vectors. • Clusters based on centroids (aka the center of gravity or mean) of points in a cluster c: μ(c) = (1/|c|) Σ_{x ∈ c} x • Reassignment of instances to clusters is based on distance to the current cluster centroids.

Sec. 16.4 K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or other stopping criterion):
    For each doc di:
        Assign di to the cluster cr such that dist(di, sr) is minimal.
    For each cluster cj:
        sj = μ(cj)   (recompute the centroid of cj)
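The loop above can be written as a short, dependency-free sketch. Several choices here are assumptions, not part of the slide: squared Euclidean distance as `dist`, an unchanged doc partition as the stopping test, a `max_iters` cap, and keeping the old centroid when a cluster becomes empty. All function names are illustrative.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def k_means(docs, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Select K random docs as seeds.
    centroids = [list(d) for d in rng.sample(docs, k)]
    assignment = None
    for _ in range(max_iters):
        # Assign each doc to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda r: squared_distance(d, centroids[r]))
            for d in docs
        ]
        if new_assignment == assignment:  # doc partition unchanged: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = centroid(members)
    return assignment, centroids

docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centroids = k_means(docs, k=2)
```

On this toy input the two nearby pairs end up in separate clusters whichever two docs are drawn as seeds.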

Sec. 16.4 K-Means Example (K=2) [Figure: successive panels showing pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged]

Sec. 16.4 Termination conditions • Several possibilities, e.g., • A fixed number of iterations. • Doc partition unchanged. • Centroid positions don’t change.

Sec. 16.4 Convergence • Why should the K-means algorithm ever reach a fixed point? • K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm • EM is known to converge • Number of iterations could be large • But in practice usually isn’t

Sec. 16.4 Convergence of K-Means • Define the goodness measure of cluster c as the sum of squared distances from the cluster centroid: • G(c,s) = Σi (di - sc)^2 (sum over all di in cluster c) • G(C,s) = Σc G(c,s) • Reassignment never increases G • It is a coordinate descent algorithm (optimize one component at a time) • At any step we have some value for G(C,s): 1) Fix s, optimize C → assign each d to the closest centroid → G(C’,s) ≤ G(C,s) 2) Fix C’, optimize s → take the new centroids (the mean minimizes the sum of squared distances) → G(C’,s’) ≤ G(C’,s) ≤ G(C,s) • The new cost is never larger than the original one and there are finitely many partitions, so the procedure reaches a local minimum
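The two-step argument can be checked numerically. The sketch below records G after every assignment step and every centroid step on a toy data set; the data, the deliberately poor starting state, and all function names are assumptions made for illustration.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def goodness(docs, assignment, centroids):
    """G(C, s): total squared distance of each doc to its cluster's centroid."""
    return sum(sq_dist(d, centroids[a]) for d, a in zip(docs, assignment))

def assign(docs, centroids):
    """Step 1: fix s, optimize C."""
    return [min(range(len(centroids)), key=lambda r: sq_dist(d, centroids[r]))
            for d in docs]

def recompute(docs, assignment, centroids):
    """Step 2: fix C, optimize s (an empty cluster keeps its old centroid)."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [d for d, a in zip(docs, assignment) if a == j]
        if members:
            new_centroids.append([sum(c) / len(members) for c in zip(*members)])
        else:
            new_centroids.append(old)
    return new_centroids

docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.0], [0.0, 1.0]]   # deliberately poor seeds
assignment = [0, 0, 0, 0]              # arbitrary starting partition

trace = [goodness(docs, assignment, centroids)]
for _ in range(3):
    assignment = assign(docs, centroids)
    trace.append(goodness(docs, assignment, centroids))
    centroids = recompute(docs, assignment, centroids)
    trace.append(goodness(docs, assignment, centroids))
```

Each entry of `trace` is no larger than the one before it, which is the monotonicity the slide relies on.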

Sec. 16.4 Time Complexity • There are N docs and K centroids • Each doc/centroid consists of M dimensions • Computing the distance between two vectors takes O(M) time. • Reassigning clusters: each doc is compared with all centroids, O(KNM) time. • Computing centroids: each doc gets added once to some centroid, O(NM) time. • Assume these two steps are each done once per iteration, for I iterations: O(IKNM).

Sec. 16.4 Seed Choice • Results can vary based on random seed selection. • Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings. • Select good seeds using a heuristic • Pick the doc least similar to any existing centroid • Or pick at random, according to a probability distribution that grows with the distance from the centroids already chosen • Example showing sensitivity to seeds [Figure: six points A to F, not reproduced] • If you start with B and E as centroids you converge to {A,B,C} and {D,E,F} • If you start with D and F you converge to {A,B,D,E} and {C,F}
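Both seeding heuristics can be sketched as follows. The first picks, at each step, the doc farthest from its nearest seed. The second draws the next seed at random with weight equal to the squared distance to the nearest seed already chosen (the weighting used by k-means++), so far-away docs are more likely to be picked. Function names are illustrative, squared Euclidean distance is an assumption, and the weighted variant assumes at least K distinct docs.

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def furthest_first_seeds(docs, k, seed=0):
    """Start from a random doc, then repeatedly add the doc farthest from all seeds."""
    rng = random.Random(seed)
    seeds = [rng.choice(docs)]
    while len(seeds) < k:
        seeds.append(max(docs, key=lambda d: min(sq_dist(d, s) for s in seeds)))
    return seeds

def distance_weighted_seeds(docs, k, seed=0):
    """Start from a random doc, then sample with weight = squared distance to nearest seed."""
    rng = random.Random(seed)
    seeds = [rng.choice(docs)]
    while len(seeds) < k:
        weights = [min(sq_dist(d, s) for s in seeds) for d in docs]
        # Docs already chosen have weight 0, so they cannot be drawn again.
        seeds.append(rng.choices(docs, weights=weights, k=1)[0])
    return seeds

docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
ff_seeds = furthest_first_seeds(docs, k=2)
dw_seeds = distance_weighted_seeds(docs, k=2)
```

The deterministic variant is sensitive to outliers (an outlier is by definition far from everything), which is one reason to prefer the randomized weighting.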

How Many Clusters? • Number of clusters K is given • Partition n docs into predetermined number of clusters • Finding the “right” number of clusters is part of the problem • Can usually take an algorithm for one flavor and convert to the other.

Bisecting K-means Variant of K-means that can produce a partitional or a hierarchical clustering

Bisecting K-means Example

K-means Pros • Simple • Fast for low dimensional data • It can find pure sub-clusters if large number of clusters is specified (but, over-partitioning) Cons • K-Means cannot handle non-globular data of different sizes and densities • K-Means will not identify outliers • K-Means is restricted to data which has the notion of a center (centroid)

Ch. 17 Hierarchical Clustering • Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents • One approach: recursive application of a partitional clustering algorithm [Figure: example taxonomy with root animal; vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]

Strengths of Hierarchical Clustering • No assumption of any particular number of clusters • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level • They may correspond to meaningful taxonomies • Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)

Sec. 17.1 Hierarchical Agglomerative Clustering (HAC) • Starts with each doc in a separate cluster • Then repeatedly joins the closest pair of clusters, until there is only one cluster. • The history of mergings forms a binary tree or hierarchy.
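A naive sketch of the HAC loop, recomputing all cluster-to-cluster distances at every step. It is written for clarity, not efficiency. The names `hac` and `single_link_distance`, the use of Euclidean distance, and the choice of single link as the example criterion are assumptions.

```python
import math

def single_link_distance(points, cluster_a, cluster_b):
    """Distance between the two closest points of the two clusters (lists of indices)."""
    return min(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)

def hac(points, cluster_distance):
    """Return the merge history as a list of (cluster_a, cluster_b, distance)."""
    clusters = [[i] for i in range(len(points))]   # each doc starts alone
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges

points = [[0.0], [1.0], [5.0], [6.0], [20.0]]
merges = hac(points, single_link_distance)
```

The returned list has one entry per merge (n - 1 in total) and is exactly the "history of mergings" from which the binary tree is drawn.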

Sec. 17.2 Closest pair of clusters • Single-link • Similarity of the closest points, the most cosine-similar • Complete-link • Similarity of the farthest points, the least cosine-similar • Centroid • Clusters whose centroids are the closest (or most cosine-similar) • Average-link • Clusters whose average distance between pairs of elements is the smallest (or whose average cosine is the largest)
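The four criteria can be sketched side by side. The sketch is phrased in terms of Euclidean distance rather than cosine similarity, which is an assumption: with similarities, "min" and "max" swap roles. Function names are illustrative.

```python
import math
from itertools import product

def single_link(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_link(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average_link(a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid_link(a, b):
    """Distance between the two cluster centroids."""
    def centroid(points):
        return [sum(c) / len(points) for c in zip(*points)]
    return math.dist(centroid(a), centroid(b))

cluster_a = [[0.0, 0.0], [0.0, 2.0]]
cluster_b = [[3.0, 0.0], [5.0, 0.0]]

d_single = single_link(cluster_a, cluster_b)
d_complete = complete_link(cluster_a, cluster_b)
d_average = average_link(cluster_a, cluster_b)
d_centroid = centroid_link(cluster_a, cluster_b)
```

By construction the single-link value is never larger than the average-link value, which is never larger than the complete-link value.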

How to Define Inter-Cluster Similarity • Single link (MIN) • Complete link (MAX) • Centroids • Average [Figure: proximity matrix with rows and columns p1, p2, p3, p4, p5, ...]

Starting Situation • Start with clusters of individual points and a proximity matrix [Figure: proximity matrix with rows and columns p1, p2, p3, p4, p5, ...]

Intermediate Situation • After some merging steps, we have some clusters [Figure: clusters C1 to C5 and their proximity matrix]

Intermediate Situation • We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. [Figure: clusters C1 to C5 and their proximity matrix]

After Merging • The question is “How do we update the proximity matrix?” [Figure: proximity matrix over C1, C2 U C5, C3, C4, with the entries in the row and column of C2 U C5 marked “?”]
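One way to answer the question is to fill the "?" entries directly from the two rows being merged, without going back to the individual points. The sketch below does this for single, complete and (size-weighted) average link. It works on distances rather than similarities, stores the matrix as a nested dict, and uses the hypothetical name `merge_clusters`; all of these are assumptions.

```python
def merge_clusters(dist, sizes, a, b, linkage="single"):
    """Merge clusters a and b; return the updated distance matrix and cluster sizes."""
    merged = f"{a} U {b}"
    others = [c for c in dist if c not in (a, b)]
    new_row = {}
    for c in others:
        if linkage == "single":        # closest pair decides
            new_row[c] = min(dist[a][c], dist[b][c])
        elif linkage == "complete":    # farthest pair decides
            new_row[c] = max(dist[a][c], dist[b][c])
        elif linkage == "average":     # mean over all cross pairs, via size weighting
            total = sizes[a] * dist[a][c] + sizes[b] * dist[b][c]
            new_row[c] = total / (sizes[a] + sizes[b])
        else:
            raise ValueError(f"unknown linkage: {linkage}")
    # Copy the untouched part of the matrix, then add the new row and column.
    new_dist = {c: {o: dist[c][o] for o in others if o != c} for c in others}
    for c in others:
        new_dist[c][merged] = new_row[c]
    new_dist[merged] = new_row
    new_sizes = {c: sizes[c] for c in others}
    new_sizes[merged] = sizes[a] + sizes[b]
    return new_dist, new_sizes

# Toy state with three clusters; C2 and C5 are the closest pair.
dist = {
    "C1": {"C2": 4.0, "C5": 6.0},
    "C2": {"C1": 4.0, "C5": 1.0},
    "C5": {"C1": 6.0, "C2": 1.0},
}
sizes = {"C1": 1, "C2": 2, "C5": 1}

single, _ = merge_clusters(dist, sizes, "C2", "C5", "single")
complete, _ = merge_clusters(dist, sizes, "C2", "C5", "complete")
average, new_sizes = merge_clusters(dist, sizes, "C2", "C5", "average")
```

Only the row and column of the merged cluster change; every other entry is copied over unchanged.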

Cluster Similarity: MIN or Single Link • Similarity of two clusters is based on the two most similar (closest) points in the different clusters • Determined by one pair of points, i.e., by one link in the proximity graph.

Hierarchical Clustering: MIN [Figure: nested clusters and the corresponding dendrogram]

Strength of MIN • Can handle non-elliptical shapes [Figure: original points and the two clusters found]

Limitations of MIN • Sensitive to noise and outliers [Figure: original points and the two clusters found]

Cluster Similarity: MAX or Complete Linkage • Similarity of two clusters is based on the two least similar (most distant) points in the different clusters • Determined by all pairs of points in the two clusters

Hierarchical Clustering: MAX [Figure: nested clusters and the corresponding dendrogram]

Strength of MAX • Less susceptible to noise and outliers [Figure: original points and the two clusters found]

Limitations of MAX • Tends to break large clusters • Biased towards globular clusters [Figure: original points and the two clusters found]

Cluster Similarity: Average • Proximity of two clusters is the average of pairwise proximity between points in the two clusters.

Hierarchical Clustering: Average [Figure: nested clusters and the corresponding dendrogram]

Hierarchical Clustering: Comparison [Figure: nested clusters produced by MIN, MAX and Average on the same six points]

Sec. 16.3 How to evaluate clustering quality? • Assess a clustering with respect to ground truth … requires labeled data • Producing the gold standard is costly!
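The slide does not name a specific measure. As one illustration of evaluation against ground truth, the sketch below computes purity: each cluster is credited with its most frequent gold label, and the credited docs are divided by the total number of docs. The choice of purity, the function name and the toy labels are assumptions.

```python
from collections import Counter

def purity(cluster_labels, gold_labels):
    """Fraction of docs that carry the majority gold label of their cluster."""
    counts_by_cluster = {}
    for cluster, gold in zip(cluster_labels, gold_labels):
        counts_by_cluster.setdefault(cluster, Counter())[gold] += 1
    majority_total = sum(counts.most_common(1)[0][1]
                         for counts in counts_by_cluster.values())
    return majority_total / len(gold_labels)

cluster_labels = [0, 0, 0, 1, 1, 1]
gold_labels = ["x", "x", "o", "o", "o", "x"]
score = purity(cluster_labels, gold_labels)
```

Purity is easy to read but trivially reaches 1 when every doc is its own cluster, so it is usually reported together with the number of clusters or with another measure.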

A different approach: Co-Clustering • Co-occurrence Matrices Characteristics • Data sparseness • High dimension • Noise • Is it possible to combine the document and term clustering together? • Yes, Co-Clustering • Simultaneously cluster the rows and columns of the co-occurrence matrix.

Information-Theoretic Co-Clustering • View co-occurrence matrix as a joint probability distribution between row & column random variables • We seek a hard-clustering of both dimensions such that loss in “Mutual Information” is minimized given a fixed no. of row & col. clusters
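The objective can be made concrete with a small computation: take the joint distribution p(X, Y), aggregate it into the clustered distribution p(X̂, Ŷ) for a given hard co-clustering, and measure the loss I(X; Y) - I(X̂; Ŷ). The sketch only evaluates a given co-clustering; it does not search for the best one. The function names and the toy matrix are assumptions.

```python
import math

def mutual_information(joint):
    """I(X; Y) in bits for a joint distribution given as a list of rows."""
    row_marginals = [sum(row) for row in joint]
    col_marginals = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (row_marginals[i] * col_marginals[j]))
    return mi

def aggregate(joint, row_clusters, col_clusters, n_row_clusters, n_col_clusters):
    """Sum the joint probabilities that fall into each (row cluster, column cluster) block."""
    agg = [[0.0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            agg[row_clusters[i]][col_clusters[j]] += p
    return agg

# Toy co-occurrence matrix, normalized: docs 0-1 use terms 0-1, docs 2-3 use terms 2-3.
joint = [
    [0.125, 0.125, 0.0, 0.0],
    [0.125, 0.125, 0.0, 0.0],
    [0.0, 0.0, 0.125, 0.125],
    [0.0, 0.0, 0.125, 0.125],
]
original_mi = mutual_information(joint)

good = aggregate(joint, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)   # follows the block structure
bad = aggregate(joint, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)    # mixes the two blocks

loss_good = original_mi - mutual_information(good)
loss_bad = original_mi - mutual_information(bad)
```

The co-clustering that follows the block structure loses no mutual information, while the one that mixes the blocks loses all of it.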
