
Data Mining (and machine learning)



  1. Data Mining (and machine learning) DM Lecture 5: Clustering David Corne and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  2. Today • (unsupervised) Clustering • What and why • A good first step towards understanding your data • Discover patterns and structure in data • Which then guides further data mining • Helps to spot problems and outliers • Identifies ‘market segments’, e.g. specific types of customers or users; this is called segmentation or market segmentation • How to do it: • Choose a distance measure or a similarity measure • Run a (usually) simple algorithm • We will cover the two main algorithms

  3. What is Clustering? Why do it? Consider these data: they are made up, but suppose they describe 17 subscribers to a mobile phone company, showing the mean calls per day and the mean monthly bill for each customer. Do you spot any patterns or structure?

  4. What is Clustering? Why do it? Here is a plot of the data, with calls as X and bills as Y. Now do you spot any patterns or structure?

  5. What is Clustering? Why do it? Clearly there are two clusters – two distinct types of customer. Top left: few calls but fairly high bills; bottom right: many calls, low bills.

  6. So, clustering is all about plotting/visualising and noting distinct groups by eye, right? • Not really, because: • We can only spot patterns by eye (i.e. with our brains) if the data is 1D, 2D or 3D. Most data of interest is much higher dimensional – e.g. 10D, 20D, 1000D. • Sometimes the clusters are not so obvious as a bunch of data all in the same place – we will see examples. • So: we need automated algorithms which can do what you just did (find distinct groups in the data), but which can do this for any number of dimensions, and for perhaps more complex kinds of groups.

  7. OK, so when we apply an automated clustering algorithm to data, the result is a collection of groups, which tells us potentially useful things about our data – e.g. if we are doing this with supermarket baskets, each group is a collection of typical baskets, which may relate to “general housekeeping”, “late night dinner”, “quick lunchtime shopper”, and perhaps other types that we are not expecting? Yes.

  8. Quality of a clustering? Why is this: Better than this?

  9. Quality of a clustering? • A ‘good clustering’ has the following properties: • Items in the same cluster tend to be close to each other • Items in different clusters tend to be far from each other. It is not hard to come up with a metric – an easily calculated value – that can be used to give a score to any clustering. There are many such metrics. E.g. S = the mean distance between pairs of items in the same cluster; D = the mean distance between pairs of items in different clusters. The measure of cluster quality is D/S – the higher the better.
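As a sketch, the D/S score just described can be computed directly from a set of points and their cluster labels. The function and data below are illustrative (our own names, Euclidean distance assumed):

```python
# Illustrative sketch of the D/S cluster-quality score described above.
# Assumes points are coordinate tuples and labels[i] is point i's cluster ID.
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

def ds_quality(points, labels):
    same, diff = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (same if lp == lq else diff).append(dist(p, q))
    s = sum(same) / len(same)  # S: mean within-cluster distance
    d = sum(diff) / len(diff)  # D: mean between-cluster distance
    return d / s               # higher is better

# Two tight, well-separated clusters score far higher than an arbitrary split:
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = ds_quality(pts, [0, 0, 0, 1, 1, 1])
bad = ds_quality(pts, [0, 1, 0, 1, 0, 1])
```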

  10. Let’s try that (figure: eight points A–H, grouped {A, B, D, F, H} and {C, E, G}) S = [AB + AD + AF + AH + BD + BF + BH + DF + DH + FH + CE + CG + EG] / 13 = 44/13 = 3.38 D = [AC + AE + AG + BC + BE + BG + DC + DE + DG + FC + FE + FG + HC + HE + HG] / 15 = 40/15 = 2.67 Cluster Quality = D/S ≈ 0.79

  11. Let’s try that again (figure: eight points A–H, grouped {A, B, C, D} and {E, F, G, H}) S = [AB + AC + AD + BC + BD + CD + EF + EG + EH + FG + FH + GH] / 12 = 20/12 = 1.67 D = [AE + AF + AG + AH + BE + BF + BG + BH + CE + CF + CG + CH + DE + DF + DG + DH] / 16 = 68/16 = 4.25 Cluster Quality = D/S = 2.54

  12. But what about this? (figure: eight points A–H, grouped {A, B}, {C, D} and {E, F, G, H}) S = [AB + CD + EF + EG + EH + FG + FH + GH] / 8 = 12/8 = 1.5 D = [AC + AD + AE + AF + AG + AH + BC + BD + BE + BF + BG + BH + CE + CF + CG + CH + DE + DF + DG + DH] / 20 = 72/20 = 3.6 Cluster Quality = D/S = 2.40

  13. Some important notes • There is usually no ‘correct’ clustering. • Clustering algorithms (whether or not they work with cluster quality metrics) always use some kind of distance or similarity measure – the result of the clustering process will depend on the chosen distance measure. • Choice of algorithm, and/or distance measure, will depend on the kind of cluster shapes you might expect in the data. • Our D/S measure for cluster quality will not work well in lots of cases.

  14. Examples: sometimes groups are not simple to spot, even in 2D Slide credit: Julia Handl

  15. Examples: sometimes groups are not simple to spot, even in 2D Slide credit: Julia Handl

  16. Brain Training • Think about why D/S is not a useful cluster quality measure in the general case. • Try to design a cluster quality metric that will work well in the cases of the previous slides (not very difficult).

  17. In many problems the clusters are more ‘conventional’ – but maybe fuzzy and unclear Slide credit: Julia Handl

  18. And there is a different kind of clustering that can be done, which avoids the issue of deciding the number of clusters in advance Slide credit: Elias Raftopoulos, Prof. Maria Papadopouli

  19. How to do it • The most commonly used methods: • K-Means • Hierarchical Agglomerative Clustering

  20. K-Means • If you want to see K clusters, then run K-means, i.e. you need to choose the number of clusters in advance. Say K=3 – run 3-means and the result is a ‘good’ grouping of the data into 3 clusters. • It works by generating K points (in a way, these are made-up records in the data); each point is the centre (or centroid) of one cluster. As the algorithm iterates, the points adjust their positions until they stabilise. • Very simple, fairly fast, very common; a few drawbacks.

  21. Let’s see it

  22. Here is the data; we choose k = 2 and run 2-means

  23. We choose two cluster centres -- randomly

  24. Step 1: decide which cluster each point is in – the one whose centre is closest

  25. Step 2: We now have two clusters – recompute the centre of each cluster

  26. These are the new centres

  27. Step 1: decide which cluster each point is in – the one whose centre is closest

  28. This one has to be reassigned:

  29. Step 2: We now have two new clusters – recompute the centre of each cluster

  30. Centres now slightly moved

  31. Step 1: decide which cluster each point is in – the one whose centre is closest

  32. In this case, nothing gets reassigned to a new cluster – so the algorithm is finished

  33. The K-Means Algorithm • Step 0: Choose k. • Step 1: Randomly choose k points, labelled {1, 2, …, k}, to be the initial cluster centroids. • Step 2: For each datum, let its cluster ID be the label of its closest centroid. • Step 3: For each cluster, recalculate its actual centre. • Go back to Step 2; stop when Step 2 does not change the cluster ID of any point.
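A minimal sketch of these steps in Python (illustrative, not the lecture's own code; initial centroids are sampled from the data, which is one common choice):

```python
# K-means sketch following the steps above (illustrative, not optimised).
import random
from math import dist

def k_means(points, k, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)  # random initial centroids
    while True:
        # assignment step: each datum gets the label of its closest centroid
        assign = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # update step: recalculate each cluster's actual centre (the mean)
        new = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            new.append(tuple(sum(xs) / len(xs) for xs in zip(*members))
                       if members else centroids[c])
        if new == centroids:  # nothing changed, so the algorithm is finished
            return assign, centroids
        centroids = new

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assign, centres = k_means(pts, k=2)
```

On this toy data the two tight clumps end up with different labels, mirroring the walkthrough in the preceding slides.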

  34. Simple but often not ideal • Variable results with noisy data and outliers • Very large or very small values can skew the centroid positions, and give poor clusterings • Only suitable for cases where we can expect clusters to be ‘clumps’ that are close together – e.g. terrible in the two-spirals and similar cases
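A tiny numeric illustration of the outlier problem above: a centroid is just a mean, so a single extreme value drags it far from the bulk of its cluster (values here are made up).

```python
# A cluster centroid is a mean, so one outlier can skew it badly.
cluster = [1.0, 2.0, 3.0]
centroid = sum(cluster) / len(cluster)          # 2.0: central to the clump

with_outlier = cluster + [100.0]
skewed = sum(with_outlier) / len(with_outlier)  # 26.5: nowhere near the clump
```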

  35. Hierarchical Agglomerative Clustering Before we discuss this: • We need to know how to work out the distance between two points

  36. Hierarchical Agglomerative Clustering Before we discuss this: • And the distance between a point and a cluster?

  37. Hierarchical Agglomerative Clustering Before we discuss this: • And the distance between two clusters?

  38. Hierarchical Agglomerative Clustering Before we discuss this: • There are many options for all of these things; we will discuss them in a later lecture.
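As a hedged preview of those options (these are common choices, not the lecture's definitions): Euclidean distance between points, and single, complete or average linkage between clusters; a point-to-cluster distance is then just the cluster case with a singleton.

```python
# Sketch of common cluster-to-cluster distance options ("linkages").
from itertools import product
from math import dist  # Euclidean point-to-point distance

def single_link(ca, cb):
    # distance between the closest cross-cluster pair
    return min(dist(p, q) for p, q in product(ca, cb))

def complete_link(ca, cb):
    # distance between the farthest cross-cluster pair
    return max(dist(p, q) for p, q in product(ca, cb))

def average_link(ca, cb):
    # mean distance over all cross-cluster pairs
    return sum(dist(p, q) for p, q in product(ca, cb)) / (len(ca) * len(cb))

a = [(0, 0), (0, 1)]
b = [(3, 0)]  # a singleton, so this also covers point-to-cluster distance
```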

  39. Hierarchical Agglomerative Clustering • Is very commonly used • Very different from K-means • Provides a much richer structuring of the data • No need to choose k • But, quite sensitive to the various ways of working out distance (different results for different ways)

  40. Let’s see it (figure: nine data points, labelled 1–9)

  41. Initially, each point is a cluster

  42. Find closest pair of clusters, and merge them into one cluster

  43. Find closest pair of clusters, and merge them into one cluster

  44. Find closest pair of clusters, and merge them into one cluster

  45. Find closest pair of clusters, and merge them into one cluster

  46. Find closest pair of clusters, and merge them into one cluster

  47. Find closest pair of clusters, and merge them into one cluster

  48. Find closest pair of clusters, and merge them into one cluster

  49. Find closest pair of clusters, and merge them into one cluster

  50. Now all one cluster, so stop
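The merge loop traced in slides 40–50 can be sketched as follows. This is an illustrative sketch, not the lecture's code; single linkage is chosen as the merge criterion, and as slide 39 notes, a different choice of distance would give different merges.

```python
# Agglomerative clustering sketch: start with singleton clusters, then
# repeatedly merge the closest pair until one cluster remains.
from itertools import combinations, product
from math import dist

def cluster_distance(ca, cb):
    # single linkage (illustrative choice): distance of the closest cross-pair
    return min(dist(p, q) for p, q in product(ca, cb))

def agglomerate(points):
    clusters = [[p] for p in points]  # initially, each point is a cluster
    merges = []                       # record of merges (the dendrogram)
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]],
                                                   clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]                          # j > i, so indices stay valid
    return merges

merges = agglomerate([(0, 0), (0, 1), (5, 5)])
```

The recorded merge sequence is exactly the hierarchy a dendrogram would draw: the first merges join the closest points, the last one joins the final two clusters.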
