
Knowledge Enhanced Clustering




Presentation Transcript


  1. Knowledge Enhanced Clustering

  2. Clustering: “Find the Groups of Similar Things”. Find the set partition (or hyperplanes) that minimizes some objective function. [Scatter plot: Height vs. Weight]

  3. Clustering: “Find the Groups of Similar Things”. Find the set partition (or hyperplanes) that minimizes some objective function: $\arg\min_C \sum_{s_i} D(C_{f(s_i)} - s_i)$. [Scatter plot: Height vs. Weight]

  4. K Means Example (k=2): Initialize Centroids. [Scatter plots on slides 4–10: Height vs. Weight, centroids marked ×]

  5. K Means Example: Assign Points to Clusters

  6. K Means Example: Re-estimate Centroids

  7. K Means Example: Re-assign Points to Clusters

  8. K Means Example: Re-estimate Centroids

  9. K Means Example: Re-assign Points to Clusters

  10. K Means Example: Converge

  11. K Means Example: Convergence. Greedy algorithm; produces useful results, with linear time per iteration.
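
The slides above walk through Lloyd's iteration for k-means. A minimal sketch of that loop, assuming the data is a NumPy array of (height, weight) rows; the function name and defaults are illustrative, not from the talk:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate point assignment and centroid re-estimation."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimation step: each centroid becomes the mean of its points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments will no longer change
        centroids = new_centroids
    return labels, centroids
```

Each iteration is linear in the number of points, matching the slide's note; being greedy, the algorithm converges to a local minimum of the objective rather than the global one.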

  12. Where Data Driven Clustering Fails: a) Pandemic Preparation [Davidson and Ravi 2007a] • In collaboration with the Los Alamos / Virginia Tech Bio-informatics Institute • VBI micro-simulator based on census data, road networks, buildings, etc. • Ideal for modeling pandemics due to bird flu or bio-terrorism. • Problem: find spatial clusters of households that have a high propensity to be infected or not infected. • Currently at the city level (a million households), but soon the eastern seaboard and the entire country.

  13. Portland Pandemic Simulation

  14. Portland Pandemic Simulation • Typical results are shown on the left. • Not particularly useful for containment policy design because: • some regions are too large • uneven distribution of key facilities such as hospitals/schools

  15. Another Problem: b) Automatic Lane Finding from GPS Traces [Wagstaff, Langley et al. ’01] • Lane-level navigation (e.g., advance notification for taking exits) • Lane-keeping suggestions (e.g., lane departure warning)

  16. Mining GPS Traces • Instances are the (x, y) locations on the road.

  17. Mining GPS Traces • Instances are the (x, y) locations on the road. • This is a very good local minimum of the algorithm’s objective function, yet it is not the lane structure we want.

  18. Another Example: c) CMU Faces Database [Davidson, Wagstaff, Basu, ECML 06] • Useful for biometric applications such as face recognition.

  19. Typical But Not Useful Clusters For Our Purpose

  20. Limitations of Data Driven Clustering at a High Level • Objective functions were reasonably minimized. • Hoping the patterns are “novel and actionable” is a long shot. • Problem: find a general-purpose and principled way to encode knowledge into the many data mining algorithms. • A Bayesian approach?

  21. Outline • Knowledge enhanced mining with constraints • Motivation • How to add in domain expertise • Complexity results • Sufficient conditions and algorithms • Other work potentially applicable to sky survey data • Speeding up algorithms by scaling down data • Mining poor quality data • Mining with biased data sets

  22. What Type of Knowledge Do We Want To Represent? • Explicit: points more than 3 metres apart along the y-axis must be in different clusters. • Implicit: the people in two images have similar or dissimilar features.

  23. Representing Knowledge With Constraints • Clustering is finding a set partition • Must-Link (ML) Constraints • Explicit: points s_i and s_j (i ≠ j) must be assigned to the same cluster. An equivalence relation. • Implicit: s_i and s_j are similar • Cannot-Link (CL) Constraints • Explicit: points s_i and s_j (i ≠ j) cannot be assigned to the same cluster. Symmetrical. • Implicit: s_i and s_j are different • Any partition can be expressed as a collection of ML and CL constraints
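
A small sketch of how ML/CL constraints can be stored and checked against a candidate partition; the function and variable names here are mine, not from the talk:

```python
def satisfies_constraints(labels, ml, cl):
    """labels maps each point id to a cluster id; ml and cl are lists of
    (i, j) pairs of point ids."""
    # Every must-link pair must share a cluster.
    if any(labels[i] != labels[j] for i, j in ml):
        return False
    # Every cannot-link pair must be split across clusters.
    if any(labels[i] == labels[j] for i, j in cl):
        return False
    return True

# The last bullet in action: the partition {a, b | c} is captured exactly
# by the constraint set ML(a,b), CL(a,c), CL(b,c).
labels = {"a": 0, "b": 0, "c": 1}
print(satisfies_constraints(labels, ml=[("a", "b")], cl=[("a", "c"), ("b", "c")]))  # True
```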

  24. Unconstrained Clustering Example (Number of Clusters = 2). [Scatter plot: Height vs. Weight]

  25. Unconstrained Clustering Example (Number of Clusters = 2). [Scatter plot: Height vs. Weight; centroids marked ×]

  26. Unconstrained Clustering Example (Number of Clusters = 2). [Scatter plot: Height vs. Weight; centroids marked ×]

  27. Constrained Clustering Example (Number of Clusters = 2). [Scatter plot: Height vs. Weight; centroids marked ×; must-link and cannot-link pairs drawn between points]

  28. Cluster Level Constraints • Useful decision regions have: • Cluster diameters at most γ: a conjunction of cannot-links between all pairs of points whose distance is greater than γ • Clusters at least δ apart: a conjunction of must-links between all pairs of points whose distance is less than δ • We don’t need all such constraints; Davidson, Wagstaff, Basu 2006a discusses a useful subset
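
Both cluster-level constraints compile down to instance-level pairs. A sketch under the slide's definitions, writing the two thresholds as gamma (maximum diameter) and delta (minimum separation) and assuming Euclidean distance:

```python
from itertools import combinations
import numpy as np

def diameter_constraints(points, gamma):
    """CL every pair farther apart than gamma: no cluster can then
    have a diameter greater than gamma."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if np.linalg.norm(points[i] - points[j]) > gamma]

def separation_constraints(points, delta):
    """ML every pair closer than delta: by transitivity, any two points
    left in different clusters must then be at least delta apart."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if np.linalg.norm(points[i] - points[j]) < delta]
```

As the slide notes, these full conjunctions are quadratic in the number of points and largely redundant, which is why a useful subset suffices.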

  29. Constraint Language To Express Knowledge: Pandemic Results Example • FeaturesApart(Elevation=High, Elevation=Low) • NotMoreThanCTogether(2, School_1, School_2, …, School_n)
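
Under one plausible reading (the slide does not spell out the semantics), NotMoreThanCTogether(C, …) caps how many of the listed items may share a cluster. A hypothetical checker for that reading:

```python
from collections import Counter

def not_more_than_c_together(labels, c, items):
    """NotMoreThanCTogether(c, item_1, ..., item_n): no single cluster
    may contain more than c of the listed items."""
    counts = Counter(labels[i] for i in items)
    return all(v <= c for v in counts.values())

# Hypothetical example: with at most 2 schools per cluster allowed,
# putting three schools into cluster 0 violates the predicate.
labels = {"School1": 0, "School2": 0, "School3": 0}
print(not_more_than_c_together(labels, 2, ["School1", "School2", "School3"]))  # False
```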

  30. Can Also Use Constraints to Critique (Give Feedback) • Feedback ≡ incrementally specifying constraints • Positive feedback • Negative feedback • ML(x,y) ⇒ Not(CL(x,y)) • Do not re-run the mining algorithm from scratch • Efficiently modify the existing clustering to satisfy the feedback (joint work with Martin Ester, S.S. Ravi, and Mohammed Zaki)

  31. Outline • Knowledge enhanced mining with constraints • Motivation • How to add in domain expertise • Complexity results • Sufficient conditions and algorithms • Other work potentially applicable to sky survey data • Speeding up algorithms by scaling down data • Mining poor quality data • Mining with biased data sets

  32. Complexity Results: Can We Design Efficient Algorithms? • Unconstrained problem version: $\arg\min_C \sum_{s_i} D(C_{f(s_i)} - s_i)$, where $f(s_i)$ is the cluster identity function
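
Rendered as code, taking D to be squared Euclidean distance (the slide leaves the distortion function D abstract), this objective is the vector quantization error that appears again later:

```python
import numpy as np

def vqe(points, centroids, labels):
    """Vector quantization error: the sum over points s_i of
    D(C_f(s_i) - s_i), here with D(v) = ||v||^2."""
    # centroids[labels] looks up each point's assigned centroid.
    return float(np.sum((points - centroids[labels]) ** 2))
```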

  33. Complexity Results: Can We Design Efficient Algorithms? • Constrained problem version: $\arg\min_C \sum_{s_i} D(C_{f(s_i)} - s_i)$ s.t. ∀(i,j) ∈ ML: f(s_i) = f(s_j) and ∀(i,j) ∈ CL: f(s_i) ≠ f(s_j) • Feasibility sub-problem: is there any assignment satisfying the constraints at all? E.g., there is no solution for k = 2 with CL(x,y), CL(x,z), CL(y,z). • Important: relates to generating a feasible clustering

  34. Clustering Under Cannot-Link Constraints is Graph Coloring • Instances a through z • Constraints: ML(g,h); CL(a,c), CL(d,e), CL(f,g), CL(c,g), CL(c,f) • [Constraint graph: the ML pair g,h collapses to one node; CL edges join the pairs (a,c), (d,e), (f,g), (c,g), (c,f)] • This is the graph k-coloring problem
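
The reduction is easy to make concrete: take the transitive closure of the ML constraints, collapse each ML component into a single node, draw a CL edge between nodes, and ask whether the resulting graph is k-colorable. A brute-force sketch (exponential, as expected for an NP-complete problem; names are illustrative):

```python
from itertools import product

def feasible(n, ml, cl, k):
    """Is there any assignment of n points to k clusters satisfying ml/cl?"""
    # Union-find gives the transitive closure of the must-links.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in ml:
        parent[find(i)] = find(j)
    comps = sorted({find(i) for i in range(n)})
    # Cannot-link edges between ML components; a self-loop means an ML
    # chain and a CL constraint directly contradict each other.
    edges = {tuple(sorted((find(i), find(j)))) for i, j in cl}
    if any(u == v for u, v in edges):
        return False
    # Brute-force k-coloring of the component graph.
    for colors in product(range(k), repeat=len(comps)):
        color = dict(zip(comps, colors))
        if all(color[u] != color[v] for u, v in edges):
            return True
    return False

# The CL triangle from the previous slide: infeasible for k=2, feasible for k=3.
print(feasible(3, ml=[], cl=[(0, 1), (0, 2), (1, 2)], k=2))  # False
print(feasible(3, ml=[], cl=[(0, 1), (0, 2), (1, 2)], k=3))  # True
```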

  35. Sample of Feasibility Problem Complexity Results: Not So Bad • Bounded k (non-hierarchical clustering): Davidson and Ravi, Journal of DMKD, in press • Unbounded k: hierarchical clustering

  36. Other Implications of the Results For Algorithm Design: Getting Worse [Davidson and Ravi 2007b] • Algorithm design idea: find the best clustering Π that satisfies the most constraints in C • This can’t be done efficiently, and neither can: • repairing Π to satisfy C • minimally pruning C to satisfy Π

  37. Incrementally Adding In Constraints: Quite Bad [Davidson, Ester and Ravi 2007c] • User-centric mining • Given a clustering Π that satisfies a set of constraints C, • minimally modifying Π to satisfy C plus just one more ML or CL constraint is intractable.

  38. Outline • Knowledge enhanced mining with constraints • Motivation • How to add in domain expertise • Complexity results • Sufficient conditions and algorithms • Other work potentially applicable to sky survey data • Speeding up algorithms by scaling down data • Mining poor quality data • Mining with biased data sets

  39. Interesting Phenomena – CL Only [Davidson et al., DMKD Journal; AAAI06] • [Plot: results on the Cancer dataset] • No feasibility issues vs. phase transitions? [Wagstaff, Cardie 2002]

  40. Satisfying All Constraints (COP-k-Means) [Wagstaff, Thesis 2000] • The algorithm aims to minimize the vector quantization error (VQE) while satisfying all constraints: 1. Calculate the transitive closure over the ML points. 2. Replace each connected component with a single weighted point. 3. Randomly generate cluster centroids. 4. Loop until the change in VQE is small: assign each point to its nearest feasible centroid, then recalculate the centroids.
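
A sketch of the assignment step that distinguishes COP-k-means from plain k-means: each point, in arrival order, takes the nearest centroid that contradicts no constraint with already-assigned points, and the whole pass fails if no such centroid exists. This is illustrative code under that description, not Wagstaff's original implementation:

```python
import numpy as np

def violates(i, c, labels, ml, cl):
    """Would putting point i into cluster c break a constraint with a
    point that has already been assigned?"""
    for a, b in ml:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] is not None and labels[other] != c:
                return True
    for a, b in cl:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] == c:
                return True
    return False

def nearest_feasible_assignment(points, centroids, ml, cl):
    """Step 4 of the slide. Returns cluster labels, or None when some
    point has no feasible centroid left."""
    labels = [None] * len(points)
    for i, p in enumerate(points):
        # Try centroids from nearest to farthest.
        for c in np.argsort(np.linalg.norm(centroids - p, axis=1)):
            if not violates(i, int(c), labels, ml, cl):
                labels[i] = int(c)
                break
        else:
            return None  # dead end for this instance ordering
    return labels
```

Note that the assignment is order-dependent: the following slides show an ordering for which it dead-ends even though a feasible clustering exists.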

  41. COP-k-Means: Nearest Feasible Centroid Assignment • Step 4 in action: nearest feasible centroid assignment. [Scatter plot: Height vs. Weight; centroids marked ×; must-link and cannot-link pairs indicated]

  42. Why The Algorithm Fails • Explanation: the order in which instances are processed. • The data can be clustered for k = 2, but consider the instance ordering abc (assigned cluster 1), hi (cluster 1), de (cluster 2), jk (no feasible cluster left). [Scatter plots with centroids marked ×]


  45. Why The Algorithm Fails • Explanation: instance ordering. • Question: is there a sufficient condition under which, for any ordering of the points, the algorithm will converge?

  46. Why The Algorithm Fails • Explanation: instance ordering. • Question: is there a sufficient condition under which, for any ordering of the points, the algorithm will converge? • Yes, by Brooks’s theorem: restrict the constraint language so that the number of CL constraints on any one point is less than k (the number of clusters), i.e. k ≥ Δ + 1 for maximum CL degree Δ.
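
This sufficient condition is cheap to verify up front; a small sketch (the function name is mine):

```python
from collections import Counter

def safe_for_any_ordering(cl, k):
    """Greedy/Brooks-style bound: if every point appears in fewer than k
    cannot-link constraints, then whatever order the points arrive in,
    at least one of the k clusters is always feasible."""
    degree = Counter()
    for i, j in cl:
        degree[i] += 1
        degree[j] += 1
    return all(d < k for d in degree.values())

# The earlier CL triangle: every point has CL degree 2.
triangle = [("x", "y"), ("x", "z"), ("y", "z")]
print(safe_for_any_ordering(triangle, k=2))  # False
print(safe_for_any_ordering(triangle, k=3))  # True
```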

  47. We Can Also Reorder Points To Make Some Problem Instances “Easy” [Davidson et al. AAAI 2006] • [Irani 1984]: q-inductiveness of a graph • Theorem: if G(V,E) is q-inductive, then G can be clustered with k ≥ q + 1 clusters. • Any algorithm that processes the points in the reverse of the inductive order will always find a feasible solution. • Example: Brooks’s theorem demands k = 4 here, but the 1-inductive ordering {fg, l, abc, hi, jk, de} allows k = 2. [Scatter plot with centroids marked ×]
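
A q-inductive ordering can be computed greedily by repeatedly removing a minimum-degree node from the CL graph; processing points in the reverse of the returned order then guarantees each point has at most q already-assigned CL neighbours, so k ≥ q + 1 clusters always leave a feasible choice. A sketch, with names of my choosing:

```python
def inductive_ordering(nodes, cl_edges):
    """Return an ordering in which every node has at most q neighbours
    later in the list, together with that q (the graph's degeneracy)."""
    adj = {v: set() for v in nodes}
    for u, v in cl_edges:
        adj[u].add(v)
        adj[v].add(u)
    order, remaining = [], set(nodes)
    while remaining:
        # Peel off a node of minimum degree among those still remaining.
        v = min(remaining, key=lambda u: len(adj[u] & remaining))
        order.append(v)
        remaining.remove(v)
    q = max((len(adj[v] & set(order[i + 1:])) for i, v in enumerate(order)), default=0)
    return order, q
```

An algorithm that then assigns points in `reversed(order)` always finds a feasible solution whenever k ≥ q + 1.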

  48–50. Assignment #2
