Data Mining Classification: Basic Concepts

Presentation Transcript


  1. Data Mining Classification: Basic Concepts. Lecture Notes for Chapters 4-5, Introduction to Data Mining by Tan, Steinbach, Kumar.
  2. Classification: Definition. Given a collection of records (the training set), each record contains a set of attributes, one of which is the class. Find a model for the class attribute as a function of the values of the other attributes.
  3. Classification: Definition. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
  4. Illustrating Classification Task (figure): a learning algorithm learns a model from the training records (Learn Model), and the model is then applied to new records (Apply Model).
  5. Examples of Classification Task. Predicting tumor cells as benign or malignant. Classifying credit card transactions as legitimate or fraudulent. Classifying e-mail as spam. Classifying users in a social network. Categorizing news stories as finance, weather, entertainment, sports, etc.
  6. Classification Techniques Decision Tree Naïve Bayes Instance Based Learning Rule-based Methods Neural Networks Bayesian Belief Networks Support Vector Machines
  7. Classification: Measuring Quality. Usually the accuracy measure is used: Accuracy = (number of correctly classified records) / (total number of records in the test set).
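A minimal sketch of this accuracy computation, assuming the true and predicted labels are available as plain Python lists (the function and variable names are illustrative, not from the slides):

```python
def accuracy(y_true, y_pred):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Example: 4 of 5 test records classified correctly -> accuracy 0.8
print(accuracy(["No", "No", "Yes", "No", "Yes"],
               ["No", "No", "Yes", "No", "No"]))
```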
  8. Decision Tree. Uses a tree structure to model the training set. Classifies a new record by following a path in the tree. Inner nodes represent attribute tests and leaf nodes represent the class.
  9. Example of a Decision Tree (figure). Training data with categorical attributes (Refund, Marital Status), a continuous attribute (Taxable Income), and the class. Model: the root splits on Refund (Yes: leaf NO; No: split on MarSt, where Married gives leaf NO and Single/Divorced splits on TaxInc, with < 80K giving leaf NO and > 80K giving leaf YES).
  10. Another Example of Decision Tree (figure): here the root splits on MarSt (Married: leaf NO; Single/Divorced: split on Refund, where Yes gives leaf NO and No splits on TaxInc, with < 80K giving NO and > 80K giving YES). There could be more than one tree that fits the same data!
  11. Decision Tree Classification Task (figure): the learning algorithm learns a model in the form of a decision tree, which is then applied to new records.
  12.-17. Apply Model to Test Data (figure sequence). Start from the root of the tree and follow the branches that match the test record's attribute values: Refund first, then MarSt, then TaxInc if needed. For the example test record the path ends in a NO leaf, so Cheat is assigned to "No".
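As a rough sketch, the tree from the preceding slides can be written as nested conditionals; the test record is a Python dict whose keys (Refund, MarSt, TaxInc) are chosen here for illustration:

```python
def classify(record):
    """Decision tree from the slides: test Refund, then MarSt, then TaxInc."""
    if record["Refund"] == "Yes":
        return "NO"
    if record["MarSt"] == "Married":
        return "NO"
    # Single or Divorced: decide on taxable income (in thousands)
    return "YES" if record["TaxInc"] > 80 else "NO"

# Example test record: Refund = No, Married, Taxable Income = 80K -> class NO
print(classify({"Refund": "No", "MarSt": "Married", "TaxInc": 80}))
```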
  18. Decision Tree Classification Task (figure): the tree induction algorithm learns a model in the form of a decision tree, which is then applied to test data.
  19. Decision Tree Induction. Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5 (J48 in WEKA), SLIQ, SPRINT.
  20. Tree Induction Greedy strategy. Split the records based on an attribute test that optimizes certain criterion. Issues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting
  21. Tree Induction Greedy strategy. Split the records based on an attribute test that optimizes certain criterion. Issues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting
  22. How to Specify Test Condition? Depends on attribute types Nominal Ordinal Continuous Depends on number of ways to split 2-way split Multi-way split
  23. Splitting Based on Nominal Attributes (figure). Multi-way split: use as many partitions as distinct values (e.g. Marital Status into Single, Divorced, Married). Binary split: divides the values into two subsets, and we need to find the optimal partitioning (e.g. {Divorced, Single} vs. {Married}, or {Married, Single} vs. {Divorced}).
  24. Splitting Based on Ordinal Attributes (figure). Multi-way split: use as many partitions as distinct values. We can imagine an attribute Size defined over the ordered set {Small, Medium, Large}.
  25. Splitting Based on Ordinal Attributes (figure). Binary split: divides the values into two subsets, and we need to find the optimal partitioning, e.g. {Small, Medium} vs. {Large} or {Small} vs. {Medium, Large}. What about the split {Small, Large} vs. {Medium}? It does not respect the ordering of the values.
  26. Splitting Based on Continuous Attributes. Different ways of handling: Discretization to form an ordinal categorical attribute, either static (discretize once at the beginning) or dynamic (ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering). Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut, which can be more compute intensive.
  27. Splitting Based on Continuous Attributes (figure).
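As a rough illustration of the binary-decision case, the sketch below scans candidate cut points v on one continuous attribute and keeps the one with the lowest weighted Gini impurity. The helper names and the small income/class sample are chosen for illustration, in the spirit of the slides' example:

```python
def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    """Try a cut between each pair of adjacent sorted values; return (v, impurity)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        v = (pairs[i - 1][0] + pairs[i][0]) / 2          # candidate threshold
        left = [l for x, l in pairs if x < v]
        right = [l for x, l in pairs if x >= v]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if w < best[1]:
            best = (v, w)
    return best

# Taxable income (in K) with the Cheat class; prints the best threshold and its impurity
print(best_cut([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
               ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]))
```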
  28. Tree Induction Greedy strategy. Split the records based on an attribute test that optimizes certain criterion. Issues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting
  29.-32. Example (figure sequence): the tree is grown step by step. First split on MarSt: the Married branch is pure, {NO:4, YES:0}, and becomes a NO leaf. The Single/Divorced branch is split on Refund: the Refund = Yes branch is pure, {NO:2, YES:0}, and becomes a NO leaf. The Refund = No branch is split on TaxInc: < 80K gives {NO:1, YES:0}, a NO leaf, and >= 80K gives {NO:0, YES:3}, a YES leaf.
  33. How to Determine the Best Split. Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity. A non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity.
  34. Measures of Node Impurity. Given a node t with class proportions p(j|t): Gini index, GINI(t) = 1 - Σj [p(j|t)]²; entropy, Entropy(t) = -Σj p(j|t) log2 p(j|t); misclassification error, Error(t) = 1 - maxj p(j|t).
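A small sketch of the three impurity measures, computed from the list of class counts at a node (helper names are illustrative):

```python
import math

def _proportions(counts):
    n = sum(counts)
    return [c / n for c in counts if c > 0]

def gini(counts):
    return 1.0 - sum(p * p for p in _proportions(counts))

def entropy(counts):
    return -sum(p * math.log2(p) for p in _proportions(counts))

def misclassification_error(counts):
    return 1.0 - max(_proportions(counts))

# A node with 3 records of each of two classes has maximum impurity
print(gini([3, 3]), entropy([3, 3]), misclassification_error([3, 3]))  # 0.5 1.0 0.5
```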
  35. Tree Induction Greedy strategy. Split the records based on an attribute test that optimizes certain criterion. Issues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting
  36. Stopping Criteria for Tree Induction Stop expanding a node when all the records belong to the same class Stop expanding a node when all the records have similar attribute values
  37. Decision Tree Based Classification Advantages: Inexpensive to construct Extremely fast at classifying unknown records Easy to interpret for small-sized trees Accuracy is comparable to other classification techniques for many simple data sets
  38. Naive Bayes. Uses probability theory to model the training set. Assumes independence between attributes. Produces a model for each class.
  39. Bayes Theorem. Conditional probability: P(C|A) = P(A,C) / P(A), P(A|C) = P(A,C) / P(C). Bayes theorem: P(C|A) = P(A|C) P(C) / P(A).
  40. Example of Bayes Theorem. Given: a doctor knows that meningitis causes headache 50% of the time; the prior probability of any patient having meningitis is 1/50,000; the prior probability of any patient having headache is 1/20. If a patient has headache, what is the probability he/she has meningitis? (M = meningitis, S = headache)
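Working the numbers through Bayes theorem (a standard calculation, spelled out here for completeness):

```latex
P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)}
            = \frac{0.5 \times 1/50{,}000}{1/20}
            = 0.0002
```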
  41. Bayesian Classifiers Consider each attribute and class label as random variables Given a record with attributes (A1, A2,…,An) Goal is to predict class C Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An ) Can we estimate P(C| A1, A2,…,An ) directly from data?
  42. Bayesian Classifiers. Approach: compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem, and choose the value of C that maximizes P(C | A1, A2, …, An). This is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C). How to estimate P(A1, A2, …, An | C)?
  43. Naïve Bayes Classifier. Assume independence among the attributes Ai when the class is given: P(A1, A2, …, An | Cj) = P(A1|Cj) P(A2|Cj) … P(An|Cj). We can estimate P(Ai|Cj) for all Ai and Cj. A new point is classified to Cj if P(Cj) Πi P(Ai|Cj) is maximal.
  44. How to Estimate Probabilities from Data? Class: P(C) = Nc / N, e.g. P(No) = 7/10, P(Yes) = 3/10. For discrete attributes: P(Ai | Ck) = |Aik| / Nc, where |Aik| is the number of instances having attribute value Ai that belong to class Ck. Examples: P(Status=Married|No) = 4/7, P(Refund=Yes|Yes) = 0.
  45. How to Estimate Probabilities from Data? For continuous attributes: Discretize the range into bins (one ordinal attribute per bin; this violates the independence assumption). Two-way split, (A < v) or (A > v): choose only one of the two splits as the new attribute. Probability density estimation: assume the attribute follows a normal distribution, use the data to estimate the parameters of the distribution (e.g. mean and standard deviation); once the probability distribution is known, it can be used to estimate the conditional probability P(Ai|c).
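For the normal-distribution option, the conditional probability is estimated with the Gaussian density, plugging in the sample mean and variance of attribute Ai within class c (the standard formula, which the extracted slide does not show):

```latex
P(A_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{ic}^{2}}}
                \exp\!\left(-\frac{(A_i - \mu_{ic})^{2}}{2\sigma_{ic}^{2}}\right)
```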
  46. How to Estimate Probabilities from Data? Compute: P(Status=Married|Yes) = ? P(Refund=Yes|No) = ? P(Status=Divorced|Yes) = ? P(TaxableInc > 80K|Yes) = ? P(TaxableInc > 80K|NO) = ?
  47. How to Estimate Probabilities from Data? Compute: P(Status=Married|Yes) = 0/3 P(Refund=Yes|No) = 3/7 P(Status=Divorced|Yes) = 1/3 P(TaxableInc > 80K|Yes) = 3/3 P(TaxableInc > 80K|NO) = 4/7
  48. Example of Naïve Bayes Classifier. Given a test record X = (Refund=No, Married, Income >= 80K): P(Cj | A1, …, An) ∝ P(A1, …, An | Cj) P(Cj) = P(A1|Cj) … P(An|Cj) P(Cj).
  49. Example of Naïve Bayes Classifier. P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income>=80K|Class=No) = 4/7 × 4/7 × 4/7 = 0.1865. P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income>=80K|Class=Yes) = 1 × 0 × 1 = 0. P(X|No)P(No) = 0.1865 × 0.7 = 0.1306; P(X|Yes)P(Yes) = 0 × 0.3 = 0. Since P(X|No)P(No) > P(X|Yes)P(Yes), we have P(No|X) > P(Yes|X), so Class = No.
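The same computation as a code sketch, with the class priors and conditional probabilities from the slides hard-coded (the dictionary layout is only for illustration):

```python
# Priors and per-class conditional probabilities taken from the slides
prior = {"No": 7 / 10, "Yes": 3 / 10}
cond = {
    "No":  {"Refund=No": 4 / 7, "Married": 4 / 7, "Income>=80K": 4 / 7},
    "Yes": {"Refund=No": 1.0,   "Married": 0.0,   "Income>=80K": 1.0},
}

record = ["Refund=No", "Married", "Income>=80K"]

scores = {}
for c in prior:
    p = prior[c]
    for a in record:
        p *= cond[c][a]              # naive independence assumption
    scores[c] = p

print(scores)                        # {'No': 0.1306..., 'Yes': 0.0}
print(max(scores, key=scores.get))   # -> 'No'
```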
  50. Example of Naïve Bayes Classifier (2). Given a test record X = (Refund=No, Single, Income >= 80K): P(A1, A2, …, An | C) P(C) = P(A1|C) P(A2|C) … P(An|C) P(C).
  51. Example of Naïve Bayes Classifier (2). P(X|Class=No) = P(Refund=No|Class=No) × P(Single|Class=No) × P(Income>=80K|Class=No) = 4/7 × 2/7 × 4/7 = 0.0933. P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Single|Class=Yes) × P(Income>=80K|Class=Yes) = 1 × 2/3 × 1 = 0.666. P(X|No)P(No) = 0.0933 × 0.7 = 0.06531; P(X|Yes)P(Yes) = 0.666 × 0.3 = 0.1998. Since P(X|No)P(No) < P(X|Yes)P(Yes), we have P(No|X) < P(Yes|X), so Class = Yes.
  52. Naïve Bayes (Summary). Robust to isolated noise points. Models each class separately. Robust to irrelevant attributes. Uses the whole set of attributes to perform classification. The independence assumption may not hold for some attributes.
  53. Ensemble Learning. Roberto Esposito and Dino Ienco, dino.ienco@teledetection.fr, http://www2.lirmm.fr/~ienco/Dino_Ienco_Home_Page/Index.html

    Acknowledgements. Most of the material is based on Nicholaus Kushmerick's slides; you can find his original slideshow at www.cs.ucd.ie/staff/nick/home/COMP-4030/L14,15.ppt. Several pictures are taken from the slides by Thomas Dietterich; you can find his original slideshow (see the slides about Bias/Variance theory) at http://web.engr.oregonstate.edu/~tgd/classes/534/index.html
  54. Agenda. 1. What is ensemble learning. 2. (Ensemble) Classification: 2.1 Bagging, 2.2 Boosting, 2.3 Why do ensemble classifiers work? 3. (Ensemble) Clustering: Cluster-based Similarity Partitioning Algorithm (CSPA), HyperGraph-Partitioning Algorithm (HGPA), some hints on how to build base clusterings.
  55. part 1. What is ensemble learning? Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions [Freund & Schapire, 1995]
  56. Ensemble learning (figure): the base learners are trained on different training sets and/or with different learning algorithms, and their predictions are combined.
  57. How to make an effective ensemble? Two basic decisions when designing ensembles: How to generate the base models h1, h2, …? How to integrate/combine them, F(h1(x), h2(x), …)?
  58. Ensemble Classification. Ensemble of classifiers: How do we generate each classifier in an ensemble scheme (which training set is used)? How do we combine the vote of each classifier? Why do ensemble classifiers work?
  59. Question 2: How to integrate them. Usually take a weighted vote: ensemble(x) = f( Σi wi hi(x) ), where wi is the "weight" of hypothesis hi; wi > wj means "hi is more reliable than hj"; typically wi > 0 (though one could have wi < 0, meaning "hi is more often wrong than right"). Fancier schemes are possible but uncommon.
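A minimal sketch of such a weighted vote for hypotheses that output +1 or -1 (names are illustrative):

```python
def weighted_vote(hypotheses, weights, x):
    """ensemble(x) = sign( sum_i w_i * h_i(x) ), with h_i(x) in {-1, +1}."""
    score = sum(w * h(x) for h, w in zip(hypotheses, weights))
    return 1 if score >= 0 else -1

# Two weak voters say +1, one more reliable voter says -1
hs = [lambda x: 1, lambda x: 1, lambda x: -1]
print(weighted_vote(hs, [0.3, 0.3, 1.0], None))  # -> -1
```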
  60. Question 1: How to generate base classifiers Lots of approaches… A. Bagging B. Boosting …
  61. BAGGing = Bootstrap AGGregation (Breiman, 1996). For i = 1, 2, …, K: Ti ← randomly select M training instances with replacement; hi ← learn(Ti) [ID3, NB, kNN, neural net, …]. Now combine the hi together with uniform voting (wi = 1/K for all i).
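A sketch of this loop, assuming X and y are NumPy arrays and the base learner follows the scikit-learn fit/predict convention (here a DecisionTreeClassifier; only the bagging loop itself comes from the slide):

```python
import numpy as np
from collections import Counter
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, K=10, M=None, base=DecisionTreeClassifier()):
    """Train K models, each on M instances drawn with replacement."""
    M = M or len(X)
    models = []
    for _ in range(K):
        idx = np.random.randint(0, len(X), size=M)   # bootstrap sample T_i
        models.append(clone(base).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    """Uniform vote (w_i = 1/K for all i)."""
    votes = [m.predict([x])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]
```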
  62. (Figure) The base learner is a decision tree learning algorithm, along the lines of ID3.
  63. (Figure) Shades of blue/red indicate the strength of the vote for a particular classification.
  64. Boosting. Bagging was one simple way to generate ensemble members, with trivial (uniform) vote weighting; boosting is another. "Boost" as in "give a hand up to": suppose A can learn a hypothesis that is better than rolling a dice, but perhaps only a tiny bit better. Theorem: boosting A yields an ensemble with arbitrarily low error on the training data! (Figure: ensemble error rate versus ensemble size, from 1 to about 500, compared with the 50% error rate of flipping a coin and the 49% error rate of A by itself.)
  65. Boosting. Idea: assign a weight to every training set instance; initially, all instances have the same weight. As boosting proceeds, it adjusts the weights based on how well we have predicted the data points so far: data points correctly predicted get low weight, data points mispredicted get high weight. Result: as learning proceeds, the learner is forced to focus on portions of the data space not previously well predicted.
  66. Time=0 (figure): blue/red = class, size of dot = weight, hypothesis = a horizontal or vertical line.
  67. Time=1 (figure): the weak learner (WL) error is 30%; the ensemble error is 30% (note T=1).
  68. Time=3
  69. Time=7
  70. Time=21 (figure). Notice the slope of the weak learner error: AdaBoost creates problems of increasing difficulty.
  71. Time=51 (figure). The training error is zero; one could think that we cannot improve the test error any more.
  72. Time=57 (figure). But the test error still decreases!
  73. Time=110
  74. AdaBoost (Freund and Schapire). Binary class y ∈ {0,1}. Initialize the instance weights to 1/N. At each round t: normalize the weights to get a probability distribution pt (Σi pt,i = 1); train the weak learner ht on pt and measure its weighted error εt ∈ [0,1], which penalizes mistakes on high-weight instances more; if ht correctly classifies xi, multiply its weight by βt = εt / (1 - εt) < 1, otherwise multiply the weight by 1. The final prediction is a weighted vote, with wt = log(1/βt).
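A compact sketch of boosting for labels in {-1, +1}, using the common formulation with alpha_t = 0.5 * ln((1 - eps_t) / eps_t); it follows the same weight-update idea as the slide's beta scheme but is not a literal transcription of it. It assumes NumPy arrays and a scikit-learn base learner that accepts per-instance weights via sample_weight:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """y must be in {-1, +1}. Returns (models, alphas)."""
    n = len(X)
    w = np.full(n, 1.0 / n)                          # start with uniform weights
    models, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)      # decision stump as weak learner
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)   # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)
        w *= np.exp(-alpha * y * pred)               # up-weight mistakes, down-weight hits
        w /= w.sum()                                 # renormalize to a distribution
        models.append(h)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    score = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(score)
```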
  75. Learning from weighted instances? One piece of the puzzle is missing: so far, learning algorithms have just taken as input a set of equally important learning instances. Reweighting: what if we also get a weight vector saying how important each instance is? It turns out it is very easy to modify most learning algorithms to deal with weighted instances. ID3: easy to modify the entropy and information-gain equations to take into consideration the weights associated to the examples, rather than only the counts (which simply assume all weights = 1). Naïve Bayes: ditto. k-NN: multiply the vote from an instance by its weight.
  76. Learning from weighted instances? Resampling. As an alternative to modifying learning algorithms to support weighted datasets, we can build a new dataset which is not weighted but shows the same properties as the weighted one. 1. Let L' be the empty set. 2. Let (w1, ..., wn) be the weights of the examples in L, sorted in some fixed order (we assume wi corresponds to example xi). 3. Draw r ∈ [0,1] according to U(0,1). 4. Set L' ← L' ∪ {xk}, where k is the index such that w1 + ... + w(k-1) < r ≤ w1 + ... + wk (assuming the weights sum to 1). 5. If enough examples have been drawn, return L'; else go to 3.
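A sketch of this resampling loop; the cumulative-sum condition in step 4 is the standard inverse-CDF trick, stated here because the extracted slide leaves it implicit:

```python
import random
from bisect import bisect_left
from itertools import accumulate

def resample(examples, weights, size=None):
    """Draw `size` examples with replacement, proportionally to their weights."""
    size = size or len(examples)                         # rule of thumb: |L'| = |L|
    total = sum(weights)
    cum = list(accumulate(w / total for w in weights))   # cumulative distribution
    sample = []
    for _ in range(size):
        r = random.random()                              # r ~ U(0,1)
        k = min(bisect_left(cum, r), len(cum) - 1)       # smallest k with cum[k] >= r
        sample.append(examples[k])
    return sample

print(resample(["a", "b", "c"], [0.1, 0.1, 0.8]))        # mostly "c"
```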
  77. Learning from weighted instances? How many examples are "enough"? The higher the number, the better L' approximates a dataset following the distribution induced by W. As a rule of thumb, |L'| = |L| usually works reasonably well. Why don't we always use resampling instead of reweighting? Resampling can always be applied, but unfortunately it requires more resources and produces less accurate results. One should use this technique only when it is too costly (or unfeasible) to use reweighting.
  78. Why do ensemble classifiers work? Three reasons (figure): 1. statistical, 2. representational, 3. computational. [T. G. Dietterich. Ensemble methods in machine learning. Lecture Notes in Computer Science, 1857:1-15, 2000.]
  79. 1. Statistical. Given a finite amount of data, many hypotheses are typically equally good. How can the learning algorithm select among them? Optimal Bayes classifier recipe: take a weighted majority vote of all hypotheses, weighted by their posterior probability; that is, put most weight on hypotheses consistent with the data. Hence, ensemble learning may be viewed as an approximation of the Optimal Bayes rule (which is provably the best possible classifier).
  80. 2. Representational The desired target function may not be implementable with individual classifiers, but may be approximated by ensemble averaging Suppose you want to build a decision boundary with decision trees The decision boundaries of decision trees are hyperplanes parallel to the coordinate axes. By averaging a large number of such “staircases”, the diagonal decision boundary can be approximated with arbitrarily good accuracy
  81. 3. Computational. All learning algorithms do some sort of search through a space of hypotheses to find one that is "good enough" for the given training data. Since interesting hypothesis spaces are huge/infinite, heuristic search is essential (e.g. ID3 does greedy search in the space of possible decision trees), so the learner might get stuck in a local minimum. One strategy for avoiding local minima is to repeat the search many times with random restarts: bagging.
  82. Reading Dietterich: Ensemble methods in machine learning (2000). Schapire: A brief introduction to boosting (1999). [Sec 1-2, 5-6] Dietterich & Bakiri: Solving multiclass learning problems via error-correcting output codes (1995). [Skim]
  83. Summary… Ensemble classifiers: basic motivation: creating a committee of experts is more effective than trying to derive a single super-genius. Key issues: generation of the base models and integration of the base models. Popular ensemble techniques manipulate the training data: bagging and boosting (an ensemble of "experts", each specializing on different portions of the instance space). Current research: ensemble pruning (reduce the number of classifiers, selecting only non-redundant and informative ones).
  84. Ensemble Clustering. Ensemble of clusterings (partitions): How to formulate the problem? How to define how far apart (or how similar) two different clustering solutions are? How to combine different partitions (CSPA, HGPA)? How to generate different partitions?
  85. Example. We have a dataset of 7 examples; we run 3 clustering algorithms and obtain 3 different clustering solutions: C1 = (1,1,1,2,2,3,3), C2 = (1,1,2,2,3,3,3), C3 = (2,2,2,3,3,1,1). Each clustering solution is represented as a vector with as many components as the number of original examples (7), and each component holds the cluster label assigned to the corresponding example. How much information is shared among the different partitions? How do we combine them?
  86. How to formulate the problem. GOAL: seek a final clustering that shares the most information with the original clusterings, i.e. find a partition that shares as much information as possible with all the individual clustering results. If we have N clustering results C1, …, CN, we want to obtain a solution Copt such that Copt = argmax_C Σ_{i=1..N} φ(C, Ci),
  87. How to formulate the problem. where Ci is a clustering solution, Copt is the optimal solution to the clustering ensemble problem, and φ is a measure able to evaluate the similarity between two clustering results.
  88. How to define similarity between partitions. How can we define the function φ? We use the Normalized Mutual Information (original paper: Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions, Alexander Strehl and Joydeep Ghosh), a normalized version of Mutual Information; Mutual Information is usually used to evaluate the correlation between two random variables.
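For reference, the normalization used by Strehl and Ghosh divides the mutual information of two partitions by the geometric mean of their entropies (stated here from the cited paper rather than from the extracted slide):

```latex
\mathrm{NMI}(C_a, C_b) = \frac{I(C_a, C_b)}{\sqrt{H(C_a)\,H(C_b)}}
```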
  89. How to combine different partitions? Cluster-based Similarity Partitioning Algorithm (CSPA). Simple approach: given an ensemble of partitions, if M is the number of objects, build an M x M matrix; each cell of the matrix contains how many times 2 objects co-occur in a cluster; the matrix can be seen as an object-object similarity matrix; re-cluster the matrix with a clustering algorithm (in the original paper they apply METIS, a graph-based clustering approach, to re-cluster the similarity matrix).
  90. Example (I). Given the different partitions of the dataset, C1 = (1,1,1,2,2,3,3), C2 = (1,1,2,2,3,3,3), C3 = (2,2,2,3,3,1,1), we obtain the co-occurrence matrix shown in the figure.
  91. Example (II). From the co-occurrence matrix we obtain a co-occurrence graph over the objects X1, …, X7 (figure), where the weight of the edge between two objects is the number of clusterings in which they co-occur.
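A sketch that builds the co-occurrence (object-object similarity) matrix for the three example partitions; each entry counts in how many clusterings the two objects fall in the same cluster:

```python
import numpy as np

C1 = [1, 1, 1, 2, 2, 3, 3]
C2 = [1, 1, 2, 2, 3, 3, 3]
C3 = [2, 2, 2, 3, 3, 1, 1]

def cooccurrence(partitions):
    m = len(partitions[0])                 # number of objects (7 here)
    S = np.zeros((m, m), dtype=int)
    for labels in partitions:
        for i in range(m):
            for j in range(m):
                if labels[i] == labels[j]:
                    S[i, j] += 1           # objects i and j share a cluster
    return S

print(cooccurrence([C1, C2, C3]))
# e.g. X1 and X2 co-occur in all 3 partitions, X3 and X4 in only 1
```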
  92. How to combine different partitions? HyperGraph-Partitioning Algorithm (HGPA). Hypergraph-based approach: given an ensemble of partitions, each cluster of each partition is seen as a hyperedge in the hypergraph, each object is a node, and all nodes and hyperedges have the same weight. Try to remove hyperedges so as to obtain K unconnected partitions of approximately the same size, applying standard hypergraph-partitioning techniques.
  93. Hypergraph. A hypergraph is defined as a set of nodes plus a set of hyperedges over the nodes. Each hyperedge is not a relation between 2 nodes; it is a relation among a set of nodes. Hyperedge: defined as a set of nodes that are in some relation with each other.
  94. Example. Given the different partitions of the dataset, C1 = (1,1,1,2,2,3,3), C2 = (1,1,2,2,3,3,3), C3 = (2,2,2,3,3,1,1), and a dataset composed of 7 examples, we obtain the hypergraph shown in the figure (nodes X1, …, X7, with one hyperedge per cluster).
  95. How to generate different partitions (I). Ways to generate different partitions: 1) Using the same clustering algorithm: a) fix the dataset and change the parameter values in order to obtain different results; b) produce as many random projections of the original dataset as the number of partitions needed, and then use the clustering algorithm to obtain a partition over each projected dataset.
  96. How to generate different partitions (II). Ways to generate different partitions: 2) Using different clustering algorithms: a) fix the dataset and run as many clustering algorithms as you need; if the number of available algorithms is smaller than the number of partitions needed, re-run the same algorithms with different parameters; b) produce some random projections of the dataset and then apply the different clustering algorithms.
  97. Reading. Strehl and Ghosh: Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions (2002). Al-Razgan and Domeniconi: Weighted Clustering Ensembles (2006). Fern and Lin: Cluster Ensemble Selection (2008).
  98. Summary… Ensemble clustering: basic motivation: no single clustering algorithm can adequately handle all types of cluster shapes and structures. Key issues: How to choose an appropriate algorithm? How to interpret the different partitions produced by different clustering algorithms? This is more difficult than ensemble classification. Popular ensemble techniques combine partitions obtained in different ways, using similarity between partitions or co-occurrence. Current research: cluster selection (reduce the number of clusterings, selecting only non-redundant and informative ones).