300 likes | 454 Views
Decision Tree Classifiers. Abhishek Agarwal CS 5368: Intelligent Systems Fall 2010. Paradigms of learning Reinforcement Learning The agent receives some evaluation of actions (like fine for stealing bananas) but is not told the correct action (such as how to buy bananas)
E N D
Decision Tree Classifiers AbhishekAgarwal CS 5368: Intelligent Systems Fall 2010
Paradigms of learning • Reinforcement Learning • The agent receives some evaluation of actions (like fine for stealing bananas) but is not told the correct action (such as how to buy bananas) • Unsupervised Learning • Agent learns patterns in the input even though no explicit feedback is supplied. Data Mining. • Supervised Learning • Both Inputs and outputs are given. • Agent learns a function that maps from input to output. Input output: (x1, y1) (x2, y2) (x3, y3) ….…… (xn, yn) where each yi is generated by an unknown function y=f(x) Our goal: Discover function h that approximates the true function f
We can say that we have x,y values, therefore we can map values directly and get y for any x. Why do we need a hypothesis? • Do we know all x,y pairs possible in the domain? • What happens if we do? • Where is the learning part? What is classification? Hypothesis. Consistent and Inconsistent. Ockham’s Razor - Choose the simplest hypothesis consistent with the data. Why Simplest?
Decision Tree Induction One of the simplest and most successful forms of machine learning. Takes as input a vector of attribute values. Returns a single output value decision. Decision - To wait for a table at a restaurant or not? Goal – To come up with a function, which gives a boolean output WillWait. Attributes: • Alternate: Whether there is a suitable alternative restaurant nearby. • Bar: Whether the restaurant has a comfortable waiting lounge. • Fri/Sat: Is it a Friday Saturday. • Hungry: Whether we are hungry. • Patrons: How many people are in the restaurant (values are None, Some, Full) • The restaurant’s pricing range($, $$, $$$) • Raining: Whether it is raining outside. • Type: The kind of restaurant(French, Italian, Thai, Burger) • Reservation: Whether we made a reservation • WaitEstimate: The wait estimated by the waiter (0-10, 10-30, 30-60 or 60>)
Observations.. • It represents a human like thinking pattern. We take different attributes into consideration one by one and arrive at a conclusion for many problems. • A decision tree reaches a conclusion by performing a series of tests. • Each internal node in the tree corresponds to a test of the value of an attribute. • The branches from the nodes represent possible values of the attributes. • Each leaf node represents the final value to be returned by the function. • Price and type attributes are ignored? Why, we will see later.
Decision Tree – Supervised LearningWhat is the form of supervision? Ans. Examples of input output pairs
Classification of examples is positive (T) or negative (F) • Did you notice? • We need to come up with a short and succinct tree based on these examples. • We do not want to check every attribute for making every decision. • Many attributes can become don’t cares in may decision situations. • These don’t cares should be utilized to reduce the size of the decision tree. • If we see that a restaurant is full, then we do not care about any other attribute • The sample set is incomplete and maybe noisy.
Expressiveness of Decision Trees The hypothesis space for decision trees is very large. Consider the set of all Boolean functions on n attributes. How many distinct decision trees with n Boolean attributes? = number of Boolean functions = number of distinct truth tables with 2n rows = 22n • E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees More expressive hypothesis space • increases chance that target function can be expressed • increases number of hypotheses consistent with training set may get worse predictions
Learning Algorithm • Aim: find a small tree consistent with the training examples • Idea: (recursively) choose "most significant" attribute as root of (sub)tree functionDTL(examples, attributes, parents_examples) returns a decision tree { if examples is empty thenreturn MAJORITY_VALUE(parent_examples) else if all examples have all same classification then return the classification else if attributes is empty then return MAJORITY_VALUE(examples) else best CHOOSE_BEST_ATTRIBUTE(attributes, examples) Tree a new decision tree with root test best For each value vi of bestdo examplesi { elements of examples with best = vi } subtree DTL( examplesi, attributes – best, MAJORITY_VALUE(examples)) add a branch to the tree with label viand subtreesubtree returnTree }
let’s consider we have chosen patron as the root attribute. It has split the examples into 3 new cases. Each case or outcome is a new decision tree learning problem in itself with fewer examples and one less attribute. This leaves us with 4 cases to ponder on when we will be constructing the sub-tree. • If the remaining examples are all positive or all negative, then we have an answer yes/no. Leaf. • If we have some positive or some negative values then we can choose the next best attribute to split them. A new internal node. E.g. Hungry. • If there are no examples left, it means that no examples have been observed for this combination. We return a yes/no (leaf) based on the majority value at the parent node. • If there are no attributes left but both positive and negative examples left, it means that these examples have exactly the same description but different classification. This represents error or noise in data. Return the majority value of the examples.
Choosing an attribute Greedy Part Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative“. Minimize tree depth. Patrons? A better choice
Information Theory Why? We need a formal measure of good and bad attributes. Formal measure = Information Gain defined in terms of entropy. What is Entropy? Information content
Entropy is a measure of information or uncertainty of a random variable. More the uncertainty, more the information, thus higher the entropy. For example if we toss a coin which always falls on head, we gain no information by tossing it, so zero entropy. However if we toss a fair coin, we are unsure of the outcome. So we get some information out of tossing it. It has entropy one bit. • Entropy: H(V) = ∑ P(Vk) log2(1/P(Vk)) = -∑ P(Vk) log2P(Vk) • For a fair coin H(Fair) = - (0.5log2 0.5 + 0.5 log2 0.5) = 1 • For a Biased coin H(Biased) = - 1 log2 (1) = 0
Applying Information Theory • Measure Information Content in the beginning • Measure the change in Information Content an attribute brings • Which attribute brings the MAX change? Eureka. Finding Max Information Gain – Lets start For a training set containing p positive examples and n negative examples:
A chosen attribute A divides the training set E into subsets E1, … , Ev according to their values for A, where A has v distinct values. • Information Gain (IG) or reduction in entropy from the attribute test: • Choose the attribute with the largest Gain Information Gain
Information Gain - Example For the training set, p = n = 6, I(6/12, 6/12) = 1 bit Consider the attributes Patrons and Type (and others too): Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root
Example continued Decision tree learned from the 12 examples: Substantially simpler tree---a more complex hypothesis isn’t justified by small amount of data
Performance measurement How do we know that h ≈ f ? • Use theorems of computational/statistical learning theory • Try h on a new test set of examples (use same distribution over example space as training set) Learning curve = % correct on test set as a function of training set size
Is it that simple? Overfitting Overfitting is a scenario where the DECISION_TREE_LEARNING_ALGORITHM will generate a large tree when there is actually no pattern to be found. Problem with all types of learners. Reasons for overfitting • Some attributes have little meaning • Number of attributes more • Small training data set
Combating Overfitting – Pruning It works by eliminating nodes that are not clearly relevant. • We look at the test nodes that has only leaf nodes as descendents. • If the test appears to be irrelevant – detecting only noise in the data- then we eliminate the test, replacing it with a leaf node. • We repeat this process, considering each test with only leaf nodes as descendents until each one has either been pruned or accepted as is.
Question? How do we detect that a node is testing an irrelevant attribute? • Suppose we are at a node consisting of p positive and n negative examples. • If the attribute is irrelevant we would expect that it would split the examples into subsets that have roughly the same proportion of positive examples as the whole set p/(p+n). The information gain is close to zero. • Information gain is a good clue to irrelevance.
Problems associated with Decision Trees. Missing Data Multi valued attributes Continuous and integer valued input attributes Continuous valued output attributes
References • Artificial Intelligence – A modern approach –Third edition by Russel and Norvig. • Video Lecture – Prof. P. Dasgupta, Dept. of Computer Science, IIT Kharagpur.