Announcements • No reading assignment for next week • Prepare for exam • Midterm exam next week
Last Time • Neural nets (briefly) • Decision trees • Today • More decision trees • Ensembles • Exam Review • Next time • Exam • Advanced ML topics
Overview of ID3 [Figure: ID3 recursively partitions the examples (+1 ... +5, -1 ... -3) by choosing a splitting attribute from A1 ... A4 at each node; a branch that receives no examples (*NULL*) uses the majority class at the parent node]
Example Info Gain Calculation
Color    Size    Class
Red      BIG     +
Blue     BIG     +
Red      SMALL   -
Yellow   SMALL   -
Red      BIG     +
(The table header also lists a Shape column; its values are missing from this transcript.)
Info Gain Calculation (contd.) Note that “Size” provides complete classification.
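The calculation can be sketched in a few lines of Python. This is a minimal illustration using the Color, Size and Class values from the example table; the function names `entropy` and `info_gain` are chosen here for illustration and are not from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def info_gain(examples, feature):
    """Entropy of the class minus the expected entropy after splitting on `feature`."""
    labels = [e["Class"] for e in examples]
    remainder = 0.0
    for value in {e[feature] for e in examples}:
        subset = [e["Class"] for e in examples if e[feature] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

# The five examples from the slide (Shape values are not available).
examples = [
    {"Color": "Red", "Size": "BIG", "Class": "+"},
    {"Color": "Blue", "Size": "BIG", "Class": "+"},
    {"Color": "Red", "Size": "SMALL", "Class": "-"},
    {"Color": "Yellow", "Size": "SMALL", "Class": "-"},
    {"Color": "Red", "Size": "BIG", "Class": "+"},
]
for feature in ("Color", "Size"):
    print(feature, round(info_gain(examples, feature), 3))
```

Size recovers the full class entropy (about 0.971 bits), which is what "complete classification" means here; Color gains only about 0.420 bits.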
Runtime Performance of ID3 • Let E = # examples, F = # features • At level 1: look at each feature; look at each ex (to get its feature value) • Work to choose 1 feature = O(F × E)
Runtime Performance of ID3 (cont.) • In the worst case, need to consider all features along all paths (full tree): O(F² × E) • Reasonably efficient
Generating Rules • Antecedent: conjunction of all decisions leading to terminal node • Consequent: label of terminal node • Example: [Figure: decision tree. COLOR? Green → -; Blue → +; Red → SIZE? (Big → +; Small → -)]
Generating Rules (cont.) • Generates rules: • Color=Green → - • Color=Blue → + • Color=Red and Size=Big → + • Color=Red and Size=Small → - • Note: 1. Can “clean up” the rule set (see Quinlan’s) 2. Decision trees learn disjunctive concepts
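A minimal sketch of the tree-to-rules conversion, assuming a tree is represented as either a class-label string (leaf) or a `(feature, branches)` pair. That representation and the name `tree_to_rules` are illustrative choices, not part of the slides.

```python
def tree_to_rules(tree, conditions=()):
    """Return one (antecedent, consequent) pair per leaf of the tree."""
    if isinstance(tree, str):  # leaf: the consequent is its label
        return [(list(conditions), tree)]
    feature, branches = tree
    rules = []
    for value, subtree in branches.items():
        # The antecedent is the conjunction of decisions on the path so far.
        rules.extend(tree_to_rules(subtree, conditions + ((feature, value),)))
    return rules

# The Color/Size tree from the example slide.
tree = ("Color", {
    "Green": "-",
    "Blue": "+",
    "Red": ("Size", {"Big": "+", "Small": "-"}),
})
for antecedent, label in tree_to_rules(tree):
    print(" and ".join(f"{f}={v}" for f, v in antecedent), "->", label)
```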
Noise: A Major Issue in ML • Worst case: +, - at same point in feature space • Causes 1. Too few features (“hidden variables”) or too few possible values 2. Incorrectly reported/measured/judged feature values 3. Misclassified instances
Noise: A Major Issue in ML (cont.) • Issue: overfitting • Producing an “awkward” concept because of a few “noisy” points [Figure: scatter of + and - examples in feature space, including a few points lying among examples of the other class] • Bad performance on future ex’s? Better performance?
Overfitting Viewed in Terms of Function-Fitting • Data = Red Line + Noise [Figure: data points (+) scattered around the model curve; axes x and f(x)]
Definition of Overfitting • Assuming a large enough test set so that it is representative, concept C overfits the training data if there exists a “simpler” concept S such that • Training set accuracy of C > Training set accuracy of S • but • Test set accuracy of C < Test set accuracy of S
Remember! • It is easy to learn/fit the training data • What’s hard is generalizing well to future (“test set”) data! • Overfitting avoidance is a key issue in Machine Learning
Can One Underfit? • Sure, if not fully fitting the training set, e.g., just return the majority category (+ or -) in the trainset as the learned model. • But also if there is not enough data to illustrate the important distinctions.
ID3 & Noisy Data • To avoid overfitting, allow splitting to stop before all ex’s are of one class. • Option 1: if info left < ε (a small threshold), don’t split • Empirically failed; bad performance on error-free data (Quinlan)
ID3 & Noisy Data (cont.) • Option 2: Estimate if all remaining features are statistically independent of the class of remaining examples • Uses the chi-squared test of the original ID3 paper • Works well on error-free data
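As a rough illustration of the independence check, the sketch below computes the Pearson chi-squared statistic for one feature against the class. It is a generic textbook statistic, not a reproduction of the exact procedure in the ID3 paper, and the name `chi_squared` is chosen here for illustration.

```python
from collections import Counter

def chi_squared(pairs):
    """Pearson chi-squared statistic for a list of (feature_value, class_label) pairs."""
    n = len(pairs)
    value_counts = Counter(v for v, _ in pairs)
    class_counts = Counter(c for _, c in pairs)
    observed = Counter(pairs)
    stat = 0.0
    for v, nv in value_counts.items():
        for c, nc in class_counts.items():
            expected = nv * nc / n  # count expected if feature and class were independent
            stat += (observed[(v, c)] - expected) ** 2 / expected
    return stat

# Feature value tells us nothing about the class: statistic is 0.
print(chi_squared([("a", "+"), ("a", "-"), ("b", "+"), ("b", "-")]))
# Feature value determines the class: statistic is large.
print(chi_squared([("a", "+"), ("a", "+"), ("b", "-"), ("b", "-")]))
```

The statistic is compared with a critical value for (# values - 1) × (# classes - 1) degrees of freedom; for 1 degree of freedom at the 0.05 level that value is about 3.84. Under Option 2, splitting stops when no remaining feature exceeds its critical value.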
ID3 & Noisy Data (cont.) • Option 3: (not in original ID3 paper) Build complete tree, then use some “spare” (tuning) examples to decide which parts of tree can be pruned.
ID3 & Noisy Data (cont.) • Pruning is currently the best choice; see C4.5 for technical details • Repeat using greedy algorithm
Greedily Pruning D-trees • Sample (hill-climbing) search space [Figure: search space of pruned trees; at each step move to the best candidate; stop if no improvement]
Pruning by Measuring Accuracy on Tune Set
1. Run ID3 to fully fit TRAIN’ set; measure accuracy on TUNE
2. Consider all subtrees where ONE interior node is removed and replaced by a leaf (labeled with the majority category in the pruned subtree); choose the best subtree on TUNE; if no improvement, quit
3. Go to 2
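The loop above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Node` class, the stored `majority` label, and the fallback to the majority label for unseen feature values are choices made here, not details given on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    majority: str                      # majority training label in this subtree
    feature: Optional[str] = None      # None means this node is a leaf
    children: Dict[str, "Node"] = field(default_factory=dict)

def predict(node, x):
    while node.feature is not None:
        child = node.children.get(x.get(node.feature))
        if child is None:              # unseen or missing value: use majority label
            break
        node = child
    return node.majority

def accuracy(root, examples):
    return sum(predict(root, x) == y for x, y in examples) / len(examples)

def interior_nodes(node):
    if node.feature is None:
        return []
    found = [node]
    for child in node.children.values():
        found.extend(interior_nodes(child))
    return found

def greedy_prune(root, tune):
    """Repeatedly apply the single cut that most improves TUNE accuracy."""
    best_acc = accuracy(root, tune)
    while True:
        best_node = None
        for node in interior_nodes(root):
            saved = (node.feature, node.children)
            node.feature, node.children = None, {}   # temporarily replace by a leaf
            acc = accuracy(root, tune)
            node.feature, node.children = saved      # restore the subtree
            if acc > best_acc:
                best_acc, best_node = acc, node
        if best_node is None:                        # no improvement: quit
            return root
        best_node.feature, best_node.children = None, {}
```

Each pass costs one TUNE-set evaluation per interior node, which is what makes the search cheap; the price is that it only ever commits to the single best cut at each step.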
The Tradeoff in Greedy Algorithm • Efficiency vs Optimality • E.g., IF the best pair of cuts on “Tune” is to discard C’s & F’s subtrees BUT the single best cut is to discard B’s subtree, THEN greedy search will not find the best tree • Greedy search: powerful, general-purpose trick of the trade [Figure: initial tree with nodes R, A, B, C, D, E, F]
Hypothetical Trace of a Greedy Algorithm • Full-tree accuracy = 85% on TUNE set • Bracketed numbers: accuracy if we replace this node with a leaf (leaving the rest of the tree the same) • Pruning @ B works best [Figure: tree with nodes R, A, B, C, D, E, F; annotations as extracted: R [64], A B [89] [77], C [88], D [63], F [87], E [74]]
Hypothetical Trace of a Greedy Algorithm (cont.) • Full-tree accuracy = 89% • STOP since no improvement by cutting again, and return the tree above [Figure: pruned tree with nodes R, A, B; annotations as extracted: R [64], A B [89] [77]]
Another Possibility: Rule Post-Pruning (also a greedy algorithm) • Induce a decision tree • Convert to rules (see earlier slide) • Consider dropping one rule antecedent • Delete the one that improves tuning set accuracy the most. • Repeat as long as progress is being made.
Rule Post-Pruning (cont.) • Advantages • Allows an intermediate node to be pruned from some rules but retained in others. • Can correct poor early decisions in tree construction. • Final concept more understandable.
Training with Noisy Data • If we can clean up the training data, should we do so? • No (assuming one can’t clean up the testing data when the learned concept will be used). • Better to train with the same type of data as will be experienced when the result of learning is put into use.
Overfitting + Noise • Using the strict definition of overfitting presented earlier, is it possible to overfit noise-free data? • In general? • Using ID3?
Example of Overfitting of Noise-free Data • Let • Correct concept = A ∧ B • Feature C be true 50% of the time, for both + and - examples • Prob(+ example) = 0.9 • Training set: • +: ABCDE, ABC¬DE, ABCD¬E • -: A¬B¬CD¬E, ¬AB¬C¬DE
Example (cont.)
Tree             Trainset Accuracy   Testset Accuracy
ID3’s            100%                50%
Simpler “tree”   60%                 90%
[Figure: ID3’s tree tests C (T → +, F → -); the simpler “tree” is the single leaf +]
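The four accuracies can be checked with a short calculation. The sketch assumes, as stated on the previous slide, that C is true half the time independently of the class and that 90% of examples are positive; the variable names are illustrative.

```python
p_pos = 0.9   # Prob(+ example), from the slide
p_c = 0.5     # Prob(C is true), the same for + and - examples

# ID3's tree: predict + when C is true, - when C is false.
# It is right on (C true, +) and on (C false, -).
id3_test_acc = p_c * p_pos + (1 - p_c) * (1 - p_pos)

# Simpler "tree": always predict +.
simple_test_acc = p_pos

# Training set: 3 positives (all with C true), 2 negatives (all with C false).
train = [(True, "+")] * 3 + [(False, "-")] * 2
id3_train_acc = sum(("+" if c else "-") == y for c, y in train) / len(train)
simple_train_acc = sum(y == "+" for _, y in train) / len(train)

print(id3_train_acc, id3_test_acc)        # ID3's tree
print(simple_train_acc, simple_test_acc)  # simpler "tree"
```

C separates the five training examples perfectly only by coincidence, so the tree built on it is no better than a coin flip on new data.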
Post Pruning • There are more sophisticated methods of deciding where to prune than simply estimating accuracy on a tuning set. • See the C4.5 and CART books for details. • We won’t discuss them, except for MDL • Tuning sets also called • Pruning sets (in d-tree algorithms) • Validation sets (in general)
Tuning Sets vs MDL • Two ways to deal with overfitting • Tuning Sets • Empirically evaluate pruned trees • MDL (Minimum Description Length) • Theoretically evaluate/score pruned trees • Describe training data in as few bits as possible (“compression”)
MDL (cont.) • No need to hold aside training data • But how good is the MDL hypothesis? • Heuristic: MDL => good generalization
The Minimum Description Length (MDL) Principle (Rissanen, 1986; Quinlan and Rivest, 1989) • Informally, we want to view a training set as data = general rule + exceptions to the rule (“noise”) • Tradeoff between • Simple rule, but many exceptions • Complex rule with few exceptions • How to make this tradeoff? • Try to minimize the “description length” of the rule + exceptions
Trading Off Simplicity vs Coverage • Minimize: Description Length = Size of Rules + λ × Size of Exceptions • Size of Rules: # bits needed to represent a decision tree that covers (possibly incompletely) the training examples • Size of Exceptions: # bits needed to encode the exceptions to this decision tree • λ: a weighting factor, user-defined or set using a tuning set • Key issue: what’s the best coding strategy to use?
A Simple MDL Algorithm • Build the full tree using ID3 (and all the training examples) • Consider all/many subtrees, keeping the one that minimizes: score = (# nodes in tree) + λ × (error rate on training set) (a crude scoring function) • Some details: if # features = Nf and # examples = Ne, then need Ceiling(log2(Nf)) bits to encode each tree node and Ceiling(log2(Ne)) bits to encode an exception.
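Both scoring functions are easy to write down. The sketch below is illustrative: the function names are made up here, and combining the two bit counts with λ in `description_length_bits` is an assumption based on the general formula Size of Rules + λ × Size of Exceptions.

```python
from math import ceil, log2

def crude_score(n_nodes, train_error_rate, lam):
    """The crude score from the slide: tree size plus weighted training error rate."""
    return n_nodes + lam * train_error_rate

def description_length_bits(n_nodes, n_exceptions, n_features, n_examples, lam=1.0):
    """Bit-based variant: each node names a feature, each exception names an example."""
    bits_per_node = ceil(log2(n_features))
    bits_per_exception = ceil(log2(n_examples))
    return n_nodes * bits_per_node + lam * n_exceptions * bits_per_exception

# Hypothetical numbers: 8 features, 100 examples.
# A 5-node tree with 4 exceptions vs. a 9-node tree with 1 exception.
print(description_length_bits(5, 4, n_features=8, n_examples=100))
print(description_length_bits(9, 1, n_features=8, n_examples=100))
```

With these hypothetical numbers the larger tree has the shorter description (34 bits vs 43), because each exception costs 7 bits while each node costs only 3.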
Searching the Space of Pruned D-trees with MDL • Can use same greedy search algorithm used with pruning sets • But use MDL score rather than pruning set accuracy as the heuristic function
MDL Summarized • The overfitting problem • Can exactly fit the training data, but will this generalize well to test data? • Trade off some training-set errors for fewer test-set errors • One solution: the MDL hypothesis • Solve the MDL problem (on the training data) and you are likely to generalize well (accuracy on the test data) • The MDL problem • Minimize |description of general concept| + λ |list of exceptions (in the train set)|
Small Disjuncts (Holte et al. IJCAI 1989) • Results of learning can often be viewed as a disjunction of conjunctions • Definition: small disjuncts – Disjuncts that correctly classify few training examples • Not necessarily small in area.
The Problem with Small Disjuncts • Collectively, cover much of the training data, but account for much of the testset error • One study • Cover 41% of training data and produce 95% of the test set error • The “small-disjuncts problem” still an open issue (See Quinlan paper in MLJ for additional discussion).
Overfitting Avoidance Wrapup • Note: fundamental issue in all of ML, not just decision trees (after all, it is easy to exactly match training data via “table lookup”) • Approaches • Use simple ML algorithm from the start. • Optimize accuracy on a tuning set. • Only make distinctions that are statistically justified. • Minimize |concept descriptions| + λ |exception list|. • Use ensembles to average out overfitting (next topic).
Decision “Stumps” • Holte (MLJ) compared: • Decision trees with only one decision (decision stumps) VS • Trees produced by C4.5 (with pruning algorithm used) • Decision “stumps” do remarkably well on UC Irvine data sets • Archive too easy? • Decision stumps are a “quick and dirty” control for comparing to new algorithms. • But C4.5 easy to use and probably a better control.
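A decision stump over discrete features can be sketched as below. This is a simplified illustration of the one-rule idea only: Holte's 1R also handles numeric features and missing values, which are omitted here, and the name `one_r` is chosen for illustration.

```python
from collections import Counter, defaultdict

def one_r(examples, features):
    """Return (feature, rule, training_errors) for the best single-feature rule."""
    best = None
    for f in features:
        # Count class labels for each value of this feature.
        by_value = defaultdict(Counter)
        for x, y in examples:
            by_value[x[f]][y] += 1
        # Each value predicts its majority class.
        rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(rule[x[f]] != y for x, y in examples)
        if best is None or errors < best[2]:
            best = (f, rule, errors)
    return best

# The Color/Size examples from the info gain slide.
data = [
    ({"Color": "Red", "Size": "BIG"}, "+"),
    ({"Color": "Blue", "Size": "BIG"}, "+"),
    ({"Color": "Red", "Size": "SMALL"}, "-"),
    ({"Color": "Yellow", "Size": "SMALL"}, "-"),
    ({"Color": "Red", "Size": "BIG"}, "+"),
]
print(one_r(data, ["Color", "Size"]))
```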
C4.5 Compared to 1R (“Decision Stumps”) • Test Set Accuracy • 1st column: UCI datasets • See Holte Paper for key • Max diff: 2nd row • Min Diff: 5th row • UCI datasets too easy?
Dealing with Missing Features • Bayes nets might be the best technique if many missing features (later) • Common technique: Use EM algorithm (later) • Quinlan’s suggested approach: • During Training (on each recursive call) • Fill in missing values proportionally • If 50% red, 30% blue and 20% green (for non-missing cases), then fill missing values according to this probability distribution • Do this per output category
Simple Example • Note: by “missing features” we really mean “missing feature values” • Prob(red | +) = 2/3, Prob(blue | +) = 1/3 • Prob(red | -) = 1/2, Prob(blue | -) = 1/2 • Flip weighted coins to fill in ?’s
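The coin flipping can be sketched as follows. The example table from the slide is not available in this transcript, so the data below is hypothetical, chosen only to match the stated probabilities; the function name `fill_missing` and the use of `None` for a missing value are also assumptions.

```python
import random
from collections import Counter

def fill_missing(examples, feature, rng):
    """Fill None values of `feature` by sampling from the observed values of the same class."""
    filled = []
    for x, y in examples:
        if x[feature] is None:
            # Distribution over non-missing values, restricted to this output category.
            # Assumes every class has at least one observed value.
            counts = Counter(v[feature] for v, c in examples
                             if c == y and v[feature] is not None)
            values, weights = zip(*counts.items())
            x = {**x, feature: rng.choices(values, weights=weights)[0]}
        filled.append((x, y))
    return filled

# Hypothetical data: among + examples red:blue = 2:1, among - examples 1:1.
data = [
    ({"color": "red"}, "+"), ({"color": "red"}, "+"), ({"color": "blue"}, "+"),
    ({"color": None}, "+"),
    ({"color": "red"}, "-"), ({"color": "blue"}, "-"),
    ({"color": None}, "-"),
]
print(fill_missing(data, "color", random.Random(0)))
```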
Missing Feature During Testing • Follow all paths, weight answers proportional to probability of each path • E.g., Color with 40% red, 20% blue, 40% green: out+(color) = 0.4 out+(red) + 0.2 out+(blue) + 0.4 out+(green) = votes for + being the category (repeat for -) • Repeat throughout subtrees
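A minimal sketch of the weighted vote, assuming each branch stores the fraction of training examples that followed it. The dictionary-based tree layout, the name `class_votes`, and the leaf labels in the demo are all hypothetical.

```python
def class_votes(node, x):
    """Return {label: weight}; split the vote across branches when a value is missing."""
    if "label" in node:                        # leaf
        return {node["label"]: 1.0}
    value = x.get(node["feature"])
    if value is not None:                      # value known: follow the single path
        return class_votes(node["branches"][value][1], x)
    votes = {}
    for fraction, child in node["branches"].values():
        # Value missing: follow every path, weighted by its probability.
        for label, weight in class_votes(child, x).items():
            votes[label] = votes.get(label, 0.0) + fraction * weight
    return votes

# Color node with the 40% / 20% / 40% split from the slide; leaf labels are made up.
tree = {"feature": "color", "branches": {
    "red": (0.4, {"label": "+"}),
    "blue": (0.2, {"label": "-"}),
    "green": (0.4, {"label": "+"}),
}}
print(class_votes(tree, {}))                  # color missing
print(class_votes(tree, {"color": "blue"}))   # color known
```

Because the function recurses into every child when a value is missing, the same weighting is repeated throughout the subtrees.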
Why are Features Missing? • Model on previous page implicitly assumes feature values are randomly deleted • as if hit by a cosmic ray! • But values might be missing for a reason • E.g., data collector decided the values for some features are not worth recording • One suggested solution: • Let “not-recorded” be another legal value (and, hence, a branch in the decision tree)
A D-Tree Variant that Exploits Info in “Missing” Feature Values • At each recursive call, only consider features that have no missing values • E.g. [Figure: tree that tests Shape and then Color; annotation on one branch: maybe all the missing values for color take this path] • Could generalize this algorithm by penalizing features with missing values
ID3 Recap: Questions Addressed • How closely should we fit the training data? • Completely, then prune • Use MDL or tuning sets to choose • How do we judge features? • Use info theory (Shannon) • What if a feature has many values? • Correction factor based on info theory • What if some feature values are unknown (in some examples)? • Distribute based on other examples (???)
ID3 Recap (cont.) • What if some features cost more to evaluate (CAT scan vs. Temperature)? • Ad hoc correction factor • Batch vs. incremental learning? • Basically a “batch” approach; incremental variants exist but since ID3 is so fast, why not simply rerun “from scratch”?