
Presentation Transcript


  1. Announcements • No reading assignment for next week • Prepare for exam • Midterm exam next week

  2. Last Time • Neural nets (briefly) • Decision trees • Today • More decision trees • Ensembles • Exam Review • Next time • Exam • Advanced ML topics

  3. Overview of ID3 • [Diagram: ID3 repeatedly picks a splitting attribute (A1 ... A4) and partitions the +/- examples into subtrees] • If a split produces an empty (*NULL*) branch, use the majority class at the parent node

  4. Example Info Gain Calculation
     Color    Shape   Size    Class
     Red      ?       BIG     +
     Blue     ?       BIG     +
     Red      ?       SMALL   -
     Yellow   ?       SMALL   -
     Red      ?       BIG     +

  5. Info Gain Calculation (contd.) Note that “Size” provides complete classification.
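
A small sketch of the info-gain arithmetic behind slides 4-5, using only the Color and Size columns that survive in the transcript; the helper names (`entropy`, `info_gain`) are illustrative, not from the lecture.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, feature):
    """Entropy of the class labels minus the expected entropy after splitting on `feature`."""
    labels = [ex["Class"] for ex in examples]
    before = entropy(labels)
    remainder = 0.0
    for value in {ex[feature] for ex in examples}:
        subset = [ex["Class"] for ex in examples if ex[feature] == value]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return before - remainder

# The five examples from the slide (Shape values were not captured above).
data = [
    {"Color": "Red",    "Size": "BIG",   "Class": "+"},
    {"Color": "Blue",   "Size": "BIG",   "Class": "+"},
    {"Color": "Red",    "Size": "SMALL", "Class": "-"},
    {"Color": "Yellow", "Size": "SMALL", "Class": "-"},
    {"Color": "Red",    "Size": "BIG",   "Class": "+"},
]

print(info_gain(data, "Size"))   # ~0.971 bits: Size classifies the data completely
print(info_gain(data, "Color"))  # smaller, since the Red branch is mixed
```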

  6. Runtime Performance of ID3 • Let E = # examples, F = # features • At level 1: look at each feature, and for each feature look at each example (to get its feature value) • Work to choose one feature = O(F × E)

  7. Runtime Performance of ID3 (cont.) • In the worst case, need to consider all features along all paths (full tree): O(F² × E) • Reasonably efficient

  8. Generating Rules • Antecedent: conjunction of all decisions leading to a terminal node • Consequent: label of the terminal node • Example tree: COLOR? Green → -, Blue → +, Red → SIZE? (Big → +, Small → -)

  9. Generating Rules (cont.) • Generated rules: Color=Green → -, Color=Blue → +, Color=Red and Size=Big → +, Color=Red and Size=Small → - • Note: 1. Can "clean up" the rule set (see Quinlan's C4.5) 2. Decision trees learn disjunctive concepts
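
To make the antecedent/consequent construction concrete, here is a rough sketch that walks a tree and emits one rule per leaf; the nested-dict tree representation is an assumption made for illustration, not the lecture's data structure.

```python
def tree_to_rules(tree, antecedent=()):
    """Walk a tree of the form {feature: {value: subtree_or_leaf}} and
    return one (antecedent, consequent) rule per terminal node."""
    if not isinstance(tree, dict):             # a leaf: its label is the consequent
        return [(" and ".join(antecedent) or "TRUE", tree)]
    rules = []
    for feature, branches in tree.items():
        for value, subtree in branches.items():
            rules.extend(tree_to_rules(subtree, antecedent + (f"{feature}={value}",)))
    return rules

# The example tree from slide 8: split on Color, then on Size under Red.
tree = {"Color": {"Green": "-", "Blue": "+", "Red": {"Size": {"Big": "+", "Small": "-"}}}}
for ante, cons in tree_to_rules(tree):
    print(f"{ante} -> {cons}")
# Color=Green -> -
# Color=Blue -> +
# Color=Red and Size=Big -> +
# Color=Red and Size=Small -> -
```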

  10. Noise - A Major Issue in ML • Worst case: + and - at the same point in feature space • Causes: 1. Too few features ("hidden variables") or too few possible values 2. Incorrectly reported/measured/judged feature values 3. Misclassified instances

  11. Noise - A Major Issue in ML (cont.) • [Diagram: a few "noisy" - points scattered among the + region of feature space] • Issue: overfitting, i.e., producing an "awkward" concept because of a few "noisy" points • Bad performance on future examples? Or better performance?

  12. Overfitting Viewed in Terms of Function-Fitting • [Plot of f(x) vs. x: data (+ points) = red line + noise, with a fitted model curve]

  13. Definition of Overfitting • Concept C overfits the training data if there exists a "simpler" concept S such that: training-set accuracy of C > training-set accuracy of S, but test-set accuracy of C < test-set accuracy of S • Assuming a large enough test set so that it is representative
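
The same condition in symbols (acc denotes classification accuracy):

```latex
C \text{ overfits the training data} \;\iff\;
\exists\, S \text{ (``simpler'' than } C\text{)}:\;
\mathrm{acc}_{\mathrm{train}}(C) > \mathrm{acc}_{\mathrm{train}}(S)
\;\text{ and }\;
\mathrm{acc}_{\mathrm{test}}(C) < \mathrm{acc}_{\mathrm{test}}(S)
```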

  14. Remember! • It is easy to learn/fit the training data • What’s hard is generalizing well to future (“test set”) data! • Overfitting avoidance is a key issue in Machine Learning

  15. Can One Underfit? • Sure, if not fully fitting the training set, e.g., just return the majority category (+ or -) in the training set as the learned model • But also if there is not enough data to illustrate the important distinctions

  16. ID3 & Noisy Data • To avoid overfitting, allow splitting to stop before all examples are of one class • Option 1: if the information remaining is below a small threshold, don't split (empirically this failed: bad performance on error-free data (Quinlan))

  17. ID3 & Noisy Data (cont.) • Option 2: Estimate whether all remaining features are statistically independent of the class of the remaining examples • Uses the chi-squared test of the original ID3 paper • Works well on error-free data

  18. ID3 & Noisy Data (cont.) • Option 3: (not in original ID3 paper) Build complete tree, then use some “spare” (tuning) examples to decide which parts of tree can be pruned.

  19. ID3 & Noisy Data (cont.) • Pruning is currently the best choice; see C4.5 for technical details • Prune repeatedly using a greedy algorithm

  20. Greedily Pruning D-trees • Sample (hill-climbing) search space • [Diagram: at each step keep the best pruned tree; stop if no improvement]

  21. Pruning by Measuring Accuracy on the Tune Set 1. Run ID3 to fully fit the TRAIN' set; measure accuracy on TUNE 2. Consider all subtrees in which ONE interior node is removed and replaced by a leaf labeled with the majority category in the pruned subtree; choose the best such subtree on TUNE; if no improvement, quit 3. Go to 2
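
A minimal sketch of this tune-set pruning loop, assuming interior nodes are dicts of the form {"split": feature, "branches": {value: child}, "majority": label}, leaves are plain labels, and the tune set is a list of feature dicts with a "Class" key; all of these representation choices are illustrative assumptions, not the lecture's code.

```python
import copy

def classify(node, example):
    """Follow the example's feature values down to a leaf label."""
    while isinstance(node, dict):
        node = node["branches"][example[node["split"]]]
    return node

def accuracy(tree, examples):
    return sum(classify(tree, ex) == ex["Class"] for ex in examples) / len(examples)

def interior_paths(node, path=()):
    """Yield the branch-value path to every interior node."""
    if isinstance(node, dict):
        yield path
        for value, child in node["branches"].items():
            yield from interior_paths(child, path + (value,))

def prune_at(tree, path):
    """Copy the tree, replacing the interior node at `path` with a leaf
    labeled by that node's majority training category."""
    tree = copy.deepcopy(tree)
    if not path:                                   # pruning the root
        return tree["majority"]
    parent = tree
    for value in path[:-1]:
        parent = parent["branches"][value]
    parent["branches"][path[-1]] = parent["branches"][path[-1]]["majority"]
    return tree

def greedy_prune(tree, tune_set):
    """Steps 1-3 from the slide: repeatedly take the single best pruning
    step on TUNE; quit when no step improves tune-set accuracy."""
    best, best_acc = tree, accuracy(tree, tune_set)
    while isinstance(best, dict):
        candidates = [prune_at(best, p) for p in interior_paths(best)]
        champion = max(candidates, key=lambda t: accuracy(t, tune_set))
        if accuracy(champion, tune_set) <= best_acc:
            return best                            # no improvement: quit
        best, best_acc = champion, accuracy(champion, tune_set)
    return best
```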

  22. The Tradeoff in the Greedy Algorithm • Efficiency vs. optimality • [Diagram: the initial tree, rooted at R with interior nodes A-F] • E.g., if the best set of cuts on TUNE is to discard C's and F's subtrees, but the single best cut is to discard B's subtree, then greedy search will not find the best tree • Greedy search: powerful, general-purpose, a trick of the trade

  23. Hypothetical Trace of a Greedy Algorithm • Full-tree accuracy = 85% on the TUNE set • [Diagram: tree rooted at R with interior nodes A-F; each node is labeled with the TUNE accuracy if we replace that node with a leaf, leaving the rest of the tree the same, e.g. R [64], C [88], D [63], E [74], F [87]] • Pruning at B works best

  24. Hypothetical Trace of a Greedy Algorithm (cont.) • [Diagram: the pruned tree, now just R, A, and B] • Full-tree accuracy = 89% • STOP, since no improvement by cutting again, and return the above tree

  25. Another Possibility: Rule Post-Pruning (also a greedy algorithm) • Induce a decision tree • Convert it to rules (see earlier slide) • Consider dropping one rule antecedent; delete the one that improves tuning-set accuracy the most • Repeat as long as progress is being made
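
A rough sketch of the antecedent-dropping loop, assuming each rule is a (list of (feature, value) conditions, label) pair plus a default label for examples no rule matches; these representation details are assumptions, not the lecture's.

```python
def rule_matches(conditions, example):
    """All (feature, value) conditions must hold for the rule to fire."""
    return all(example.get(feature) == value for feature, value in conditions)

def ruleset_accuracy(rules, default_label, examples):
    """Classify each example with the first matching rule, else the default label."""
    correct = 0
    for ex in examples:
        prediction = next((label for conds, label in rules if rule_matches(conds, ex)),
                          default_label)
        correct += (prediction == ex["Class"])
    return correct / len(examples)

def post_prune_rules(rules, default_label, tune_set):
    """Repeatedly delete the single antecedent whose removal most improves
    tuning-set accuracy; stop when no deletion helps."""
    best_acc = ruleset_accuracy(rules, default_label, tune_set)
    while True:
        best_candidate = None
        for i, (conds, label) in enumerate(rules):
            for j in range(len(conds)):
                candidate = rules[:i] + [(conds[:j] + conds[j + 1:], label)] + rules[i + 1:]
                acc = ruleset_accuracy(candidate, default_label, tune_set)
                if acc > best_acc:
                    best_acc, best_candidate = acc, candidate
        if best_candidate is None:
            return rules
        rules = best_candidate
```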

  26. Rule Post-Pruning (cont.) • Advantages • Allows an intermediate node to be pruned from some rules but retained in others • Can correct poor early decisions in tree construction • Final concept is more understandable

  27. Training with Noisy Data • If we can clean up the training data, should we do so? • No (assuming one can’t clean up the testing data when the learned concept will be used). • Better to train with the same type of data as will be experienced when the result of learning is put into use.

  28. Overfitting + Noise • Using the strict definition of overfitting presented earlier, is it possible to overfit noise-free data? • In general? • Using ID3?

  29. Example of Overfitting of Noise-Free Data • Let the correct concept = A ^ B • Feature C is true 50% of the time, for both + and - examples • Prob(+ example) = 0.9 • Training set: • +: ABCDE, ABC¬DE, ABCD¬E • -: A¬B¬CD¬E, ¬AB¬C¬DE

  30. Example (Continued)
      Tree                                   Trainset Accuracy   Testset Accuracy
      ID3's tree (split on C: T → +, F → -)        100%               50%
      Simpler "tree" (the single leaf +)            60%               90%
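
Where those numbers come from, assuming the test distribution has P(+) = 0.9 and C is independent of the class, as stated on the previous slide:

```latex
\begin{aligned}
\text{ID3's tree (predict ${+}$ iff $C$):}\quad
 \mathrm{acc}_{\mathrm{train}} &= \tfrac{5}{5} = 100\%, &
 \mathrm{acc}_{\mathrm{test}} &= 0.5(0.9) + 0.5(0.1) = 50\% \\[2pt]
\text{Simpler ``tree'' (always ${+}$):}\quad
 \mathrm{acc}_{\mathrm{train}} &= \tfrac{3}{5} = 60\%, &
 \mathrm{acc}_{\mathrm{test}} &= P(+) = 90\%
\end{aligned}
```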

  31. Post Pruning • There are more sophisticated methods of deciding where to prune than simply estimating accuracy on a tuning set. • See the C4.5 and CART books for details. • We won’t discuss them, except for MDL • Tuning sets also called • Pruning sets (in d-tree algorithms) • Validation sets (in general)

  32. Tuning Sets vs MDL • Two ways to deal with overfitting • Tuning Sets • Empirically evaluate pruned trees • MDL (Minimal Description Length) • Theoretically evaluate/score pruned trees • Describe training data in as few bits as possible (“compression”)

  33. MDL (cont.) • No need to hold aside training data • But how good is the MDL hypothesis? • Heuristic: MDL => good generalization

  34. The Minimal Description Length (MDL) Principle (Rissanen, 1986; Quinlan and Rivest, 1989) • Informally, we want to view a training set as data = general rule + exceptions to the rule (“noise”) • Tradeoff between • Simple rule, but many exceptions • Complex rule with few exceptions • How to make this tradeoff? • Try to minimize the “description length” of the rule + exceptions

  35. Trading Off Simplicity vs. Coverage • Minimize: description length = (size of rules) + λ × (size of exceptions), i.e., the # bits needed to represent a decision tree that covers (possibly incompletely) the training examples, plus λ × the # bits needed to encode the exceptions to this decision tree • λ is a weighting factor, user-defined or chosen with a tuning set • Key issue: what's the best coding strategy to use?

  36. A Simple MDL Algorithm • Build the full tree using ID3 (and all the training examples) • Consider all/many subtrees, keeping the one that minimizes: score = (# nodes in tree) + λ × (error rate on training set) (a crude scoring function) • Some details: if # features = Nf and # examples = Ne, then we need ⌈log₂ Nf⌉ bits to encode each tree node and ⌈log₂ Ne⌉ bits to encode an exception
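
One way to turn the crude score plus the bit-count details into code; `classify` is repeated from the pruning sketch, and the nested-dict tree representation is the same assumed form, not the lecture's.

```python
import math

def classify(node, example):
    """Follow the example's feature values down to a leaf label (as in the pruning sketch)."""
    while isinstance(node, dict):
        node = node["branches"][example[node["split"]]]
    return node

def count_nodes(tree):
    """Number of interior (decision) nodes; leaves are plain labels."""
    if not isinstance(tree, dict):
        return 0
    return 1 + sum(count_nodes(child) for child in tree["branches"].values())

def mdl_score(tree, train_set, num_features, lam=1.0):
    """Crude MDL-style score: bits to encode the tree plus lambda times
    bits to encode its exceptions (misclassified training examples)."""
    node_bits = math.ceil(math.log2(num_features)) if num_features > 1 else 1
    exception_bits = math.ceil(math.log2(len(train_set))) if len(train_set) > 1 else 1
    num_exceptions = sum(classify(tree, ex) != ex["Class"] for ex in train_set)
    return count_nodes(tree) * node_bits + lam * num_exceptions * exception_bits
```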

  37. Searching the Space of Pruned D-trees with MDL • Can use same greedy search algorithm used with pruning sets • But use MDL score rather than pruning set accuracy as the heuristic function

  38. MDL Summarized • The overfitting problem • Can exactly fit the training data, but will this generalize well to test data? • Trade off some training-set errors for fewer test-set errors • One solution: the MDL hypothesis • Solve the MDL problem (on the training data) and you are likely to generalize well (accuracy on the test data) • The MDL problem • Minimize |description of general concept| + λ |list of exceptions (in the train set)|

  39. Small Disjuncts (Holte et al. IJCAI 1989) • Results of learning can often be viewed as a disjunction of conjunctions • Definition: small disjuncts – Disjuncts that correctly classify few training examples • Not necessarily small in area.

  40. The Problem with Small Disjuncts • Collectively, cover much of the training data, but account for much of the testset error • One study • Cover 41% of training data and produce 95% of the test set error • The “small-disjuncts problem” still an open issue (See Quinlan paper in MLJ for additional discussion).

  41. Overfitting Avoidance Wrapup • Note: this is a fundamental issue in all of ML, not just decision trees; after all, it is easy to exactly match training data via "table lookup" • Approaches • Use a simple ML algorithm from the start • Optimize accuracy on a tuning set • Only make distinctions that are statistically justified • Minimize |concept description| + λ |exception list| • Use ensembles to average out overfitting (next topic)

  42. Decision “Stumps” • Holte (MLJ) compared: • Decision trees with only one decision (decision stumps) VS • Trees produced by C4.5 (with pruning algorithm used) • Decision “stumps” do remarkably well on UC Irvine data sets • Archive too easy? • Decision stumps are a “quick and dirty” control for comparing to new algorithms. • But C4.5 easy to use and probably a better control.

  43. C4.5 Compared to 1R (“Decision Stumps”) • Test Set Accuracy • 1st column: UCI datasets • See Holte Paper for key • Max diff: 2nd row • Min Diff: 5th row • UCI datasets too easy?

  44. Dealing with Missing Features • Bayes nets might be the best technique if many missing features (later) • Common technique: Use EM algorithm (later) • Quinlan’s suggested approach: • During Training (on each recursive call) • Fill in missing values proportionally • If 50% red, 30% blue and 20% green (for non-missing cases), then fill missing values according to this probability distribution • Do this per output category

  45. Simple Example • Note: by "missing features" we really mean "missing feature values" • Prob(red | +) = 2/3, Prob(blue | +) = 1/3 • Prob(red | -) = 1/2, Prob(blue | -) = 1/2 • Flip weighted coins to fill in the ?'s

  46. Missing Feature During Testing • Follow all paths, weighting answers in proportion to the probability of each path • E.g., if Color is missing and the training data at this node were 40% red, 20% blue, 40% green: out+(Color) = 0.4 × out+(red) + 0.2 × out+(blue) + 0.4 × out+(green), the weighted vote for + being the category (repeat for -) • Repeat throughout subtrees
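
A hedged sketch of that weighted-path evaluation: when the test example is missing the feature a node splits on, descend every branch and weight each branch's + vote by the fraction of training examples that went that way. The `fractions` field stored at each node is an assumption, and leaves are used in place of deeper subtrees for brevity.

```python
def vote_for_plus(node, example):
    """Return the (possibly fractional) vote for '+' in [0, 1] for this example."""
    if not isinstance(node, dict):               # leaf
        return 1.0 if node == "+" else 0.0
    value = example.get(node["split"])           # None means the value is missing
    if value is not None:
        return vote_for_plus(node["branches"][value], example)
    # Missing: follow all paths, weighted by the training-time branch probabilities.
    return sum(node["fractions"][v] * vote_for_plus(child, example)
               for v, child in node["branches"].items())

# Example matching the slide: a Color node with 40% red, 20% blue, 40% green branches.
color_node = {
    "split": "Color",
    "fractions": {"red": 0.4, "blue": 0.2, "green": 0.4},
    "branches": {"red": "+", "blue": "-", "green": "+"},
}
print(vote_for_plus(color_node, {}))   # 0.4*1 + 0.2*0 + 0.4*1 = 0.8 votes for +
```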

  47. Why are Features Missing? • Model on previous page implicitly assumes feature values are randomly deleted • as if hit by a cosmic ray! • But values might be missing for a reason • E.g., data collector decided the values for some features are not worth recording • One suggested solution: • Let “not-recorded” be another legal value (and, hence, a branch in the decision tree)

  48. A D-Tree Variant that Exploits Info in "Missing" Feature Values • At each recursive call, only consider features that have no missing values • [Diagram: e.g., split on Shape first and on Color lower down; maybe all the examples with missing Color values take the same path] • Could generalize this algorithm by penalizing features with missing values

  49. ID3 Recap ~ Questions Addressed • How closely should we fit the training data? • Completely, then prune • Use MDL or tuning sets to choose • How do we judge features? • Use info theory (Shannon) • What if a feature has many values? • Correction factor based on info theory • What if some feature values are unknown (in some examples)? • Distribute based on other examples (???)

  50. ID3 Recap (cont.) • What if some features cost more to evaluate (CAT scan vs. Temperature)? • Ad hoc correction factor • Batch vs. incremental learning? • Basically a “batch” approach; incremental variants exist but since ID3 is so fast, why not simply rerun “from scratch”?
