
LING / C SC 439/539 Statistical Natural Language Processing


Presentation Transcript


  1. LING / C SC 439/539 Statistical Natural Language Processing Lecture 11 part 1 2/18/2013

  2. Recommended reading • Ensemble learning, cross-validation • Marsland Chapter 7, on web page • Hastie 7.10-7.11, 8.7

  3. Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting

  4. Ensemble learning • Various techniques to improve performance over results of basic classification algorithms • Use multiple classifiers • Select subset of training set • Weight training set points • Algorithms • Cross-validation • Voting • Bagging • Boosting

  5. Ensemble learning and model selection • Available models • Range of possible parameterizations of the model • Choice of learning algorithms to combine • How well the model fits the data • Separate points in the training data • Generalize to new data • Balance simplicity of the model against fit to the data • Noisy data → need more robustness to noise • Separability → more complicated decision boundary • Maximum margin • Computational issues • Not substantially more costly than the original algorithms

  6. Example of ensemble learning:Netflix prize • http://en.wikipedia.org/wiki/Netflix_Prize • Collaborative filtering / recommender system: predict which movies a user will like, given ratings that the user assigned to other movies • In 2006, Netflix offered $1,000,000 to anyone who could improve on their system by 10% • Training set: • 100,480,507 ratings (1 to 5 stars) • 480,189 users • 17,700 movies

  7. Netflix prize • Winner in 2009, with a 10.09% improvement • Teams with the highest improvements used ensemble learning extensively, joining forces with other teams and combining different algorithms • http://www.wired.com/epicenter/2009/09/how-the-netflix-prize-was-won/ • http://www.quora.com/Netflix-Prize/Is-there-any-summary-of-top-models-for-the-Netflix-prize

  8. Netflix prize • Finally, let's talk about how all of these different algorithms were combined to provide a single rating that exploited the strengths of each model. (Note that, as mentioned above, many of these models were not trained on the raw ratings data directly, but rather on the residuals of other models.) • In the paper detailing their final solution, the winners describe using gradient boosted decision trees to combine over 500 models; previous solutions used instead a linear regression to combine the predictors. • Briefly, gradient boosted decision trees work by sequentially fitting a series of decision trees to the data; each tree is asked to predict the error made by the previous trees, and is often trained on slightly perturbed versions of the data.
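
The following is a minimal sketch of the residual-fitting idea described above (each new tree predicts the errors of the ensemble built so far), not the winners' actual system; the synthetic data, scikit-learn regression trees, and parameter values are illustrative assumptions.

```python
# Gradient boosting for squared error, in sketch form: each new tree is fit to
# the residuals (current errors) of the ensemble built so far.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))            # synthetic inputs
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)    # noisy target

n_trees, learning_rate = 50, 0.1
prediction = np.full(len(y), y.mean())           # start from the mean prediction
trees = []
for _ in range(n_trees):
    residual = y - prediction                    # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```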

  9. Effect of ensemble size (from early in the competition, in 2007)

  10. Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting

  11. Decision boundaries and overfitting • Simple • Hyperplane (perceptron, SVM) • Complicated • Decision Tree • Neural network • K-nearest neighbors, especially when k is small • Algorithms that learn complicated decision boundaries have a tendency to overfit the training data

  12. Overfitting • Overfitting: algorithm learns about noise and specific data points in the training set, instead of the general pattern • Error rate on training set is minimized, but error rate on test set increases

  13. Validation set • Problem: easy to overfit to training data • Want to minimize error on training in order to learn statistical characteristics of data • But this may cause increased error in testing • Solution: reserve portion of training data as validation set • Apply classifier to validation set during training • Now you have an “external” measure of the quality of the model • Problem: you could overfit on the validation set if you repeatedly optimize your classifier against it

  14. Better solution: cross-validation • K-fold cross-validation: • Split up training data into K equally-sized sets (folds) • Example: 6 folds. Blue: training, Yellow: validation • Produces K different classifiers • Train classifier on K-1 folds, test on the other fold • Calculate average performance over all folds • Less likely to overfit, compared to using a single validation set
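
A minimal sketch of K-fold cross-validation, under illustrative assumptions (synthetic data, a scikit-learn perceptron as the classifier, K = 6 as in the slide's example):

```python
# K-fold cross-validation: split the data into K folds, train on K-1 folds,
# validate on the held-out fold, and average the scores over all K folds.
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # roughly linearly separable labels

K = 6
indices = rng.permutation(len(y))           # shuffle before splitting
folds = np.array_split(indices, K)

scores = []
for i in range(K):
    val_idx = folds[i]                      # the held-out (yellow) fold
    train_idx = np.concatenate([folds[j] for j in range(K) if j != i])  # blue folds
    clf = Perceptron().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[val_idx], y[val_idx]))   # accuracy on held-out fold

print("per-fold accuracy:", np.round(scores, 3))
print("cross-validated accuracy:", np.mean(scores))
```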

  15. Cross validation does not produce a new classifier • It gives a more honest measure of the performance of your classifier, by training and testing over different subsets of your data

  16. Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting

  17. Use multiple classifiers • Different classifiers make different errors • Each classifier has its own “opinion” about the data • Obtain results from multiple classifiers trained for the same problem • When combining multiple classifiers trained on the same data set, performance is often better than for a single classifier

  18. Source of classifiers • Could use classifiers constructed by different algorithms • SVM • Perceptron • Neural network • etc. • Or could use classifiers trained on different folds or subsets of the training data

  19. Classifier combination through voting • Simplest way to combine results from multiple classifiers: majority rules • Example • Have 10 classifiers and an instance to classify. • 7 say “Yes” and 3 say “No”. • Choose “Yes” as your classification. • Minor improvements: • Attach a confidence score to each classification • Availability of “confidence” depends on choice of classifier • Skip cases where classifiers tend to disagree • 50% “Yes”, 50% “No”
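
A minimal sketch of majority-rules voting for a single instance, matching the 7-Yes / 3-No example above (the confidence-score and tie-skipping refinements are left out):

```python
# Majority vote over the outputs of several classifiers for one instance.
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]

votes = ["Yes"] * 7 + ["No"] * 3    # 7 classifiers say Yes, 3 say No
print(majority_vote(votes))         # -> Yes
```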

  20. Voting works better than individual classifiers • Performance of voting is often better than for individual classifiers • Why: more complicated decision boundaries • An algorithm’s learning bias predisposes it to acquire particular types of decision boundaries • Additive effect of multiple classifiers allows one to determine which classifications are more likely to be correct, and which ones are incorrect

  21. Learning bias: classifier is predisposed to learn certain types of decision boundaries • Perceptron: hyperplane • SVM: • Hyperplane with max margin, in kernel feature space • May be nonlinear in original feature space • Decision Tree: • Hierarchical splits • Each is a constant on a feature axis, with a limited range • Neural network: any smooth function • K-nearest neighbors • Shape of decision boundary depends on K and choice of distance function

  22. Apply voting to this data

  23. Classifier A (figure: decision regions labeled YES and NO)

  24. Classifier B (figure: decision regions labeled NO and YES)

  25. Learn correct decision boundary through majority vote of 2 classifiers (figure: regions receiving 0, 1, or 2 votes)

  26. Complicated decision boundaries from classifier combination and voting

  27. Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting

  28. Two components of bagging • 1. Classifier combination with majority vote • Previous section • 2. Each classifier is trained on a set of bootstrap samples from the training data • Bootstrap: sample the data set with replacement • For example: the data set has 1,000 training points • Sample 1,000 (or some other quantity) points, with replacement • It’s likely that the sampled data set will not contain all of the data in the original data set • It’s likely that the sampled data set will contain multiple copies of some data points from the original data set
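
A small sketch demonstrating the two claims above: a bootstrap sample of the same size as the original data set typically misses some points and repeats others (the data here is synthetic and purely illustrative).

```python
# Bootstrap sampling = sampling with replacement. About 1/e (~37%) of the
# original points are expected to be missing from a same-sized bootstrap sample.
import numpy as np

rng = np.random.default_rng(2)
N = 1000
original = np.arange(N)                          # stand-ins for 1,000 training points
bootstrap = rng.choice(original, size=N, replace=True)

n_unique = len(np.unique(bootstrap))
print("distinct original points in the sample:", n_unique)
print("fraction of original points missing:", 1 - n_unique / N)   # roughly 0.37
```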

  29. Train on a set of bootstrap samples • Suppose there are N instances in the training set. • Generate one set of bootstrap samples: • Randomly sample from the training set, with replacement (i.e., data may be repeated) • Sample N times to build a new training set • Then train a classifier on each bootstrap sample • Can use same algorithm, or different algorithms
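
A minimal bagging sketch under illustrative assumptions (synthetic data, scikit-learn depth-1 trees as the base classifiers, 25 bootstrap rounds): train one classifier per bootstrap sample, then combine them by majority vote as in the previous section.

```python
# Bagging: one classifier per bootstrap sample, combined by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)    # nonlinear class boundary

n_classifiers, N = 25, len(y)
classifiers = []
for _ in range(n_classifiers):
    idx = rng.choice(N, size=N, replace=True)        # one bootstrap sample
    stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
    classifiers.append(stump)

# Majority vote: average the 0/1 predictions and threshold at 0.5
all_preds = np.stack([c.predict(X) for c in classifiers])
bagged = (all_preds.mean(axis=0) > 0.5).astype(int)
print("bagged training accuracy:", (bagged == y).mean())
```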

  30. Weak classifiers • Classifier combination can perform well even if the individual classifiers are “weak” (= perform relatively poorly) • Example: stump classifier • Construct only the root node of a decision tree by splitting on one feature • Do not construct the rest of the tree • Generates a single decision boundary: a threshold on one feature axis
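
For concreteness, here is a from-scratch sketch of a stump (the depth-1 tree used in the bagging sketch above): try every (feature, threshold) split, keep the one with the lowest training error, and label each side with its majority class. Binary 0/1 labels and exhaustive threshold search are simplifying assumptions.

```python
# A decision stump: the root node of a decision tree, and nothing else.
import numpy as np

def fit_stump(X, y):
    """Find the single (feature, threshold) split with the lowest training error."""
    best = None
    for f in range(X.shape[1]):                      # one feature axis at a time
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            left_label = int(round(left.mean()))     # majority class (0/1 labels)
            right_label = int(round(right.mean()))
            err = (left != left_label).sum() + (right != right_label).sum()
            if best is None or err < best[0]:
                best = (err, f, t, left_label, right_label)
    return best[1:]                                  # (feature, threshold, left, right)

def predict_stump(stump, X):
    f, t, left_label, right_label = stump
    return np.where(X[:, f] <= t, left_label, right_label)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
print(predict_stump(fit_stump(X, y), X))             # -> [0 0 1 1]
```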

  31. Last time: full decision tree (augmented with # of training cases) (figure: decision tree with per-leaf training-case counts 5, 1, 2, 1, 1)

  32. Example of a stump • In the full decision tree, there were 3 classes of data under Party=No: Study (2), Watch (1), Pub (1), TV (1) • Stump: assign majority class as label for a leaf • This stump performs poorly: can’t guess Pub or TV (figure: stump with leaves labeled “Go to Party” and “Study”)

  33. Geometric view: Classifier A is a stump (figure: decision regions labeled YES and NO)

  34. Geometric view: Classifier B is a stump (figure: decision regions labeled NO and YES)

  35. How weak can classifiers be? • Suppose there are T classifiers. • Let p be the success rate of a classifier. • Assume each classifier has the same success rate. • What value does p need to be for the entire ensemble of T classifiers to get the correct answer?

  36. k: number of succeeding classifiers • T different classifiers • The ensemble gets the right answer if more than half of the individual classifiers succeed, i.e., k ≥ ⌊T/2⌋ + 1 • Calculations to be done • What is the probability that exactly k classifiers succeed? • What is the probability that there will be at least ⌊T/2⌋ + 1 succeeding classifiers?

  37. 1. Probability that exactly k classifiers succeed • We have T total classifiers • k succeed, each with probability p • T − k fail, each with probability (1 − p) • There are C(T, k) different combinations of which classifiers succeed and which ones fail • Probability that exactly k classifiers succeed, out of T: P(k) = C(T, k) · p^k · (1 − p)^(T−k) (this is a binomial distribution)

  38. Combinations (http://en.wikipedia.org/wiki/Combination) • C(T, k) is the number of subsets of size k from a set of size T; the answer is given by the binomial coefficient C(T, k) = T! / (k! (T − k)!) • Example: there are C(5, 3) = 10 subsets of size 3 from a set of 5 elements

  39. 2. Probability that the ensemble succeeds • The ensemble succeeds when between k = ⌊T/2⌋ + 1 and k = T classifiers agree, so sum the probability over all values of k in this range: P(ensemble correct) = Σ_{k=⌊T/2⌋+1}^{T} C(T, k) · p^k · (1 − p)^(T−k) • If p > 0.5, this sum approaches 1 as T → ∞ • i.e., even if the individual classifiers perform only slightly better than chance, the more classifiers we have, the more likely it is that the ensemble will classify correctly
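
The sum on this slide can be computed directly; the sketch below assumes the classifiers' errors are independent (an idealization) and uses p = 0.55 to show the probability climbing toward 1 as T grows.

```python
# P(ensemble correct) = sum over k = floor(T/2)+1 .. T of C(T,k) p^k (1-p)^(T-k),
# assuming T independent classifiers that each succeed with probability p.
from math import comb

def ensemble_success(p, T):
    return sum(comb(T, k) * p ** k * (1 - p) ** (T - k)
               for k in range(T // 2 + 1, T + 1))

for T in (11, 51, 201, 1001):
    print(T, round(ensemble_success(0.55, T), 4))    # increases toward 1 with T
```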

  40. How weak can classifiers be? • Suppose there are T classifiers. • Let p be the success rate of a classifier. • Assume each classifier has the same success rate. • What value does p need to be for the entire ensemble of T classifiers to get the correct answer? • Answer: p > 0.5 !!!

  41. Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting

  42. Boosting: I • Each training example has a weight. • Boosting learns a series of classifiers over a number of iterations. • At each iteration: • Learn a classifier according to weighted examples • Compute misclassified examples • Increase weight for misclassified examples • Gives more priority to these for next iteration of training

  43. Boosting: II • Boosting learns an ensemble of classifiers, one for each training iteration • Each classifier is assigned a weight, according to how well it does • Classification of the ensemble is determined by applying weights to classifications of individual classifiers

  44. Decision boundary of each classifier at each iteration

  45. Source: cbio.mskcc.org/~aarvey/BoostingLightIntro.pdf • Learns a stump classifier at each iteration

  46. Source: cbio.mskcc.org/~aarvey/BoostingLightIntro.pdf • εt: weighted error rate of the classifier • αt: weight assigned to the classifier

  47. Source: cbio.mskcc.org/~aarvey/BoostingLightIntro.pdf • εt: weighted error rate of the classifier • αt: weight assigned to the classifier

  48. Source: cbio.mskcc.org/~aarvey/BoostingLightIntro.pdf • εt: weighted error rate of the classifier • αt: weight assigned to the classifier

  49. Final classifier: compute weighted sum of outputs of individual classifiers, using the αt’s Source: cbio.mskcc.org/~aarvey/BoostingLightIntro.pdf
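
Putting slides 42-49 together, here is a minimal AdaBoost sketch under illustrative assumptions (synthetic data with labels in {-1, +1}, scikit-learn depth-1 trees as the weak learners, T = 30 iterations): fit a stump to the weighted examples, compute its weighted error εt, assign it a weight αt, re-weight the misclassified examples, and finally take the sign of the αt-weighted sum of the stumps' outputs.

```python
# AdaBoost with decision stumps, in sketch form.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)           # labels in {-1, +1}

T = 30
w = np.full(len(y), 1 / len(y))                      # uniform example weights
stumps, alphas = [], []
for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    eps = w[pred != y].sum()                         # εt: weighted error rate
    alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))  # αt: weight of this classifier
    w = w * np.exp(-alpha * y * pred)                # raise weight of misclassified points
    w = w / w.sum()                                  # renormalize
    stumps.append(stump)
    alphas.append(alpha)

# Final classifier: sign of the αt-weighted sum of the individual outputs
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(scores) == y).mean())
```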

  50. AdaBoost: adaptive boosting
