
Evaluation



  1. Evaluation

  2. Evaluation • How predictive is the model we learned? • Error on the training data is not a good indicator of performance on future data • Simple solution that can be used if lots of (labeled) data is available: • Split data into training and test set • However: (labeled) data is usually limited • More sophisticated techniques need to be used

  3. Training and testing • Natural performance measure for classification problems: error rate • Success: instance’s class is predicted correctly • Error: instance’s class is predicted incorrectly • Error rate: proportion of errors made over the whole set of instances • Resubstitution error: error rate obtained from training data • Resubstitution error is (hopelessly) optimistic! • Test set: independent instances that have played no part in formation of classifier • Assumption: both training data and test data are representative samples of the underlying problem
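
To make the train/test idea concrete, here is a minimal sketch in Python, assuming scikit-learn and one of its bundled datasets; the decision-tree classifier is just a placeholder learner. It contrasts the (optimistic) resubstitution error with the error on an independent test set.

    # A minimal sketch (assumed library: scikit-learn): resubstitution error vs.
    # error on a held-out test set. Dataset and learner are placeholders.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=0)

    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    resub_error = 1 - clf.score(X_train, y_train)   # error on the training data (optimistic)
    test_error = 1 - clf.score(X_test, y_test)      # error on unseen instances
    print(f"resubstitution error: {resub_error:.3f}")
    print(f"test-set error:       {test_error:.3f}")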

  4. Making the most of the data • Once evaluation is complete, all the data can be used to build the final classifier • Generally, the larger the training data the better the classifier • The larger the test data the more accurate the error estimate • Holdout procedure: method of splitting original data into training and test set • Dilemma: ideally both training set and test set should be large!

  5. Predicting performance • Assume the estimated error rate is 25%. How close is this to the true error rate? • Depends on the amount of test data • Prediction is just like tossing a (biased!) coin • “Head” is a “success”, “tail” is an “error” • In statistics, a succession of independent events like this is called a Bernoulli process • Statistical theory provides us with confidence intervals for the true underlying proportion

  6. Confidence intervals • We can say: p lies within a certain specified interval with a certain specified confidence • Example: S=750 successes in N=1000 trials • Estimated success rate: 75% • How close is this to the true success rate p? • Answer: with 80% confidence, p ∈ [73.2%, 76.7%] • Another example: S=75 and N=100 • Estimated success rate: 75% • With 80% confidence, p ∈ [69.1%, 80.1%] • I.e. the probability that p ∈ [69.1%, 80.1%] is 0.8 • The bigger N is, the more confident we are, i.e. the surrounding interval is smaller • Above, for N=100 we were less confident than for N=1000

  7. Mean and Variance • Let Y be the random variable with possible values 1 for success and 0 for error • Let probability of success be p • Then probability of error is q = 1 − p • What’s the mean? 1·p + 0·q = p • What’s the variance? (1 − p)²·p + (0 − p)²·q = q²·p + p²·q = pq(p + q) = pq

  8. Estimating p • Well, we don’t know p. Our goal is to estimate p • For this we make N trials, i.e. tests • The more trials we do, the more confident we are • Let S be the random variable denoting the number of successes, i.e. S is the sum of N value samplings of Y • Now, we approximate p with the success rate in N trials, i.e. S/N • By the Central Limit Theorem, when N is big, the probability distribution of the random variable f = S/N is approximated by a normal distribution with • mean p and • variance pq/N

  9. Estimating p • A c% confidence interval [−z ≤ X ≤ z] for a random variable X with 0 mean is given by: Pr[−z ≤ X ≤ z] = c • With a symmetric distribution: Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z] • From the table of confidence limits for the normal distribution with 0 mean and a variance of 1: Pr[−1.65 ≤ X ≤ 1.65] = 90% • To use this we have to reduce our random variable f = S/N to have 0 mean and unit variance

  10. Estimating p • Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90% • To use this we have to reduce our random variable S/N to 0 mean and unit variance, dividing its deviation from p by its standard deviation σ_{S/N} = sqrt(pq/N): Pr[−1.65 ≤ (S/N − p) / σ_{S/N} ≤ 1.65] = 90% • Now we solve two equations: (S/N − p) / σ_{S/N} = 1.65 and (S/N − p) / σ_{S/N} = −1.65

  11. Estimating p • Let N = 100 and S = 70 • σ_{S/N} is sqrt(pq/N) and we approximate it by sqrt(p'(1 − p')/N), where p' is the estimation of p, i.e. 0.7 • So σ_{S/N} is approximated by sqrt(0.7 × 0.3 / 100) ≈ 0.046 • The two equations become: (0.7 − p) / 0.046 = 1.65, giving p = 0.7 − 1.65 × 0.046 = 0.624, and (0.7 − p) / 0.046 = −1.65, giving p = 0.7 + 1.65 × 0.046 = 0.776 • Thus we say: with 90% confidence, the success rate p of the classifier satisfies 0.624 ≤ p ≤ 0.776
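
As a quick arithmetic check (not part of the original deck), a few lines of Python reproduce this 90% interval; the z = 1.65 value and the sqrt(p'(1 − p')/N) approximation are taken directly from the slide.

    # Quick check of the slide's arithmetic: N = 100, S = 70, 90% confidence (z = 1.65).
    from math import sqrt

    N, S, z = 100, 70, 1.65
    p_prime = S / N                              # estimate of p, 0.7
    sigma = sqrt(p_prime * (1 - p_prime) / N)    # approximates sqrt(pq/N), ~0.046

    low, high = p_prime - z * sigma, p_prime + z * sigma
    print(f"{low:.3f} <= p <= {high:.3f}")       # ~0.624 <= p <= 0.776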

  12. Exercise • Suppose I want to be 95% confident in my estimation • Looking at a detailed table we find: Pr[−2 ≤ X ≤ 2] ≈ 95% • Normalizing S/N, we need to solve: (S/N − p) / σ_{S/N} = 2 and (S/N − p) / σ_{S/N} = −2 • We approximate σ_{S/N} with sqrt(p'(1 − p')/N), where p' is the estimation of p through trials, i.e. S/N • So we need to solve: (p' − p) / sqrt(p'(1 − p')/N) = ±2 • So, p = p' ∓ 2 × sqrt(p'(1 − p')/N)

  13. Exercise • Suppose N=1000 trials, S=590 successes • p' = S/N = 590/1000 = 0.59 • sqrt(p'(1 − p')/N) = sqrt(0.59 × 0.41 / 1000) ≈ 0.0156 • So with 95% confidence, p = 0.59 ∓ 2 × 0.0156, i.e. 0.559 ≤ p ≤ 0.621 (a quick check of this arithmetic is sketched below)
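
A quick check of the exercise in plain Python; the z ≈ 2 value for 95% confidence is taken from the previous slide.

    # Quick check: N = 1000, S = 590, 95% confidence (z ~ 2, as on the previous slide).
    from math import sqrt

    N, S, z = 1000, 590, 2.0
    p_prime = S / N                              # 0.59
    sigma = sqrt(p_prime * (1 - p_prime) / N)    # ~0.0156

    print(f"{p_prime - z * sigma:.3f} <= p <= {p_prime + z * sigma:.3f}")   # ~0.559 <= p <= 0.621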

  14. Holdout estimation • What to do if the amount of data is limited? • The holdout method reserves a certain amount for testing and uses the remainder for training • Usually: one third for testing, the rest for training • Problem: the samples might not be representative • Example: class might be missing in the test data • Advanced version uses stratification • Ensures that each class is represented with approximately equal proportions in both subsets
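
A minimal sketch of a stratified holdout split, assuming scikit-learn; the iris dataset is just a convenient stand-in.

    # Sketch of a stratified holdout split (assumed library: scikit-learn).
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=0)   # stratify keeps class proportions

    print("class counts in training set:", np.bincount(y_train))
    print("class counts in test set:    ", np.bincount(y_test))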

  15. Repeated holdout method • Holdout estimate can be made more reliable by repeating the process with different subsamples • In each iteration, a certain proportion is randomly selected for training (possibly with stratification) • The error rates on the different iterations are averaged to yield an overall error rate • This is called the repeated holdout method • Still not optimum: the different test sets overlap • Can we prevent overlapping?
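
A sketch of the repeated holdout idea under the same assumptions (scikit-learn, a placeholder decision-tree learner): repeat the split with different random seeds and average the test errors.

    # Sketch of the repeated holdout method: average the error over several random splits.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    errors = []
    for seed in range(10):                       # 10 different random (stratified) subsamples
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=1/3, stratify=y, random_state=seed)
        clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        errors.append(1 - clf.score(X_te, y_te))
    print(f"repeated-holdout error estimate: {np.mean(errors):.3f}")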

  16. Cross-validation • Cross-validation avoids overlapping test sets • First step: split data into k subsets of equal size • Second step: use each subset in turn for testing, the remainder for training • Called k-fold cross-validation • Often the subsets are stratified before the cross-validation is performed • The error estimates are averaged to yield an overall error estimate • Standard method for evaluation: stratified 10-fold cross-validation
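
A sketch of stratified 10-fold cross-validation, again assuming scikit-learn; the dataset and learner are placeholders.

    # Sketch of stratified 10-fold cross-validation (assumed library: scikit-learn).
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)

    # Each fold is used once for testing; the k error estimates are averaged.
    print(f"cross-validated error estimate: {1 - np.mean(scores):.3f}")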

  17. Leave-One-Out cross-validation • Leave-One-Out: a particular form of cross-validation • Set the number of folds to the number of training instances • I.e., for n training instances, build the classifier n times • Makes best use of the data • Involves no random subsampling • But computationally expensive
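
Leave-One-Out is available as a cross-validation splitter in scikit-learn; a minimal sketch, with a placeholder dataset and learner:

    # Sketch of leave-one-out cross-validation: n folds for n instances.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                             cv=LeaveOneOut())   # builds the classifier n times
    print(f"leave-one-out error estimate: {1 - np.mean(scores):.3f}")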

  18. Leave-One-Out-CV and stratification • Disadvantage of Leave-One-Out-CV: stratification is not possible • It guarantees a non-stratified sample because there is only one instance in the test set! • Extreme example: completely random dataset split equally into two classes • Best inducer predicts majority class • 50% accuracy on fresh data • Leave-One-Out-CV estimate is 100% error!
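
The extreme example can be checked in a few lines of plain Python: with a perfectly balanced random dataset, a majority-class predictor is wrong on every leave-one-out fold, because removing the test instance always tips the majority to the other class.

    # The extreme example in a few lines: balanced random data, majority-class predictor.
    import numpy as np

    n = 100
    y = np.array([0] * (n // 2) + [1] * (n // 2))      # 50/50 class split; features are irrelevant here

    errors = 0
    for i in range(n):
        train_labels = np.delete(y, i)                 # leave instance i out
        majority = np.bincount(train_labels).argmax()  # removing a 0 makes 1 the majority, and vice versa
        errors += int(majority != y[i])

    print(f"leave-one-out error: {errors / n:.0%}")    # 100%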

  19. The bootstrap • Cross Validation uses sampling without replacement • The same instance, once selected, can not be selected again for a particular training/test set • The bootstrap uses sampling with replacement to form the training set • Sample a dataset of n instances n times with replacement to form a new dataset of n instances • Use this data as the training set • Use the instances from the original dataset that don’t occur in the new training set for testing • Also called the 0.632 bootstrap (Why?)
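
A sketch of forming one bootstrap training set and its corresponding test set with NumPy; the "instances" here are just the indices 0..n−1.

    # Sketch: one bootstrap sample with NumPy.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    instances = np.arange(n)

    train = rng.choice(instances, size=n, replace=True)   # n draws with replacement
    test = np.setdiff1d(instances, train)                 # instances never drawn -> test set

    print(f"distinct instances in training set: {len(np.unique(train)) / n:.1%}")   # ~63.2%
    print(f"instances left for testing:         {len(test) / n:.1%}")               # ~36.8%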

  20. The 0.632 bootstrap • Also called the 0.632 bootstrap • A particular instance has a probability of 1 − 1/n of not being picked on any single draw • Thus its probability of ending up in the test data is: (1 − 1/n)^n ≈ e^−1 ≈ 0.368 • This means the training data will contain approximately 63.2% of the instances
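
A quick numeric check of the limit behind the name (plain Python):

    # Numeric check: (1 - 1/n)^n approaches e^-1 ~ 0.368 as n grows.
    import math

    for n in (10, 100, 1000, 10000):
        print(f"n = {n:>5}:  (1 - 1/n)^n = {(1 - 1 / n) ** n:.4f}")
    print(f"limit: e^-1 = {math.exp(-1):.4f}  ->  ~63.2% of instances end up in training")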

  21. Estimating error with the bootstrap • The error estimate on the test data will be very pessimistic • Trained on just ~63% of the instances • Therefore, combine it with the resubstitution error: err = 0.632 × e_test instances + 0.368 × e_training instances • The resubstitution error gets less weight than the error on the test data • Repeat the process several times with different replacement samples; average the results • Probably the best way of estimating performance for very small datasets
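
A minimal sketch of the 0.632 combination; the error values plugged in below are made-up placeholders, not results from the slides.

    # Sketch of the 0.632 combination; the error values are placeholders.
    def bootstrap_632_error(test_error, resubstitution_error):
        """Weighted combination used by the 0.632 bootstrap."""
        return 0.632 * test_error + 0.368 * resubstitution_error

    # Average over several bootstrap repetitions (placeholder numbers):
    estimates = [bootstrap_632_error(0.30, 0.05),
                 bootstrap_632_error(0.28, 0.06),
                 bootstrap_632_error(0.32, 0.04)]
    print(f"bootstrap error estimate: {sum(estimates) / len(estimates):.3f}")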

  22. Comparing data mining schemes • Frequent question: which of two learning schemes performs better? • Note: this is domain dependent! • Obvious way: compare 10-fold cross-validation estimates • Problem: variance in estimate • Variance can be reduced using repeated CV • However, we still don’t know whether the results are reliable (we need a statistical test for that)

  23. Counting the cost • In practice, different types of classification errors often incur different costs • Examples: • Cancer detection: predicting “not cancer” is correct 99% of the time, but a missed cancer is far more costly than a false alarm • Loan decisions • Fault diagnosis • Promotional mailing

  24. Counting the cost • The confusion matrix (for a two-class problem):

                        Predicted: yes          Predicted: no
    Actual: yes         true positive (TP)      false negative (FN)
    Actual: no          false positive (FP)     true negative (TN)
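
For completeness, a sketch of computing a confusion matrix with scikit-learn; the label vectors below are hypothetical.

    # Sketch of computing a confusion matrix (assumed library: scikit-learn).
    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes (hypothetical)
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # predicted classes (hypothetical)

    # Rows = actual class (0, 1), columns = predicted class (0, 1):
    # [[TN, FP],
    #  [FN, TP]]
    print(confusion_matrix(y_true, y_pred))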

  25. Paired t-test • Student’s t-test tells whether the means of two samples are significantly different. • In our case the samples are cross-validation samples for different datasets from the domain • Use a paired t-test because the individual samples are paired • The same Cross Validation is applied twice

  26. Distribution of the means • x1, x2, …, xk and y1, y2, …, yk are the 2k samples for the k different datasets • m_x and m_y are the means • With enough samples, the mean of a set of independent samples is normally distributed • Estimated variances of the means are s_x²/k and s_y²/k • If μ_x and μ_y are the true means, then the following are approximately normally distributed with mean 0 and variance 1: (m_x − μ_x) / sqrt(s_x²/k) and (m_y − μ_y) / sqrt(s_y²/k)

  27. Distribution of the differences • Let m_d = m_x − m_y • The difference of the means (m_d) has a Student’s distribution with k − 1 degrees of freedom • Let s_d² be the estimated variance of the differences • The standardized version of m_d is called the t-statistic: t = m_d / sqrt(s_d²/k)

  28. Performing the test • Fix a significance level α • If a difference is significant at the α% level, there is a (100 − α)% chance that the true means differ • Divide the significance level by two because the test is two-tailed • Look up the value z that corresponds to α/2 • If t ≤ −z or t ≥ z then the difference is significant • I.e. the null hypothesis (that the difference is zero) can be rejected

  29. Example • We have compared two classifiers through cross-validation on 10 different datasets • The error rates are:

    Dataset   Classifier A   Classifier B   Difference
    1         10.6           10.2           0.4
    2          9.8            9.4           0.4
    3         12.3           11.8           0.5
    4          9.7            9.1           0.6
    5          8.8            8.3           0.5
    6         10.6           10.2           0.4
    7          9.8            9.4           0.4
    8         12.3           11.8           0.5
    9          9.7            9.1           0.6
    10         8.8            8.3           0.5

  30. Example • m_d = 0.48 • s_d = 0.0789 • t = m_d / sqrt(s_d²/k) = 0.48 / (0.0789 / sqrt(10)) ≈ 19.24 • The critical value of t for a two-tailed test at α = 10% with 9 degrees of freedom is 1.83 • 19.24 is far bigger than 1.83, so the difference is significant: classifier B is much better than A
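
A sketch that re-derives these numbers with NumPy/SciPy, using the differences from the table on the previous slide; the 1.83 critical value corresponds to stats.t.ppf(0.95, df=9).

    # Re-derive the example's numbers; differences are taken from the previous slide.
    import numpy as np
    from scipy import stats

    diffs = np.array([0.4, 0.4, 0.5, 0.6, 0.5, 0.4, 0.4, 0.5, 0.6, 0.5])
    k = len(diffs)

    m_d = diffs.mean()                   # 0.48
    s_d = diffs.std(ddof=1)              # ~0.0789 (sample standard deviation)
    t = m_d / (s_d / np.sqrt(k))         # ~19.24

    crit = stats.t.ppf(0.95, df=k - 1)   # ~1.83 (two-tailed test at the 10% level)
    print(f"t = {t:.2f}, critical value = {crit:.2f}, significant: {abs(t) > crit}")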
