
Evaluation (practice)




  1. Evaluation (practice)

  2. Predicting performance • Assume the estimated error rate is 25%. How close is this to the true error rate? • Depends on the amount of test data • Prediction is just like tossing a (biased!) coin • “Head” is a “success”, “tail” is an “error” • In statistics, a succession of independent events like this is called a Bernoulli process • Statistical theory provides us with confidence intervals for the true underlying proportion

  3. Confidence intervals • We can say: p lies within a certain specified interval with a certain specified confidence • Example: S=750 successes in N=1000 trials • Estimated success rate: 75% • How close is this to the true success rate p? • Answer: with 80% confidence, p ∈ [73.2%, 76.7%] • Another example: S=75 and N=100 • Estimated success rate: 75% • With 80% confidence, p ∈ [69.1%, 80.1%] • I.e. the probability that p ∈ [69.1%, 80.1%] is 0.8. • The bigger N is, the more confident we are, i.e. the surrounding interval is smaller. • Above, for N=100 we were less confident than for N=1000.
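
The slide's intervals can be reproduced in a few lines. They match the Wilson score interval as implemented in statsmodels (identifying the slide's method as Wilson is our assumption, but the numbers agree):

```python
# Reproduce the slide's 80% confidence intervals for a Bernoulli proportion.
from statsmodels.stats.proportion import proportion_confint

for S, N in [(750, 1000), (75, 100)]:
    low, high = proportion_confint(count=S, nobs=N, alpha=0.20, method="wilson")
    print(f"S={S}, N={N}: 80% CI for p = [{low:.3f}, {high:.3f}]")
# S=750, N=1000: 80% CI for p = [0.732, 0.767]
# S=75,  N=100:  80% CI for p = [0.691, 0.801]
```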

  4. Mean and Variance • Let Y be the random variable with possible values 1 for success and 0 for error. • Let the probability of success be p. • Then the probability of error is q = 1 − p. • What’s the mean? 1·p + 0·q = p • What’s the variance? (1−p)²·p + (0−p)²·q = q²·p + p²·q = pq(q+p) = pq
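
The claim is easy to check numerically. A minimal sketch with numpy (p = 0.75 is an arbitrary choice of ours):

```python
# Empirically check that a Bernoulli(p) variable has mean p and variance pq.
import numpy as np

rng = np.random.default_rng(0)
p = 0.75                                # arbitrary success probability
y = rng.binomial(1, p, size=1_000_000)  # one million Bernoulli draws
print(y.mean(), y.var())                # ~0.75 and ~0.1875 (= p*q = 0.75*0.25)
```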

  5. Estimating p • Well, we don’t know p. Our goal is to estimate p. • For this we make N trials, i.e. tests. • The more trials we do, the more confident we are. • Let S be the random variable denoting the number of successes, i.e. S is the sum of N independent samples of Y. • Now, we approximate p with the success rate in N trials, i.e. S/N. • By the Central Limit Theorem, when N is big, the probability distribution of the random variable f = S/N is approximated by a normal distribution with • mean p and • variance pq/N.
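
A quick simulation of the CLT claim, assuming numpy (p, N, and the repetition count are arbitrary choices):

```python
# Across many repetitions of N trials, f = S/N is roughly normal
# with mean p and standard deviation sqrt(pq/N).
import numpy as np

rng = np.random.default_rng(0)
p, N, reps = 0.75, 100, 100_000
f = rng.binomial(N, p, size=reps) / N   # success rate in each repetition
print(f.mean(), f.std())                # ~0.75 and ~sqrt(0.75*0.25/100) = 0.0433
```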

  6. Estimating p • The c% confidence interval [−z ≤ X ≤ z] for a random variable X with mean 0 is given by: Pr[−z ≤ X ≤ z] = c • With a symmetric distribution: Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z] • Confidence limits for the normal distribution with mean 0 and variance 1 come from a standard normal table; e.g. Pr[−1.65 ≤ X ≤ 1.65] = 90% • To use this we have to reduce our random variable f = S/N to have mean 0 and unit variance
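
The confidence limits referred to here (the slide's table did not survive extraction) can be regenerated with scipy; the printed values are standard normal quantiles:

```python
# Recover the z values behind the slide's confidence limits.
from scipy.stats import norm

for c in [0.80, 0.90, 0.95, 0.99]:
    z = norm.ppf(1 - (1 - c) / 2)       # two-sided limit: Pr[-z <= X <= z] = c
    print(f"{c:.0%}: z = {z:.3f}")
# 80%: z = 1.282   90%: z = 1.645   95%: z = 1.960   99%: z = 2.576
```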

  7. Estimating p Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90% To use this we have to reduce our random variable f = S/N to have mean 0 and unit variance, dividing by its standard deviation σ_f = sqrt(pq/N): Pr[−1.65 ≤ (S/N − p) / σ_f ≤ 1.65] = 90% Now we solve two equations: (S/N − p) / σ_f = 1.65 (S/N − p) / σ_f = −1.65

  8. Estimating p Let N=100, and S=70 σ_f is sqrt(pq/N), and we approximate it by sqrt(p'(1−p')/N), where p' is the estimate of p, i.e. 0.7 So, σ_f is approximated by sqrt(.7×.3/100) = .046 The two equations become: (0.7 − p) / .046 = 1.65, giving p = .7 − 1.65×.046 = .624 (0.7 − p) / .046 = −1.65, giving p = .7 + 1.65×.046 = .776 Thus, we say: with 90% confidence the success rate p of the classifier satisfies 0.624 ≤ p ≤ 0.776
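
The same computation in code, assuming scipy (the slide rounds z = 1.6449 to 1.65, so the endpoints differ in the third decimal):

```python
# Reproduce the slide's 90% interval for N=100, S=70 via the normal approximation.
from math import sqrt
from scipy.stats import norm

N, S = 100, 70
p_hat = S / N                             # estimated success rate f = S/N
sigma = sqrt(p_hat * (1 - p_hat) / N)     # approximation of sigma_f
z = norm.ppf(0.95)                        # ~1.645 for a two-sided 90% interval
print(p_hat - z * sigma, p_hat + z * sigma)   # ~0.625, ~0.775
```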

  9. Exercise • Suppose I want to be 95% confident in my estimation. • Looking at a detailed table we find: Pr[−2 ≤ X ≤ 2] ≈ 95% • Normalizing f = S/N, we need to solve: (S/N − p) / σ_f = 2 and (S/N − p) / σ_f = −2 • We approximate σ_f with sqrt(p'(1−p')/N), where p' is the estimate of p through trials, i.e. S/N • So we need to solve: (p' − p) / sqrt(p'(1−p')/N) = ±2 • So, p = p' ∓ 2·sqrt(p'(1−p')/N)
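
This recipe fits in a small helper; a sketch, with a function name and interface of our own choosing:

```python
# Normal-approximation confidence interval for a success rate p,
# following the slide's recipe (the helper itself is not from the slides).
from math import sqrt
from scipy.stats import norm

def success_rate_interval(S, N, confidence=0.95):
    """Return (low, high) bounds for the true success rate p."""
    p_hat = S / N
    z = norm.ppf(1 - (1 - confidence) / 2)   # the slide rounds z to 2 for 95%
    half = z * sqrt(p_hat * (1 - p_hat) / N)
    return p_hat - half, p_hat + half
```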

  10. Exercise • Suppose N=1000 trials, S=590 successes • p' = S/N = 590/1000 = .59 • σ_f ≈ sqrt(.59 × .41 / 1000) = .0156 • p = .59 ∓ 2 × .0156, so with 95% confidence: 0.559 ≤ p ≤ 0.621
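
Using the hypothetical helper from the previous slide on these numbers:

```python
# The helper uses z = 1.96 rather than the slide's rounded z = 2,
# so the bounds differ slightly from the hand computation.
low, high = success_rate_interval(S=590, N=1000, confidence=0.95)
print(f"[{low:.3f}, {high:.3f}]")   # ~[0.560, 0.620]; the slide gets [0.559, 0.621]
```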

  11. Cross-validation • k-fold cross-validation: • First step: split the data into k subsets of equal size • Second step: use each subset in turn for testing, the remainder for training • The k error estimates are averaged to yield an overall error estimate
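
A minimal sketch of k-fold cross-validation with scikit-learn; the dataset and classifier are placeholder choices of ours, not prescribed by the slides:

```python
# 10-fold cross-validation: each fold is held out once for testing,
# and the per-fold accuracies are averaged into one error estimate.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(1 - scores.mean())   # overall error estimate over the 10 folds
```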

  12. Comparing data mining schemes • Frequent question: which of two learning schemes performs better? • Obvious way: compare, for example, 10-fold cross-validation estimates • Problem: variance in the estimate • We don’t know whether the results are reliable • We need to use a statistical test for that

  13. Paired t-test • Student’s t-test tells us whether the means of two samples are significantly different. • In our case the samples are cross-validation estimates for different datasets from the domain • We use a paired t-test because the individual samples are paired • The same cross-validation split is applied to both schemes

  14. Distribution of the means • x1, x2, …, xk and y1, y2, …, yk are the 2k samples for the k different datasets • mx and my are the means • With enough samples, the mean of a set of independent samples is normally distributed • The estimated variances of the means are sx²/k and sy²/k • If μx and μy are the true means, then the following are approximately normally distributed with mean 0 and variance 1: (mx − μx) / sqrt(sx²/k) and (my − μy) / sqrt(sy²/k)

  15. Student’s distribution • With small samples (k < 30) the mean follows Student’s distribution with k−1 degrees of freedom • similar shape, but wider than the normal distribution • Confidence limits (mean 0 and variance 1) come from a table of Student’s distribution; they are larger than the corresponding normal limits for small k
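
The table of limits on this slide did not survive extraction; scipy can regenerate the comparison (df = 9 matches the 10-dataset example coming up):

```python
# Student's-t confidence limits are visibly wider than normal ones for small k.
from scipy.stats import norm, t

for c in [0.90, 0.95, 0.99]:
    q = 1 - (1 - c) / 2   # upper quantile for a two-sided limit
    print(f"{c:.0%}: normal z = {norm.ppf(q):.2f}, t(df=9) = {t.ppf(q, df=9):.2f}")
# 90%: normal z = 1.64, t(df=9) = 1.83
# 95%: normal z = 1.96, t(df=9) = 2.26
# 99%: normal z = 2.58, t(df=9) = 3.25
```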

  16. Distribution of the differences • Let md = mx − my • The difference of the means (md) also has a Student’s distribution with k−1 degrees of freedom • Let sd² be the estimated variance of the differences di = xi − yi • The standardized version of md is called the t-statistic: t = md / sqrt(sd²/k)

  17. Performing the test • Fix a significance level α • If a difference is significant at the α% level, there is a (100−α)% chance that the true means differ • Divide the significance level by two because the test is two-tailed • Look up the value for z that corresponds to α/2 • If t ≤ −z or t ≥ z then the difference is significant • I.e. the null hypothesis (that the difference is zero) can be rejected

  18. Example • We have compared two classifiers through cross-validation on 10 different datasets. • The error rates are:

  Dataset   Classifier A   Classifier B   Difference
     1          10.6           10.2           .4
     2           9.8            9.4           .4
     3          12.3           11.8           .5
     4           9.7            9.1           .6
     5           8.8            8.3           .5
     6          10.6           10.2           .4
     7           9.8            9.4           .4
     8          12.3           11.8           .5
     9           9.7            9.1           .6
    10           8.8            8.3           .5

  19. Example • md = 0.48 • sd = 0.0789 • The t-statistic is: t = md / sqrt(sd²/k) = 0.48 / (0.0789 / sqrt(10)) = 19.24 • The critical value of t for a two-tailed test at α = 10% and 9 degrees of freedom is: 1.83 • 19.24 is far bigger than 1.83, so classifier B is significantly better than A.
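
The same test with scipy's paired t-test, as a check on the hand computation:

```python
# Paired t-test on the per-dataset error rates from the table above.
from scipy.stats import ttest_rel

a = [10.6, 9.8, 12.3, 9.7, 8.8, 10.6, 9.8, 12.3, 9.7, 8.8]   # classifier A
b = [10.2, 9.4, 11.8, 9.1, 8.3, 10.2, 9.4, 11.8, 9.1, 8.3]   # classifier B
t_stat, p_value = ttest_rel(a, b)
print(t_stat)   # ~19.24, matching the slide
```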

  20. Dependent estimates • We assumed that we have enough data to create several datasets of the desired size • We need to reuse data if that’s not the case • E.g. running cross-validations with different randomizations on the same data • The samples become dependent, so insignificant differences can become significant • A heuristic fix is the corrected resampled t-test: • Assume we use the repeated holdout method, with n1 instances for training and n2 for testing • The new test statistic is: t = md / sqrt((1/k + n2/n1) · sd²)
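
A sketch of that statistic in code; the formula matches the corrected resampled t-test of Nadeau and Bengio, which this slide appears to follow, and the function and argument names are our own:

```python
# Corrected resampled t-statistic for k repeated holdout runs.
# d holds the k per-run differences; n_train/n_test are the holdout sizes.
from math import sqrt
import numpy as np

def corrected_resampled_t(d, n_train, n_test):
    d = np.asarray(d, dtype=float)
    k = len(d)
    var = d.var(ddof=1)   # sd^2 over the k differences
    return d.mean() / sqrt((1 / k + n_test / n_train) * var)
```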
