
Evaluating Hypotheses


Presentation Transcript


  1. Evaluating Hypotheses: How good is my classifier?

  2. How good is my classifier? We have seen the accuracy metric: classifier performance on a test set.
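
For concreteness, a minimal R sketch of the accuracy metric; the vectors predicted and actual are hypothetical placeholders for the classifier's output and the true labels on a test set:

      predicted <- c("pos", "pos", "neg", "neg", "pos")   # hypothetical classifier output
      actual    <- c("pos", "neg", "neg", "neg", "pos")   # hypothetical true labels
      mean(predicted == actual)                           # accuracy = fraction correct, 0.8 here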

  3. First and foremost… If we are to trust a classifier's results, we must keep the classifier blindfolded: make sure the classifier never sees the test data. Beware when things seem too good to be true…

  4. Confusion Matrix: beyond a single accuracy number, we could collect more information about where the classifier is right and wrong.
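
As a sketch of what that extra information looks like: cross-tabulating hypothetical predicted and actual label vectors with R's table() yields a confusion matrix, with one cell per (true class, predicted class) pair:

      predicted <- c("pos", "pos", "neg", "neg", "pos")   # hypothetical classifier output
      actual    <- c("pos", "neg", "neg", "neg", "pos")   # hypothetical true labels
      table(actual, predicted)                            # rows = true class, columns = predicted class
      #         predicted
      # actual  neg pos
      #    neg    2   1
      #    pos    0   2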

  5. Sensitivity vs. Specificity. Sensitivity: out of the things that really are positive, how many did we detect? Specificity: out of the things that really are negative, how many did we correctly leave alone? The classifier becomes less sensitive as it begins missing what it is trying to detect, and if it identifies more and more things as the target class it begins to get less specific.
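
A short R sketch of the two rates, using hypothetical counts of the kind read off a confusion matrix (TP, FN, TN, FP are made-up numbers): sensitivity is the fraction of actual positives that get detected, and specificity is the fraction of actual negatives that are correctly left alone.

      TP <- 90   # hypothetical true positives
      FN <- 10   # hypothetical false negatives (positives that were missed)
      TN <- 80   # hypothetical true negatives
      FP <- 20   # hypothetical false positives (negatives flagged as positive)

      TP / (TP + FN)   # sensitivity: 0.9
      TN / (TN + FP)   # specificity: 0.8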

  6. Can we quantify our uncertainty? Once we're sure no cheating is going on: will the accuracy hold with brand-new, never-before-seen data?

  7. Binomial Distribution: the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments. Successes or failures—just what we're looking for!

  8. Binomial Distribution: the probability that the random variable R will take on a specific value r. This might be the probability of an error or of a positive; since we have been working with accuracy, let's go with positives (the book works with errors).

  9. Binomial Distribution: very simple calculations.
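
As a concrete sketch with made-up numbers: the binomial probability of exactly r successes in n trials is choose(n, r) * p^r * (1 - p)^(n - r), which R computes directly with dbinom():

      n <- 100    # hypothetical number of test examples
      r <- 83     # hypothetical number of correct predictions (successes)
      p <- 0.80   # assumed true accuracy

      dbinom(r, n, p)                          # P(R = r), probability of exactly r successes
      choose(n, r) * p^r * (1 - p)^(n - r)     # the same value, written out from the formula
      pbinom(r, n, p)                          # P(R <= r), cumulative probability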

  10. What does this mean? We can use the observed proportion of successes, r/n, as an estimator of p. Now we have an estimate of p and the distribution given p, so we have the tools to figure out how confident we should be in our estimator.

  11. The question: how confident should I be in the accuracy measure? If we can live with statements like "95% of the accuracy measures will fall in the range of 94% to 97%", life is good. That range is a confidence interval.

  12. How do we calculate it? We want the quantiles where the area outside is 5%. We can estimate p, and there are tools available in most programming languages.

  13. Example, in R:

      lb = qbinom(.025, n, p)
      ub = qbinom(.975, n, p)

      The lower and upper bounds constitute the confidence interval (qbinom returns counts of successes, so dividing by n expresses the bounds as accuracies).
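
A slightly fuller sketch of the same calculation with made-up numbers:

      n <- 300                    # hypothetical test set size
      p <- 250 / 300              # accuracy estimated from the test set
      lb <- qbinom(.025, n, p)    # 2.5% quantile, as a count of successes
      ub <- qbinom(.975, n, p)    # 97.5% quantile, as a count of successes
      c(lb, ub) / n               # the interval expressed as accuracies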

  14. Still, are we really this confident? What if none of the small cluster of Blues were in the training set? All of them would be in the test set; how well would it do? This is the difference between sample error and true error: the measured accuracy might have been an accident, a pathological case.

  15. Cross-Validation: what if we could test the classifier several times with different test sets? If it performed well each time, wouldn't we be more confident in the results? Reproducibility. Consistency.

  16. K-fold Cross-Validation: usually we have one big chunk of training data. If we bust it up into randomly drawn chunks, we can train on the remainder and test with each chunk.

  17. K-fold Cross-Validation: with 10 chunks we train 10 times, and end up with performance data on ten completely different test datasets.
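
A minimal sketch of the bookkeeping in R, assuming a hypothetical data frame dat with a label column, and hypothetical functions trainClassifier() and predictClassifier() standing in for whatever learner is being evaluated:

      k <- 10
      n <- nrow(dat)
      folds <- sample(rep(1:k, length.out = n))        # randomly assign each row to one of k chunks

      accuracies <- numeric(k)
      for (i in 1:k) {
          test  <- dat[folds == i, ]                   # the held-out chunk
          train <- dat[folds != i, ]                   # train on the remainder
          model <- trainClassifier(train)              # hypothetical training function
          pred  <- predictClassifier(model, test)      # hypothetical prediction function
          accuracies[i] <- mean(pred == test$label)    # accuracy on this fold
      }
      mean(accuracies)                                 # performance averaged over the k runs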

  18. Remember, no cheating: the classifier must stay blindfolded while training, and must discard all lessons after each fold.

  19. 10-fold appears to be the most common default: Weka and DataMiner both default to 10-fold. It could just as easily be 20-fold or 25-fold; with 20-fold it would be a 95-5 split. Performance is reported as the average accuracy across the K runs.

  20. K-Fold: what is the best K? It is related to the question of how large the training set should be: large enough to support a test set of size n that satisfies the rule of thumb of at least 30 examples, with an accuracy not too close to 0 or 1. For ten-fold, if 1/10th must be 30, the training set must be 300. If 10-fold satisfies this, we should be in good shape.

  21. Can even use folds of size 1 (K = n): this is called leave-one-out. Disadvantage: slow. It gives the largest possible training set and the smallest possible test set. It has been promoted as an unbiased estimator of error, but recent studies indicate that there is no unbiased estimator.

  22. Confidence Interval Recap: we can calculate a confidence interval with a single test set. More runs (K-fold) give us more confidence that we didn't just get lucky in test set selection. Do these runs help narrow the confidence interval?

  23. When we average the performance, the Central Limit Theorem applies: as the number of runs grows, the distribution of the average approaches normal. With a reasonably large number of runs we can derive a more trustworthy confidence interval. With 30 test runs (30-fold) we can use traditional approaches to calculating means and standard deviations, and therefore confidence intervals.

  24. Central Limit Theorem: consider a set of independent, identically distributed random variables Y1…Yn governed by an arbitrary probability distribution with mean μ and finite variance σ². Define the sample mean Ȳn = (1/n) Σ Yi. Then as n → ∞, the distribution governing √n (Ȳn − μ)/σ approaches a Normal distribution with zero mean and standard deviation equal to 1. Book: "This is a quite surprising fact, because it states that we know the form of the distribution that governs the sample mean even when we do not know the form of the underlying distribution that governs the individual Yi."

  25. Checking accuracy in R:

      meanAcc = mean(accuracies)
      sdAcc = sd(accuracies)
      qnorm(.975, meanAcc, sdAcc)   # 0.9980772
      qnorm(.025, meanAcc, sdAcc)   # 0.8169336
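
One hedged side note, as a variant rather than a correction: if the interval is meant to describe the average accuracy itself (rather than the accuracy of a single run), the usual normal-theory calculation divides the standard deviation by the square root of the number of runs:

      k <- length(accuracies)
      qnorm(c(.025, .975), mean(accuracies), sd(accuracies) / sqrt(k))   # interval for the mean accuracy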

  26. My Classifier's Better than Yours: can we say that one classifier is significantly better than another? Use a t-test. Null hypothesis: the two sets of results come from the same distribution.

  27. T-test in R:

      t.test(distOne, distTwo, paired = TRUE)

      Paired t-test
      data:  distOne and distTwo
      t = -55.8756, df = 29, p-value < 2.2e-16
      alternative hypothesis: true difference in means is not equal to 0
      95 percent confidence interval:
       -0.2052696 -0.1907732
      sample estimates:
      mean of the differences
            -0.1980214
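
A sketch of where the two vectors might come from, with made-up per-fold accuracies; paired = TRUE assumes both classifiers were evaluated on the same folds, so the values line up pairwise:

      # hypothetical per-fold accuracies for two classifiers run on the same 10 folds
      distOne <- c(0.91, 0.89, 0.93, 0.90, 0.92, 0.88, 0.94, 0.90, 0.91, 0.89)
      distTwo <- c(0.95, 0.93, 0.96, 0.94, 0.95, 0.92, 0.97, 0.94, 0.95, 0.93)

      t.test(distOne, distTwo, paired = TRUE)   # a tiny p-value here would reject the null hypothesis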

  28. T-test in Perl:

      use Statistics::TTest;
      my $ttest = new Statistics::TTest;
      $ttest->load_data(\@r1, \@r2);
      $ttest->set_significance(95);
      $ttest->print_t_test();
      print "\n\nt statistic is ".$ttest->t_statistic."\n";
      print "p val ".$ttest->{t_prob}."\n";

      Output (excerpt):
      t_prob: 0
      significance: 95
      …
      df1: 29
      alpha: 0.025
      t_statistic: 12.8137016607408
      null_hypothesis: rejected
      t statistic is 12.8137016607408
      p val 0

  29. Example: would you trust this classifier?

      "The classifier performed exceptionally well, achieving 99.9% classifier accuracy on the 1,000-member training set."

      "The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training set of size 1,000."

      "The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training set of size 1,000. The variance in the ten accuracy measures indicates a 95% confidence interval of 97%-98%."

      "The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 30-fold cross-validation on a training set of size 1,000. The variance in the thirty accuracy measures indicates a 95% confidence interval of 97%-98%."

  30. A Useful Technique: randomly permute an array (from the Perl Cookbook, http://docstore.mik.ua/orelly/perl/cookbook/ch04_18.htm)

      sub fisher_yates_shuffle {
          my $array = shift;
          my $i;
          for ($i = @$array; --$i; ) {
              my $j = int rand ($i+1);
              next if $i == $j;
              @$array[$i,$j] = @$array[$j,$i];
          }
      }

  31. Evaluating Hypotheses

  32. What about chi-squared?
