
Evaluating Results of Learning

This article explores the criteria for evaluating the results of machine learning, including accuracy, comprehensibility, and confidence intervals. It discusses various evaluation schemes, such as cross-validation and random sampling, and also covers statistics like calibration, discrimination, sensitivity, specificity, and area under the ROC curve.


Presentation Transcript


  1. Evaluating Results of Learning Blaž Zupan www.ailab.si/blaz/predavanja/uisp

  2. Evaluating ML Results • Criteria • Accuracy of induced concepts (predictive accuracy) • accuracy = probability of correct classification • error rate = 1 - accuracy • Comprehensibility • Both are important, but comprehensibility is hard to measure, so accuracy is usually studied • Kinds of accuracy • Accuracy on the learning data • Accuracy on new data (much more important) • Major topic: estimating accuracy on new data
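
As a small illustration (not part of the original slides), a minimal Python sketch of the two quantities above; the label lists are made up:

    # Accuracy and error rate on a set of classified examples.
    def accuracy(y_true, y_pred):
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
        return correct / len(y_true)

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes (made-up data)
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # classes predicted by some classifier

    acc = accuracy(y_true, y_pred)
    print("accuracy   =", acc)       # probability of correct classification
    print("error rate =", 1 - acc)   # error rate = 1 - accuracy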

  3. Usual Procedure to Estimate Accuracy • All available data is split into a learning set (training set) and a test set (holdout set) • Internal validation: the learning system induces a classifier from the learning set • External validation: the accuracy of the induced classifier is measured on the test set • Main idea: accuracy on test data approximates accuracy on new data
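
A sketch of this holdout procedure in Python, assuming scikit-learn is available (the slides do not prescribe any particular library, dataset, or learner):

    # Holdout evaluation: learn on the training set, estimate accuracy on the test set.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)            # all available data

    # Split into a learning (training) set and a test (holdout) set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)   # induced classifier

    # Accuracy on the test data approximates accuracy on new data.
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))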

  4. Problems • Common mistake • estimating accuracy on new data by accuracy on learning data (resubstitution accuracy) • Size of the data set • hopefully test set is representative for new data • no problem when available data abounds • Scarce data: major problem • much data is needed for successful learning • much data is needed for reliable accuracy estimate

  5. Estimating Accuracy from the Test Set • Consider • The induced classifier classifies a = 73% of test cases correctly • So we expect accuracy on new data close to 73%. But: • How close? • How confident are we in this estimate? (this depends on the size of the testing data set)

  6. Confidence Intervals • Can be used to assess the confidence in our accuracy estimates • [figure: success rate on test data on a 0% to 100% scale, with a 95% confidence interval drawn around the estimate]
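
A sketch of such an interval in Python, using the normal approximation to the binomial distribution (the slides do not specify the method); the test-set sizes are made up to show how the interval narrows as the test set grows:

    import math

    def accuracy_confidence_interval(acc, n, z=1.96):
        # 95% confidence interval for an accuracy estimated from n test cases
        # (normal approximation to the binomial distribution).
        half_width = z * math.sqrt(acc * (1 - acc) / n)
        return max(0.0, acc - half_width), min(1.0, acc + half_width)

    # The same 73% success rate is much more trustworthy on a larger test set.
    for n in (50, 500, 5000):
        lo, hi = accuracy_confidence_interval(0.73, n)
        print(f"n = {n:4d}: 95% CI = [{lo:.3f}, {hi:.3f}]")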

  7. Evaluation Schemes (Sampling Methods)

  8. 3-Fold Cross-Validation • Reorder the data arbitrarily and split it into three folds • Run three train & test iterations (#1, #2, #3): in each, train on two folds and test on the remaining one • Evaluate the statistics for each iteration and then compute the average

  9. k-Fold Cross Validation • Split the data into k subsets of approximately equal size (and class distribution, if stratified) • For i = 1 to k: • Use the i-th subset for testing and the remaining (k-1) subsets for training • Compute the average accuracy • k-fold CV can itself be repeated several times, say, 100
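
A minimal Python sketch of this loop, written out by hand and without stratification; the fit and predict arguments are placeholders for any learning system (they are not from the slides):

    import random

    def k_fold_cv(X, y, k, fit, predict):
        # Plain (non-stratified) k-fold cross-validation.
        # fit(X, y) returns a model; predict(model, X) returns predicted labels.
        indices = list(range(len(X)))
        random.shuffle(indices)                      # reorder the examples arbitrarily
        folds = [indices[i::k] for i in range(k)]    # k subsets of roughly equal size

        accuracies = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = [j for m in range(k) if m != i for j in folds[m]]
            model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
            predictions = predict(model, [X[j] for j in test_idx])
            correct = sum(1 for j, p in zip(test_idx, predictions) if y[j] == p)
            accuracies.append(correct / len(test_idx))
        return sum(accuracies) / k                   # average accuracy over the k folds

Any learner can be plugged in through fit and predict; calling the function several times with fresh shuffles gives the repeated k-fold CV mentioned on the slide.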

  10. Random Sampling (70/30) • Randomly split the data into, say, • 70% of the data for training • 30% of the data for testing • Learn on the training data, test on the testing data • Repeat the procedure, say, 100 times, and compute the average accuracy and its confidence intervals
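
A Python sketch of this repeated 70/30 sampling, again assuming scikit-learn and an arbitrary learner (a decision tree here); the spread of the 100 estimates can then be used to build a confidence interval:

    import statistics
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    accuracies = []
    for seed in range(100):                          # repeat the random split 100 times
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed)  # 70% training, 30% testing
        model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        accuracies.append(model.score(X_te, y_te))   # accuracy on the 30% test split

    mean = statistics.mean(accuracies)
    sd = statistics.stdev(accuracies)
    print(f"average accuracy: {mean:.3f} (standard deviation of the estimates: {sd:.3f})")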

  11. Statistics: Calibration and Discrimination

  12. Calibration and Discrimination • Calibration • how accurate the probabilities assigned by the induced model are • classification accuracy, sensitivity, specificity, ... • Discrimination • how well the model distinguishes between positive and negative cases • area under the ROC curve
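
A rough sketch of one way to look at calibration (not from the slides): bin the predicted probabilities and compare each bin's mean prediction with the observed fraction of positive cases; the labels and probabilities below are made up.

    # Calibration check by binning: well-calibrated probabilities should match
    # the observed positive rate within each bin.
    def calibration_table(y_true, p_pred, bins=5):
        rows = []
        for b in range(bins):
            lo, hi = b / bins, (b + 1) / bins
            in_bin = [(t, p) for t, p in zip(y_true, p_pred)
                      if lo <= p < hi or (b == bins - 1 and p == 1.0)]
            if in_bin:
                mean_p = sum(p for _, p in in_bin) / len(in_bin)     # mean predicted probability
                observed = sum(t for t, _ in in_bin) / len(in_bin)   # observed positive rate
                rows.append((lo, hi, len(in_bin), mean_p, observed))
        return rows

    y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]                          # made-up labels
    p_pred = [0.1, 0.2, 0.3, 0.35, 0.55, 0.6, 0.4, 0.8, 0.85, 0.9]   # made-up probabilities
    for lo, hi, n, mean_p, observed in calibration_table(y_true, p_pred):
        print(f"[{lo:.1f}, {hi:.1f}) n={n} predicted={mean_p:.2f} observed={observed:.2f}")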

  13. Test Statistics: Contingency Table of Classification Results • Predicted positive: true positive (TP), false positive (FP) • Predicted negative: false negative (FN), true negative (TN)

  14. Classification Accuracy • CA = (TP + TN) / N, where N = TP + TN + FP + FN • Proportion of correctly classified examples

  15. Sensitivity • Sensitivity = TP / (TP + FN) • Proportion of correctly detected positive examples • In medicine (+, -: presence and absence of a disease): • chance that our model correctly identifies a patient with a disease

  16. Specificity • Specificity = TN / (FP + TN) • Proportion of correctly detected negative examples • In medicine: • chance that our model correctly identifies a patient without a disease
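
The three statistics from slides 13-16 in one short Python sketch (positive class coded as 1, negative as 0; the example labels are made up):

    def contingency_statistics(y_true, y_pred):
        # Counts of the contingency table of classification results.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        n = tp + tn + fp + fn
        return {
            "CA": (tp + tn) / n,             # proportion of correctly classified examples
            "sensitivity": tp / (tp + fn),   # proportion of correctly detected positives
            "specificity": tn / (fp + tn),   # proportion of correctly detected negatives
        }

    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
    print(contingency_statistics(y_true, y_pred))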

  17. Other Statistics • From DL Sackett et al.: Evidence-Based Medicine, Churchill-Livingstone, 2000.

  18. ROC Curves • ROC = Receiver Operating Characteristic • Used since the 1970s to evaluate medical prognostic models • Recently popular within ML [a rediscovery?] • [figure: ROC space with sensitivity (TP rate) on the vertical axis and 1-specificity (FP rate) on the horizontal axis, 0% to 100%, showing a very good model and a not-so-good model]

  19. ROC Curve • [figure: ROC curve with points marked for classification thresholds T = 0, T = 0.5, and T = ∞]

  20. ROC Curve (Recipe) • 1) Draw a grid: step 1/(number of negative examples) horizontally, step 1/(number of positive examples) vertically • 2) Sort the results by descending predicted probability • 3) Start at (0, 0) • 4) From the table, select the top row(s) with the highest remaining probability • 5) If the selected rows include p positive and n negative examples, move p grid points up and n to the right • 6) Remove the selected rows • 7) If any rows remain, go to step 4
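
A Python sketch of this recipe; the scores are the model's predicted probabilities for the positive class, tied probabilities are handled as a single step, and the example data is made up:

    def roc_points(y_true, p_pred):
        # Returns the (FP rate, TP rate) points of the ROC curve, following the recipe.
        pos = sum(y_true)                 # vertical grid step is 1 / (number of positives)
        neg = len(y_true) - pos           # horizontal grid step is 1 / (number of negatives)
        ranked = sorted(zip(p_pred, y_true), reverse=True)   # sort by descending probability

        points = [(0.0, 0.0)]             # start at (0, 0)
        i = 0
        while i < len(ranked):
            j = i
            while j < len(ranked) and ranked[j][0] == ranked[i][0]:
                j += 1                    # select all rows tied at the highest remaining probability
            p = sum(label for _, label in ranked[i:j])       # positives among the selected rows
            n = (j - i) - p                                  # negatives among the selected rows
            x, y = points[-1]
            points.append((x + n / neg, y + p / pos))        # move p grid points up and n right
            i = j                                            # "remove" the selected rows
        return points                      # the curve ends at (1, 1)

    labels = [1, 1, 0, 1, 0, 0, 1, 0]                        # made-up test-set classes
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]       # made-up predicted probabilities
    print(roc_points(labels, scores))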

  21.-26. ROC Curve (Recipe) • [figure-only slides stepping through the recipe on an example]

  27. Area Under ROC • [figure: ROC curve, TP rate vs. FP rate, 0% to 100% on both axes] • For every negative example, sum up the number of positive examples with a higher estimate, and normalize this score by the product of the numbers of positive and negative examples • AROC = P[ P+(positive example) > P+(negative example) ]
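
The same definition as a short Python sketch, with ties between a positive and a negative estimate counted as one half (a common convention, not stated on the slide); the data is the same made-up example as before:

    def area_under_roc(y_true, p_pred):
        # AROC = P[ P+(positive example) > P+(negative example) ]
        pos_scores = [p for t, p in zip(y_true, p_pred) if t == 1]
        neg_scores = [p for t, p in zip(y_true, p_pred) if t == 0]
        score = 0.0
        for n in neg_scores:              # for every negative example ...
            for p in pos_scores:          # ... count the positives with a higher estimate
                if p > n:
                    score += 1.0
                elif p == n:
                    score += 0.5          # ties counted as one half
        # normalize by the product of the numbers of positive and negative examples
        return score / (len(pos_scores) * len(neg_scores))

    labels = [1, 1, 0, 1, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
    print("AROC =", area_under_roc(labels, scores))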

  28. Area Under ROC • Is expected to be from 0.5 to 1.0 • The score is not affected by class distributions • Characteristic landmarks • 0.5: random classifier • below 0.7: poor classification • 0.7 to 0.8: OK, reasonable classification • 0.8 to 0.9: here is where very good predictive models start • [figure: ROC curve, TP rate vs. FP rate, 0% to 100% on both axes]

  29. Final Thoughts • Never test on the learning set • Use some sampling procedure for testing • At the end, evaluate both • predictive performance • semantic content (comprehensibility) • Bottom line: good models are those that are useful in practice
