CSI5388: A Critique of our Evaluation Practices in Machine Learning

Presentation Transcript


  1. CSI5388: A Critique of our Evaluation Practices in Machine Learning

  2. Observations • The way evaluation is conducted in Machine Learning/Data Mining has not been a primary concern in the community. • This is very different from the way evaluation is approached in other applied fields such as Economics, Psychology, and Sociology. • In those fields, researchers have been more concerned with the meaning and validity of their results than researchers in ours have been.

  3. The Problem • The objective value of our advances in Machine Learning may be different from what we believe it is. • Our conclusions may be flawed or meaningless. • ML methods may receive undue credit, or may not be sufficiently recognized. • The field may start stagnating. • Practitioners in other fields or potential business partners may dismiss our approaches/results. • We hope that, with better evaluation practices, we can help the field of Machine Learning focus on more effective research and encourage more cross-disciplinary and cross-purpose exchanges.

  4. Organization of the Lecture A review of the shortcomings of current evaluation methods: • Problems with Performance Evaluation • Problems with Confidence Estimation • Problems with Data Sets

  5. Recommended Steps for Proper Evaluation • Identify the “interesting” properties of the classifier. • Choose an evaluation metric accordingly. • Choose a confidence estimation method. • Check that all the assumptions made by the evaluation metric and the confidence estimator are satisfied. • Run the evaluation method with the chosen metric and confidence estimator, and analyze the results. • Interpret the results with respect to the domain.

  6. Commonly Followed Steps of Evaluation • Identify the “interesting” properties of the classifier. • Choose an evaluation metric accordingly. • Choose a confidence estimation method. • Check that all the assumptions made by the evaluation metric and the confidence estimator are satisfied. • Run the evaluation method with the chosen metric and confidence estimator, and analyze the results. • Interpret the results with respect to the domain. These steps are typically considered, but only very lightly.

  7. Overview • What happens when bad choices of performance evaluation metrics are made? (Steps 1 and 2 are considered too lightly.) • Accuracy • Precision/Recall • ROC Analysis • Note: each metric solves the problem of the previous one, but introduces new shortcomings (usually caught by the previous metrics). • What happens when bad choices of confidence estimators are made and the assumptions underlying these confidence estimators are not respected? (Step 3 is considered lightly and Step 4 is disregarded.) • The t-test

  8. A Short Review I: Confusion Matrix / Common Performance Evaluation Metrics • Accuracy = (TP+TN)/(P+N) • Precision = TP/(TP+FP) • Recall/TP Rate = TP/P • FP Rate = FP/N • ROC Analysis moves the threshold between the positive and the negative class from a small FP rate to a large one. It plots the value of the Recall against that of the FP Rate at each FP rate considered. [Figure: a confusion matrix]
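
A minimal Python sketch of these definitions (the counts in the example call are illustrative only, not taken from the slides):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute the slide's metrics from the four confusion-matrix counts."""
    p, n = tp + fn, fp + tn                      # actual positives / negatives
    return {
        "accuracy":  (tp + tn) / (p + n),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall":    tp / p,                     # TP rate
        "fp_rate":   fp / n,
    }

# One threshold = one confusion matrix = one (fp_rate, recall) point on a ROC curve.
print(confusion_metrics(tp=30, fn=20, fp=10, tn=40))
```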

  9. A Short Review II: Confidence Estimation / The t-Test • The most commonly used approach to confidence estimation in Machine Learning is: • to run the algorithm using 10-fold cross-validation and record the accuracy at each fold; • to compute a confidence interval around the average of the differences between these reported accuracies and a given gold standard, using the t-test, i.e., the following formula: δ̄ ± t_{N,9} · s_δ, where • δ̄ is the average difference between the reported accuracies and the given gold standard, • t_{N,9} is a constant chosen according to the degree of confidence desired, • s_δ = sqrt( (1/90) · Σ_{i=1..10} (δ_i − δ̄)² ), where δ_i is the difference between the reported accuracy and the given gold standard at fold i.
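
A small sketch of this procedure, assuming made-up fold accuracies and a made-up gold standard, and using scipy.stats to obtain the constant t_{N,9}:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies from 10-fold cross-validation (not real results),
# compared against a fixed gold-standard accuracy.
fold_acc = np.array([0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.79, 0.82])
gold = 0.75

deltas = fold_acc - gold                                   # delta_i, i = 1..10
d_bar = deltas.mean()                                      # average difference
s_d = np.sqrt(np.sum((deltas - d_bar) ** 2) / (10 * 9))    # the slide's s_delta (1/90 factor)
t_const = stats.t.ppf(0.975, df=9)                         # 95% two-sided, 9 degrees of freedom

print(f"95% confidence interval: {d_bar:.4f} +/- {t_const * s_d:.4f}")
```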

  10. What’s wrong with Accuracy? • Both classifiers obtain 60% accuracy • They exhibit very different behaviours: • On the left: weak positive recognition rate/strong negative recognition rate • On the right: strong positive recognition rate/weak negative recognition rate
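
The slide's confusion matrices are not reproduced here, but the following sketch uses hypothetical counts (50 positives and 50 negatives per classifier) that produce the same effect:

```python
# Two hypothetical confusion matrices with identical 60% accuracy but opposite behaviours.
left  = dict(tp=20, fn=30, fp=10, tn=40)   # weak positive / strong negative recognition
right = dict(tp=40, fn=10, fp=30, tn=20)   # strong positive / weak negative recognition

for name, m in [("left", left), ("right", right)]:
    acc = (m["tp"] + m["tn"]) / sum(m.values())
    tpr = m["tp"] / (m["tp"] + m["fn"])
    tnr = m["tn"] / (m["tn"] + m["fp"])
    print(f"{name}: accuracy={acc:.0%}  TP rate={tpr:.0%}  TN rate={tnr:.0%}")
```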

  11. What’s wrong with Precision/Recall? • Both classifiers obtain the same precision and recall values of 66.7% and 40% • They exhibit very different behaviours: • Same positive recognition rate • Extremely different negative recognition rate: strong on the left / nil on the right • Note: Accuracy has no problem catching this!
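
Again with hypothetical counts (not the slide's figures), a sketch of how two classifiers can share precision and recall while differing completely on the negatives:

```python
# Precision and recall never look at TN, so changing TN changes neither of them.
left  = dict(tp=20, fn=30, fp=10, tn=90)   # strong negative recognition
right = dict(tp=20, fn=30, fp=10, tn=0)    # no negative recognition at all

for name, m in [("left", left), ("right", right)]:
    prec = m["tp"] / (m["tp"] + m["fp"])
    rec  = m["tp"] / (m["tp"] + m["fn"])
    acc  = (m["tp"] + m["tn"]) / sum(m.values())
    print(f"{name}: precision={prec:.1%}  recall={rec:.1%}  accuracy={acc:.1%}")
# Both lines report precision=66.7% and recall=40.0%; only accuracy tells them apart.
```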

  12. What’s wrong with ROC Analysis? (We consider single points in ROC space, not the entire ROC curve.) • ROC Analysis and Precision yield contradictory results. • In terms of ROC Analysis, the classifier on the right is a significantly better choice than the one on the left. [The point representing the right classifier is on the same vertical line but 22.25% higher than the point representing the left classifier.] • Yet, the classifier on the right has ridiculously low precision (33.3%), while the classifier on the left has excellent precision (95.24%).
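
The slide's exact figures are not reproduced here. One way such a contradiction can arise (purely illustrative numbers, assuming the right-hand classifier is evaluated on a domain where positives are rare) is that a ROC point only uses rates within each class and ignores class proportions, while precision does not:

```python
# Hypothetical setup: both classifiers sit at FP rate = 0.02, but the right-hand one
# faces a domain with very few positives, which ROC space cannot see.
left  = dict(tp=600, fn=400, fp=20, tn=980)    # P=1000, N=1000
right = dict(tp=16,  fn=4,   fp=20, tn=980)    # P=20,   N=1000

for name, m in [("left", left), ("right", right)]:
    tpr  = m["tp"] / (m["tp"] + m["fn"])       # recall, the y-axis of ROC space
    fpr  = m["fp"] / (m["fp"] + m["tn"])       # the x-axis of ROC space
    prec = m["tp"] / (m["tp"] + m["fp"])
    print(f"{name}: ROC point=({fpr:.3f}, {tpr:.2f})  precision={prec:.1%}")
# "right" dominates in ROC space (same FP rate, higher recall) yet has far lower precision.
```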

  13. What’s wrong with the t-test? • Classifiers 1 and 2 yield the same mean and the same confidence interval. • Yet, Classifier 1 is relatively stable, while Classifier 2 is not. • Problem: the t-test assumes a normal distribution. The difference in accuracy between Classifier 2 and the gold standard is not normally distributed.
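
A sketch with made-up fold-wise differences, chosen so that both classifiers have mean 1.0 and nearly the same standard deviation, and hence nearly the same t-interval, even though only the first is stable:

```python
import numpy as np
from scipy import stats

# Illustrative fold-wise differences from the gold standard (not the slide's data).
clf1 = np.array([0.905, 1.095] * 5)          # small, symmetric fluctuations around 1.0
clf2 = np.array([0.968] * 9 + [1.288])       # flat, then one large jump: clearly non-normal

for name, d in [("classifier 1", clf1), ("classifier 2", clf2)]:
    d_bar = d.mean()
    s_d = d.std(ddof=1) / np.sqrt(len(d))    # standard error of the mean
    half = stats.t.ppf(0.975, df=len(d) - 1) * s_d
    print(f"{name}: {d_bar:.3f} +/- {half:.3f}")
# The two intervals are almost identical, yet clf2's differences violate the
# normality assumption behind the t-test.
```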

  14. Discussion • There is nothing intrinsically wrong with any of the performance evaluation measures or confidence tests discussed. It’s all a matter of thinking about which one to use when, and what the results mean (both in terms of added value and limitations). • A simple conceptualization of the problem with current evaluation practices: evaluation metrics and confidence measures summarize the results → ML practitioners must understand the terms of these summarizations and verify that their assumptions are satisfied. • In certain cases, however, it is necessary to look further and, possibly, borrow practices from other disciplines. In yet other cases, it pays to devise our own methods. Both instances are discussed in what follows.
