
Computational BioMedical Informatics


Presentation Transcript


  1. Computational BioMedical Informatics SCE 5095: Special Topics Course Instructor: Jinbo Bi Computer Science and Engineering Dept.

  2. Course Information • Instructor: Dr. Jinbo Bi • Office: ITEB 233 • Phone: 860-486-1458 • Email: jinbo@engr.uconn.edu • Web: http://www.engr.uconn.edu/~jinbo/ • Time: Mon. / Wed. 2:00pm – 3:15pm • Location: CAST 204 • Office hours: Mon. 3:30–4:30pm • HuskyCT: http://learn.uconn.edu • Login with your NetID and password

  3. Review of last chapter • General introduction to the topics in medical informatics and the data mining techniques involved • Review of some basics of probability and statistics • More slides on probability and linear algebra uploaded to HuskyCT • This class, we start to discuss supervised learning: classification and regression

  4. Regression and classification • Both regression and classification problems are typically supervised learning problems • The main property of supervised learning • Each training example contains the input variables and the corresponding target label • The goal is to find a good mapping from the input variables to the target variable

  5. Classification: Definition • Given a collection of examples (training set) • Each example contains a set of variables (features) and a target variable: the class • Find a model that predicts the class attribute as a function of the values of the other variables • Goal: previously unseen examples should be assigned a class as accurately as possible • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.
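The split-train-evaluate workflow above can be sketched in a few lines of Python; the helper names (`train_test_split`, `accuracy`) and the toy threshold model are illustrative, not from the slides:

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle the labeled examples and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

def accuracy(model, test_set):
    """Fraction of test examples whose predicted class matches the true class."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

# Toy data: (feature, class) pairs; a trivial threshold "model" for illustration.
data = [(x, int(x > 5)) for x in range(10)]
train, test = train_test_split(data)
print(accuracy(lambda x: int(x > 5), test))  # perfect model -> 1.0
```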

  6. Classification Application 1 • Fraud detection – goal: predict fraudulent cases in credit card transactions • Past transaction records (with categorical and continuous features) are labeled and form the training set • A classifier is learned from the training set • The model is then applied to current data, the test set, to predict the class

  7. Classification: Application 2 • Handwritten Digit Recognition • Goal: Identify the digit of a handwritten number • Approach: • Align all images to derive the features • Model the class (identity) based on these features

  8. Illustrating Classification Task

  9. Classification algorithms • K-Nearest-Neighbor classifiers • Naïve Bayes classifier • Neural networks • Linear Discriminant Analysis (LDA) • Support Vector Machines (SVM) • Decision trees • Logistic regression • Graphical models

  10. Regression: Definition • Goal: predict the value of one or more continuous target attributes given the values of the input attributes • The only difference between classification and regression lies in the target attribute • Classification: discrete or categorical target • Regression: continuous target • Extensively studied in statistics and in the neural network field

  11. Regression Application 1 • Continuous target – goal: predict the possible loss from a customer • Past transaction records are labeled and form the training set; the learned regressor is applied to current data (the test set) • Categorical and continuous input attributes

  Tid  Refund  Marital Status  Taxable Income  Loss
  1    Yes     Single          125K            100
  2    No      Married         100K            120
  3    No      Single          70K             -200
  4    Yes     Married         120K            -300
  5    No      Divorced        95K             -400
  6    No      Married         60K             -500
  7    Yes     Divorced        220K            -190
  8    No      Single          85K             300
  9    No      Married         75K             -240
  10   No      Single          90K             90

  12. Regression applications • Examples: • Predicting sales amounts of new product based on advertising expenditure. • Predicting wind velocities as a function of temperature, humidity, air pressure, etc. • Time series prediction of stock market indices.

  13. Regression algorithms • Least squares methods • Regularized linear regression (ridge regression) • Neural networks • Support vector machines (SVM) • Bayesian linear regression

  14. Practical issues in training • Underfitting • Overfitting • Before introducing these important concepts, let us study a simple regression algorithm – linear regression

  15. Least squares • We wish to use some real-valued input variables x to predict the value of a target y • We collect training data of pairs (xi, yi), i = 1, …, N • Suppose we have a model f that maps each example x to a predicted value y’ = f(x) • Sum-of-squares function: E = Σi (yi – f(xi))² • This is the sum of the squares of the deviations between the observed target values y and the predicted values y’

  16. Least squares • Find a function f such that the sum of squares is minimized • For example, suppose the function takes the linear form f(x) = wᵀx • Least squares with a linear function of the parameters w is called “linear regression”

  17. Linear regression • Linear regression has a closed-form solution for w • The minimum is attained at the zero derivative: for data matrix X and target vector y, this gives w = (XᵀX)⁻¹Xᵀy
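As a quick numeric check of the closed-form solution, a minimal NumPy sketch (toy data assumed) solves the normal equations XᵀXw = Xᵀy directly:

```python
import numpy as np

# Toy training data: y = 2*x1 - 1*x2 with no noise, so least squares
# should recover w exactly.  X holds one example per row.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 3.0]])
y = X @ np.array([2.0, -1.0])

# Setting the derivative of the sum of squares to zero yields the normal
# equations  X^T X w = X^T y; solving them gives the closed-form w.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # ~ [ 2. -1.]
```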

  18. Polynomial Curve Fitting • x is evenly distributed over [0, 1] • y = f(x) + random error • y = sin(2πx) + ε, ε ~ N(0, σ)

  19. Polynomial Curve Fitting

  20. Sum-of-Squares Error Function

  21. 0th Order Polynomial

  22. 1st Order Polynomial

  23. 3rd Order Polynomial

  24. 9th Order Polynomial

  25. Over-fitting • Root-Mean-Square (RMS) error: E_RMS = √( (1/N) Σi (f(xi) – yi)² ), the square root of the mean squared deviation, which measures error on the same scale as the target
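The over-fitting behaviour on the polynomial-fitting slides can be reproduced with a short NumPy sketch; the sample sizes and noise level here are assumptions, not the slides' exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, sigma=0.3):
    """Sample n points of y = sin(2*pi*x) + Gaussian noise, x evenly spaced in [0, 1]."""
    x = np.linspace(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, n)

def rms_error(w, x, y):
    """E_RMS: root of the mean squared deviation of the polynomial fit from y."""
    return np.sqrt(np.mean((np.polyval(w, x) - y) ** 2))

x_train, y_train = make_data(10)
x_test, y_test = make_data(100)

# A 9th-order polynomial interpolates the 10 training points (near-zero
# training error) but its test error blows up: over-fitting.
for degree in (0, 1, 3, 9):
    w = np.polyfit(x_train, y_train, degree)
    print(degree, rms_error(w, x_train, y_train), rms_error(w, x_test, y_test))
```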

  26. Polynomial Coefficients

  27. Data Set Size: 9th Order Polynomial

  28. Data Set Size: 9th Order Polynomial

  29. Regularization • Penalize large coefficient values by minimizing Σi (yi – f(xi))² + λ‖w‖² • This penalized least-squares problem is ridge regression
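A minimal sketch of the ridge-regression closed form w = (XᵀX + λI)⁻¹Xᵀy, with made-up, nearly collinear toy data to show the shrinkage effect:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimize ||Xw - y||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Nearly collinear columns make plain least squares (lam = 0) produce large,
# unstable coefficients; the penalty shrinks them toward zero.
X = np.array([[1.0, 1.001],
              [2.0, 1.999],
              [3.0, 3.002]])
y = np.array([1.0, 2.003, 2.998])

w_ols = ridge_fit(X, y, 0.0)
w_ridge = ridge_fit(X, y, 0.1)
print(np.abs(w_ols).max(), np.abs(w_ridge).max())
```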

  30. Regularization:

  31. Regularization:

  32. Regularization: E_RMS vs. ln λ

  33. Polynomial Coefficients

  34. Classification • Underfitting or Overfitting can also happen in classification approaches • We will illustrate these practical issues on classification problem • Before the illustration, we introduce a simple classification technique – K-nearest neighbor method

  35. K-nearest neighbor (K-NN) • K-NN is one of the simplest machine learning algorithms • K-NN classifies test examples based on the closest training examples in the feature space • An example is classified by a majority vote of its neighbors • k is a positive integer, typically small. If k = 1, the example is simply assigned to the class of its nearest neighbor.
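The majority-vote rule can be written directly in plain Python; the function name and toy points below are illustrative:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.
    `train` is a list of (point, label) pairs; points are coordinate tuples."""
    neighbors = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), 'A'), ((0, 1), 'A'), ((1, 0), 'A'),
         ((5, 5), 'B'), ((5, 6), 'B'), ((6, 5), 'B')]
print(knn_classify(train, (1, 1), k=3))   # 'A'
print(knn_classify(train, (5, 5), k=1))   # 'B'
```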

  36. K-NN • (Figures: decision regions with K = 1 and K = 3)

  37. K-NN on real problem data • Oil data set • K acts as a smoother; choosing K is a model selection problem • For N → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error (obtained from the true conditional class distributions).

  38. Limitation of K-NN • K-NN is a nonparametric model (no particular functional form is fitted) • Nonparametric models require storing and computing with the entire data set • Parametric models, once fitted, are much more efficient in terms of storage and computation

  39. Probabilistic interpretation of K-NN • Given a data set with N points in total, of which Nk come from class Ck, draw a sphere of volume V around x that contains exactly K points, Kk of them from class Ck • Then p(x|Ck) = Kk / (Nk V), and correspondingly p(x) = K / (N V) • Since p(Ck) = Nk / N, Bayes’ theorem gives p(Ck|x) = p(x|Ck) p(Ck) / p(x) = Kk / K

  40. Underfit and Overfit (Classification) • 500 circular and 500 triangular data points • Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1 • Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5

  41. Underfit and Overfit (Classification) • 500 circular and 500 triangular data points • Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1 • Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5
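The data set on these slides can be regenerated with a short sketch (the sampling square [-1.5, 1.5]² is an assumption; the slides specify only the radius conditions and the 500/500 class sizes):

```python
import math
import random

rng = random.Random(0)

def sample_point():
    """Draw a point uniformly from [-1.5, 1.5]^2 and label it as on the slide:
    circular if 0.5 <= r <= 1, triangular otherwise (r > 1 or r < 0.5)."""
    x1, x2 = rng.uniform(-1.5, 1.5), rng.uniform(-1.5, 1.5)
    r = math.hypot(x1, x2)
    label = 'circle' if 0.5 <= r <= 1.0 else 'triangle'
    return (x1, x2), label

# Rejection-sample until each class has exactly 500 points.
points = {'circle': [], 'triangle': []}
while min(len(v) for v in points.values()) < 500:
    p, label = sample_point()
    if len(points[label]) < 500:
        points[label].append(p)
print(len(points['circle']), len(points['triangle']))  # 500 500
```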

  42. Underfitting and Overfitting • (Figure: training and test error vs. number of iterations, showing the underfitting and overfitting regimes) • Underfitting: when the model is too simple, both training and test errors are large

  43. Overfitting due to Noise • The decision boundary is distorted by a noise point

  44. Overfitting due to Insufficient Examples • Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region • An insufficient number of training records in the region causes the neural net to predict the test examples using other training records that are irrelevant to the classification task

  45. Notes on Overfitting • Overfitting results in classifiers (e.g., a neural net or a support vector machine) that are more complex than necessary • The training error no longer provides a good estimate of how well the classifier will perform on previously unseen records • Need new ways of estimating error

  46. Occam’s Razor • Given two models with similar generalization errors, one should prefer the simpler model over the more complex one • For complex models, there is a greater chance that they were fitted accidentally by errors in the data • Therefore, one should take model complexity into account when evaluating a model

  47. How to Address Overfitting • Minimizing training error no longer guarantees a good model (a classifier or a regressor) • Need a better estimate of the error on the true population – the generalization error P_population( f(x) ≠ y ) • In practice, design a procedure that gives a better estimate of the error than the training error does • In theoretical analysis, derive an analytical bound on the generalization error, or use the Bayesian formula

  48. Model Evaluation (pp. 295-304 of the data mining text) • Metrics for Performance Evaluation • How to evaluate the performance of a model? • Methods for Performance Evaluation • How to obtain reliable estimates? • Methods for Model Comparison • How to compare the relative performance among competing models?

  49. Model Evaluation • Metrics for Performance Evaluation • How to evaluate the performance of a model? • Methods for Performance Evaluation • How to obtain reliable estimates? • Methods for Model Comparison • How to compare the relative performance among competing models?

  50. Metrics for Performance Evaluation • Regression • Sum of squared errors • Sum of absolute deviations • Exponential function of the deviation
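These regression metrics are easy to state in code; the exponential-loss form below is one assumed variant, since the slide does not give its exact formula:

```python
import math

def sse(y_true, y_pred):
    """Sum of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def sad(y_true, y_pred):
    """Sum of absolute deviations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred))

def exp_loss(y_true, y_pred, scale=1.0):
    """One hypothetical 'exponential function of the deviation':
    grows much faster than SSE for large deviations."""
    return sum(math.exp(abs(t - p) / scale) - 1 for t, p in zip(y_true, y_pred))

y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
print(sse(y_true, y_pred))  # 0.25 + 0 + 1.0 = 1.25
print(sad(y_true, y_pred))  # 0.5 + 0 + 1.0 = 1.5
```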
