
Computational BioMedical Informatics


Presentation Transcript


  1. Computational BioMedical Informatics SCE 5095: Special Topics Course Instructor: Jinbo Bi Computer Science and Engineering Dept.

  2. Course Information • Instructor: Dr. Jinbo Bi • Office: ITEB 233 • Phone: 860-486-1458 • Email: jinbo@engr.uconn.edu • Web: http://www.engr.uconn.edu/~jinbo/ • Time: Mon. / Wed. 2:00pm – 3:15pm • Location: CAST 204 • Office hours: Mon. 3:30–4:30pm • HuskyCT: http://learn.uconn.edu • Login with your NetID and password

  3. Review of last chapter • General introduction to the topics in medical informatics and the data mining techniques involved • Review of some basics of probability and statistics • More slides on probability and linear algebra uploaded to HuskyCT • This class, we start to discuss supervised learning: classification and regression

  4. Regression and classification • Both regression and classification problems are typically supervised learning problems • The main property of supervised learning • Each training example contains the input variables and the corresponding target label • The goal is to find a good mapping from the input variables to the target variable

  5. Classification: Definition • Given a collection of examples (training set) • Each example contains a set of variables (features) and a target variable: the class • Find a model that predicts the class attribute as a function of the values of the other variables • Goal: previously unseen examples should be assigned a class as accurately as possible • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.
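The split-train-evaluate workflow above can be sketched in a few lines of Python; the helper names (`train_test_split`, `accuracy`) and the toy threshold model are illustrative, not from the slides:

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle the labeled examples and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

def accuracy(model, test_set):
    """Fraction of test examples whose predicted class matches the true class."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

# Toy data: (feature, class) pairs; a trivial threshold "model" for illustration.
data = [(x, int(x > 5)) for x in range(10)]
train, test = train_test_split(data)
print(accuracy(lambda x: int(x > 5), test))  # perfect model -> 1.0
```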

  6. Classification Application 1 • Fraud detection – goal: predict fraudulent cases in credit card transactions • Past transaction records (with categorical and continuous features) are labeled and form the training set • A classifier is learned from the training set • The model is then applied to current data, the test set, to predict the class

  7. Classification: Application 2 • Handwritten Digit Recognition • Goal: Identify the digit of a handwritten number • Approach: • Align all images to derive the features • Model the class (identity) based on these features

  8. Illustrating Classification Task

  9. Classification algorithms • K-Nearest-Neighbor classifiers • Naïve Bayes classifier • Neural networks • Linear Discriminant Analysis (LDA) • Support Vector Machines (SVM) • Decision trees • Logistic regression • Graphical models

  10. Regression: Definition • Goal: predict the value of one or more continuous target attributes given the values of the input attributes • The only difference between classification and regression lies in the target attribute • Classification: discrete or categorical target • Regression: continuous target • Extensively studied in statistics and in the neural network field

  11. Regression Application 1 • Continuous target – goal: predict the possible loss from a customer • Past transaction records are labeled and form the training set; the learned regressor is applied to current data (the test set) • Categorical and continuous input attributes

  Tid  Refund  Marital Status  Taxable Income  Loss
  1    Yes     Single          125K            100
  2    No      Married         100K            120
  3    No      Single          70K             -200
  4    Yes     Married         120K            -300
  5    No      Divorced        95K             -400
  6    No      Married         60K             -500
  7    Yes     Divorced        220K            -190
  8    No      Single          85K             300
  9    No      Married         75K             -240
  10   No      Single          90K             90

  12. Regression applications • Examples: • Predicting sales amounts of new product based on advertising expenditure. • Predicting wind velocities as a function of temperature, humidity, air pressure, etc. • Time series prediction of stock market indices.

  13. Regression algorithms • Least squares methods • Regularized linear regression (ridge regression) • Neural networks • Support vector machines (SVM) • Bayesian linear regression

  14. Practical issues in training • Underfitting • Overfitting • Before introducing these important concepts, let us study a simple regression algorithm – linear regression

  15. Least squares • We wish to use some real-valued input variables x to predict the value of a target y • We collect training data of pairs (xi, yi), i = 1, …, N • Suppose we have a model f that maps each example x to a predicted value y’ = f(x) • Sum-of-squares function: E = Σi (yi – f(xi))² • This is the sum of the squares of the deviations between the observed target values y and the predicted values y’

  16. Least squares • Find a function f such that the sum of squares is minimized • For example, suppose the function takes the linear form f(x) = wᵀx • Least squares with a linear function of the parameters w is called “linear regression”

  17. Linear regression • Linear regression has a closed-form solution for w • The minimum is attained at the zero derivative: for data matrix X and target vector y, this gives w = (XᵀX)⁻¹Xᵀy
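As a quick numeric check of the closed-form solution, a minimal NumPy sketch (toy data assumed) solves the normal equations XᵀXw = Xᵀy directly:

```python
import numpy as np

# Toy training data: y = 2*x1 - 1*x2 with no noise, so least squares
# should recover w exactly.  X holds one example per row.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 3.0]])
y = X @ np.array([2.0, -1.0])

# Setting the derivative of the sum of squares to zero yields the normal
# equations  X^T X w = X^T y; solving them gives the closed-form w.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # ~ [ 2. -1.]
```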

  18. Polynomial Curve Fitting • x is evenly distributed over [0, 1] • y = f(x) + random error • y = sin(2πx) + ε, ε ~ N(0, σ)

  19. Polynomial Curve Fitting

  20. Sum-of-Squares Error Function

  21. 0th Order Polynomial

  22. 1st Order Polynomial

  23. 3rd Order Polynomial

  24. 9th Order Polynomial

  25. Over-fitting • Root-Mean-Square (RMS) error: E_RMS = √( (1/N) Σi (f(xi) – yi)² ), the square root of the mean squared deviation, which measures error on the same scale as the target
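The over-fitting behaviour on the polynomial-fitting slides can be reproduced with a short NumPy sketch; the sample sizes and noise level here are assumptions, not the slides' exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, sigma=0.3):
    """Sample n points of y = sin(2*pi*x) + Gaussian noise, x evenly spaced in [0, 1]."""
    x = np.linspace(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, n)

def rms_error(w, x, y):
    """E_RMS: root of the mean squared deviation of the polynomial fit from y."""
    return np.sqrt(np.mean((np.polyval(w, x) - y) ** 2))

x_train, y_train = make_data(10)
x_test, y_test = make_data(100)

# A 9th-order polynomial interpolates the 10 training points (near-zero
# training error) but its test error blows up: over-fitting.
for degree in (0, 1, 3, 9):
    w = np.polyfit(x_train, y_train, degree)
    print(degree, rms_error(w, x_train, y_train), rms_error(w, x_test, y_test))
```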

  26. Polynomial Coefficients

  27. Data Set Size: 9th Order Polynomial

  28. Data Set Size: 9th Order Polynomial

  29. Regularization • Penalize large coefficient values by minimizing Σi (yi – f(xi))² + λ‖w‖² • This penalized least-squares problem is ridge regression
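A minimal sketch of the ridge-regression closed form w = (XᵀX + λI)⁻¹Xᵀy, with made-up, nearly collinear toy data to show the shrinkage effect:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimize ||Xw - y||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Nearly collinear columns make plain least squares (lam = 0) produce large,
# unstable coefficients; the penalty shrinks them toward zero.
X = np.array([[1.0, 1.001],
              [2.0, 1.999],
              [3.0, 3.002]])
y = np.array([1.0, 2.003, 2.998])

w_ols = ridge_fit(X, y, 0.0)
w_ridge = ridge_fit(X, y, 0.1)
print(np.abs(w_ols).max(), np.abs(w_ridge).max())
```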

  30. Regularization:

  31. Regularization:

  32. Regularization: E_RMS vs. ln λ

  33. Polynomial Coefficients

  34. Classification • Underfitting or Overfitting can also happen in classification approaches • We will illustrate these practical issues on classification problem • Before the illustration, we introduce a simple classification technique – K-nearest neighbor method

  35. K-nearest neighbor (K-NN) • K-NN is one of the simplest machine learning algorithms • K-NN classifies test examples based on the closest training examples in the feature space • An example is classified by a majority vote of its neighbors • k is a positive integer, typically small. If k = 1, the example is simply assigned to the class of its nearest neighbor.
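The majority-vote rule can be written directly in plain Python; the function name and toy points below are illustrative:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.
    `train` is a list of (point, label) pairs; points are coordinate tuples."""
    neighbors = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), 'A'), ((0, 1), 'A'), ((1, 0), 'A'),
         ((5, 5), 'B'), ((5, 6), 'B'), ((6, 5), 'B')]
print(knn_classify(train, (1, 1), k=3))   # 'A'
print(knn_classify(train, (5, 5), k=1))   # 'B'
```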

  36. K-NN • (Figures: decision regions with K = 1 and K = 3)

  37. K-NN on real problem data • Oil data set • K acts as a smoother; choosing K is a model selection problem • For N → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error (obtained from the true conditional class distributions).

  38. Limitation of K-NN • K-NN is a nonparametric model (no particular functional form is fitted) • Nonparametric models require storing and computing with the entire data set • Parametric models, once fitted, are much more efficient in terms of storage and computation

  39. Probabilistic interpretation of K-NN • Given a data set with N points in total, of which Nk come from class Ck, draw a sphere of volume V around x that contains exactly K points, Kk of them from class Ck • Then p(x|Ck) = Kk / (Nk V), and correspondingly p(x) = K / (N V) • Since p(Ck) = Nk / N, Bayes’ theorem gives p(Ck|x) = p(x|Ck) p(Ck) / p(x) = Kk / K

  40. Underfit and Overfit (Classification) • 500 circular and 500 triangular data points • Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1 • Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5

  41. Underfit and Overfit (Classification) • 500 circular and 500 triangular data points • Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1 • Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5
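The data set on these slides can be regenerated with a short sketch (the sampling square [-1.5, 1.5]² is an assumption; the slides specify only the radius conditions and the 500/500 class sizes):

```python
import math
import random

rng = random.Random(0)

def sample_point():
    """Draw a point uniformly from [-1.5, 1.5]^2 and label it as on the slide:
    circular if 0.5 <= r <= 1, triangular otherwise (r > 1 or r < 0.5)."""
    x1, x2 = rng.uniform(-1.5, 1.5), rng.uniform(-1.5, 1.5)
    r = math.hypot(x1, x2)
    label = 'circle' if 0.5 <= r <= 1.0 else 'triangle'
    return (x1, x2), label

# Rejection-sample until each class has exactly 500 points.
points = {'circle': [], 'triangle': []}
while min(len(v) for v in points.values()) < 500:
    p, label = sample_point()
    if len(points[label]) < 500:
        points[label].append(p)
print(len(points['circle']), len(points['triangle']))  # 500 500
```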

  42. Underfitting and Overfitting • (Figure: training and test error vs. number of iterations, showing the underfitting and overfitting regimes) • Underfitting: when the model is too simple, both training and test errors are large

  43. Overfitting due to Noise • The decision boundary is distorted by a noise point

  44. Overfitting due to Insufficient Examples • Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region • An insufficient number of training records in the region causes the neural net to predict the test examples using other training records that are irrelevant to the classification task

  45. Notes on Overfitting • Overfitting results in classifiers (e.g., a neural net or a support vector machine) that are more complex than necessary • The training error no longer provides a good estimate of how well the classifier will perform on previously unseen records • Need new ways of estimating error

  46. Occam’s Razor • Given two models with similar generalization errors, one should prefer the simpler model over the more complex one • For complex models, there is a greater chance that they were fitted accidentally by errors in the data • Therefore, one should take model complexity into account when evaluating a model

  47. How to Address Overfitting • Minimizing training error no longer guarantees a good model (a classifier or a regressor) • Need a better estimate of the error on the true population – the generalization error P_population( f(x) ≠ y ) • In practice, design a procedure that gives a better estimate of the error than the training error does • In theoretical analysis, derive an analytical bound on the generalization error, or use the Bayesian formula

  48. Model Evaluation (pp. 295-304 of the data mining text) • Metrics for Performance Evaluation • How to evaluate the performance of a model? • Methods for Performance Evaluation • How to obtain reliable estimates? • Methods for Model Comparison • How to compare the relative performance among competing models?

  49. Model Evaluation • Metrics for Performance Evaluation • How to evaluate the performance of a model? • Methods for Performance Evaluation • How to obtain reliable estimates? • Methods for Model Comparison • How to compare the relative performance among competing models?

  50. Metrics for Performance Evaluation • Regression • Sum of squared errors • Sum of absolute deviations • Exponential function of the deviation
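These regression metrics are easy to state in code; the exponential-loss form below is one assumed variant, since the slide does not give its exact formula:

```python
import math

def sse(y_true, y_pred):
    """Sum of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def sad(y_true, y_pred):
    """Sum of absolute deviations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred))

def exp_loss(y_true, y_pred, scale=1.0):
    """One hypothetical 'exponential function of the deviation':
    grows much faster than SSE for large deviations."""
    return sum(math.exp(abs(t - p) / scale) - 1 for t, p in zip(y_true, y_pred))

y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
print(sse(y_true, y_pred))  # 0.25 + 0 + 1.0 = 1.25
print(sad(y_true, y_pred))  # 0.5 + 0 + 1.0 = 1.5
```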
