
Chapter 4 Regression Topics


Presentation Transcript


  1. Chapter 4 Regression Topics
     Credits: Hastie, Tibshirani & Friedman, Chapter 3; Padhraic Smyth's notes

  2. Regression Review
     • Linear regression models a numeric outcome as a linear function of several predictors.
     • It is the king of all statistical and data mining models:
       • ease of interpretation
       • mathematically concise
       • tends to perform well for prediction, even under violations of assumptions

  3. Regression Review
     • We will focus on regression as a predictive task.
     • Characteristics:
       • numeric response - ideally real valued
       • numeric predictors - but not necessarily
     • Goals of regression analysis for data mining:
       • explanation - which variables are most important and which are not needed
       • prediction
       • inference (significance and C.I. of predictors) - not a focus
       • interactions of variables

  4. Examples of Regression Tasks
     • credit scoring
     • gas mileage of cars
     • how much money will a customer spend?
     • what factors are important for high cholesterol?
     • predicting yields of a crop
     • what strategies result in high scores in baseball (or cricket!)

  5. Example: Prostate Cancer
     • Data set 'prostate.txt', from a study of prostate-specific antigen
     • log cancer volume (lcavol) is the response variable
     • predictors:
       • prostate weight (weight)
       • age
       • benign prostatic hyperplasia (lbph)
       • capsular penetration (lcp)
       • Gleason score (gleason)
       • percent Gleason of 4 or 5 (pgg45)

  6. [Figure slide]

  7. [Figure slide]

  8. Prostate Data
     • using summary() and cor() to look at the data, as in the sketch below
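A minimal R sketch of this first look, assuming prostate.txt is a whitespace-delimited file with a header row (the file format is an assumption):

    prostate <- read.table("prostate.txt", header = TRUE)
    summary(prostate)   # per-variable summaries
    cor(prostate)       # pairwise correlations (assumes all columns are numeric)
    pairs(prostate)     # scatterplot matrix, as in the figure slides above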

  9. Linear Regression Model
     • Basic model: $y = a_0 + a_1 x_1 + \cdots + a_p x_p + \epsilon$
     • You are not modelling y itself, but the mean of y for a given x: $E[y \mid x] = a_0 + a_1 x_1 + \cdots + a_p x_p$
     • Simple regression - one x:
       • easy to describe, good for mathematics, but not used often in data mining
     • Multiple regression - many x:
       • the response surface is a plane ... harder to conceptualize

 10. Linear Regression Model
     • Assumptions:
       • linearity
       • constant variance
       • normality of errors: residuals ~ Normal(0, $\sigma^2$)
     • Assumptions must be checked,
       • but if inference is not the goal, you can accept some deviation from them (don't tell the statisticians I said that!)
     • Multicollinearity is also an issue: it creates unstable estimates.

 11. Fitting the Model
     • We can look at regression as a matrix problem.
     • We want a score function to minimize over the parameter vector a:
       $S(a) = (y - Xa)^T (y - Xa)$
       which is minimized by
       $\hat{a} = (X^T X)^{-1} X^T y$
     • Prediction then follows easily: $\hat{y} = X\hat{a}$ (see the sketch below)
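To make the matrix algebra concrete, here is a hedged R sketch on simulated data (the sample size, coefficients, and variable names are illustrative, not from the course):

    set.seed(1)
    n <- 50
    X <- cbind(1, runif(n), runif(n))          # design matrix with intercept column
    y <- drop(X %*% c(2, 3, -1) + rnorm(n))    # true a = (2, 3, -1) plus noise
    a_hat <- solve(t(X) %*% X, t(X) %*% y)     # a-hat = (X'X)^{-1} X'y
    y_hat <- X %*% a_hat                       # prediction follows easily
    coef(lm(y ~ X[, -1]))                      # lm() agrees (it uses a stabler QR fit)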

 12. Comments on Multivariate Linear Regression
     • Prediction is a linear function of the parameters.
     • Model structure is simple:
       • a (p-1)-dimensional hyperplane in p dimensions
     • Linear weights => interpretability.
     • Useful as a baseline model to compare more complex models to.

 13. Limitations of Linear Regression
     • The true relationship of X and Y might be non-linear.
       • This suggests generalizations to non-linear models.
     • Correlation/collinearity among the X variables:
       • can cause numerical instability
       • problems in interpretability (identifiability)
     • It includes all variables in the model...
       • but what if p = 100 and only 3 variables are related to Y?

 14. Regression Fit to the Prostate Data [figure slide]
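A hedged sketch of the fit shown here, assuming (as slide 5 states) that lcavol is the response and the remaining columns of prostate.txt are the predictors:

    fit <- lm(lcavol ~ ., data = prostate)   # all remaining columns as predictors
    summary(fit)                             # coefficients and overall fit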

 15. Diagnostic Plots

 16. [Figure slide]

 17. Checking Assumptions
     • linearity: look to see if transformations make relationships 'more' linear
     • normality of errors: diagnostic plots help show patterns of 'opening' variance or other strange behavior
     • influence: highly 'influential' cases have undue impact on the analysis

 18. Simplest Way to Check Assumptions: the Residuals vs. Fits Plot
     • A scatter plot with residuals on the y axis and fitted values on the x axis.
     • Helps to identify non-linearity, outliers, and non-constant variance (see the sketch below).
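A minimal sketch of this plot, for the fitted model `fit` from the previous sketch:

    plot(fitted(fit), resid(fit),
         xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)    # reference line at zero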

 19. A Well-behaved Residuals vs. Fits Plot
     • The residuals "bounce randomly" around the 0 line (linearity is reasonable).
     • No one residual "stands out" from the basic random pattern of residuals (no outliers).
     • The residuals roughly form a "horizontal band" around the 0 line (constant variance).

 20. Detecting Violations of Linearity

 21. How a Non-linear Function Shows Up on a Residuals vs. Fits Plot
     • The residuals depart from 0 in some systematic manner:
       • for example, positive for small x values, negative for medium x values, and positive again for large x values.

 22. Corrections for Linearity Violations
     • Finding the right correction is often not obvious.
     • We need a curvilinear relationship between x and y.
     • Problem: there are many different possible curvilinear relationships:
       • polynomial, exponential, logarithmic, sinusoidal, ...
     • Approaches: trial and error, gut feeling, experience, domain knowledge, etc. (see the sketch below)
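A sketch of the trial-and-error approach on simulated data (the variables and the simulated "truth" are illustrative, not course data):

    set.seed(2)
    x <- runif(100, 1, 10)
    y <- 2 + log(x) + rnorm(100, sd = 0.2)   # the true relationship is logarithmic
    fit_lin  <- lm(y ~ x)                    # residuals will show curvature
    fit_quad <- lm(y ~ x + I(x^2))           # polynomial candidate
    fit_log  <- lm(y ~ log(x))               # logarithmic candidate (needs x > 0)
    anova(fit_lin, fit_quad)                 # does the extra curvature term help?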

 23. Detecting Non-constant Variance

 24. How Non-constant Error Variance Shows Up on a Residuals vs. Fits Plot
     • The plot has a "fanning" effect:
       • residuals are close to 0 for small x values and more spread out for large x values.
     • Or the spread of the residuals can vary in some more complex fashion.

 25. Corrections for Non-constant Variance
     • Transformation of the dependent variable y.
     • Problem: which transformation to use?
       • Logarithmic and square-root transformations often alleviate the "fanning" effect:
         they compress large values far more than small ones.
     • To find the right transformation: trial and error, gut feeling, experience, domain knowledge, etc. (see the sketch below)
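A sketch of the log correction on simulated data with multiplicative errors (illustrative, not course data):

    set.seed(3)
    x <- runif(100, 1, 10)
    y <- exp(0.3 * x + rnorm(100, sd = 0.3))    # multiplicative noise => fanning
    plot(fitted(lm(y ~ x)), resid(lm(y ~ x)))   # fanning visible on the raw scale
    fit_log <- lm(log(y) ~ x)                   # log-transformed response (needs y > 0)
    plot(fitted(fit_log), resid(fit_log))       # spread is now roughly constant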

 26. Detecting Violations of Independence

 27. Residuals vs. Order Plot
     • Helps assess serial correlation (a form of non-independence) of the error terms.
     • If the data are obtained in a time (or space) sequence, this plot helps to see if there is any correlation between the error terms that are near each other in the sequence.
     • It's only appropriate if you know the order in which the data were collected! (See the sketch below.)
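A minimal sketch, assuming the rows of the data are already in collection order and `fit` is the fitted model from earlier:

    r <- resid(fit)
    plot(seq_along(r), r, type = "b",
         xlab = "Observation order", ylab = "Residual")
    abline(h = 0, lty = 2)    # runs above or below 0 suggest serial correlation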

 28. Normal random noise

 29. A time trend

 30. Positive Serial Correlation
     • Residuals tend to be followed, in time, by residuals of the same sign and about the same magnitude.

 31. Negative Serial Correlation
     • Residuals of one sign tend to be followed, in time, by residuals of the opposite sign.

 32. Corrections for Independence Violations
     • Model the autocorrelation, either explicitly or implicitly.
     • Simple way - if the autocorrelation is simple lag-1 or seasonal:
       • remove the main effect and model the residuals.
     • More complex way - directly model the autocorrelation with a time series model:
       • ARIMA models can tell you if there are periodic effects and model them.
     • R functions: arima(), acf() (see the sketch below)
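A hedged sketch using the base R functions named above, for a fitted model `fit` whose rows are in time order:

    r <- resid(fit)
    acf(r)                               # a large lag-1 spike suggests serial correlation
    ar1 <- arima(r, order = c(1, 0, 0))  # fit an AR(1) model to the residuals
    ar1$coef                             # estimated lag-1 autocorrelation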

 33. Detecting Violations of Normality

 34. Normal (Probability) Plot of Residuals
     • Helps assess normality of the error terms.
     • If data follow a normal distribution with mean $\mu$ and variance $\sigma^2$, then a plot of percentiles of the normal distribution versus sample percentiles should be approximately linear. (See the sketch below.)
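In R, a minimal sketch for a fitted model `fit`:

    qqnorm(resid(fit))   # sample quantiles vs. theoretical normal quantiles
    qqline(resid(fit))   # reference line; the points should fall near it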

 35. Another example: Normal residuals

 36. Another example: Normal residuals but with one outlier

 37. Another example: Skewed (positive) residuals

 38. Another example: Heavy-tailed residuals

 39. Corrections for Normality Violations
     • Corrections for skewness and heavy tails:
       • transformations of the response variable (Box-Cox transformation, logarithm, etc.)
       • the goal is to make the distribution roughly bell-shaped (see the Box-Cox sketch below)
     • Corrections for bi-modality:
       • break the data up into two (or more) clusters
       • alternatively, use mixture models for the analysis
     • Corrections for outliers:
       • remove an outlier if it is an invalid point (erroneous data entry, wrong population, etc.)
       • fit a different functional relationship if it is a valid point: an outlier may suggest a curvilinear relationship rather than a linear one
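A sketch of the Box-Cox correction on simulated skewed data (the data are illustrative; boxcox() lives in the MASS package and requires a positive response):

    library(MASS)                      # for boxcox()
    set.seed(4)
    x <- runif(100, 1, 10)
    y <- exp(1 + 0.2 * x + rnorm(100, sd = 0.3))  # positively skewed response
    bc <- boxcox(lm(y ~ x))            # profile log-likelihood over lambda
    bc$x[which.max(bc$y)]              # near 0 => log transform; near 0.5 => sqrt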

 40. Checking Assumptions: Influence
     • H is called the hat matrix: $H = X (X^T X)^{-1} X^T$, so that $\hat{y} = Hy$.
     • The i-th diagonal element of H is the leverage $h_i$ of observation i.
     • The leverage $h_i$ quantifies the influence that the observed response $y_i$ has on its own predicted value $\hat{y}_i$.
     • It measures the distance between the x values for the i-th case and the means of the x values for all n cases.
     • Each leverage $h_i$ is a number between 0 and 1 inclusive.
     • Let's see a case where it goes wrong ... (sketch below)
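A sketch of computing leverage by hand and checking it against R's built-in, for a fitted model `fit`:

    X <- model.matrix(fit)                 # design matrix, intercept included
    H <- X %*% solve(t(X) %*% X) %*% t(X)  # the hat matrix
    h <- diag(H)                           # leverages h_i, each in [0, 1]
    all.equal(h, hatvalues(fit))           # the built-in gives the same values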

 41. Scottish Running Data

         distance  climb   time
     1        2.5    650  16.05
     2        6.0   2500  48.21
     3        6.0    900  33.39
     4        7.5    800  45.36
     5        8.0   3070  62.16
     6        8.0   2866  73.13
     …

 42. [Figure slide]
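A hedged sketch of hunting for the influential case: MASS::hills (columns dist, climb, time) appears to match this data set up to rounding, so it stands in for the course file here:

    library(MASS)
    hfit <- lm(time ~ dist + climb, data = hills)
    h <- hatvalues(hfit)
    h[h > 2 * mean(h)]     # rule-of-thumb flag for high-leverage races
    plot(hfit, which = 5)  # residuals vs. leverage, with Cook's distance contours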

 43. Residual Plots in R
     • hist() to create histograms
     • qq.plot()/qqPlot() (car package), qqplot(), or qqnorm() to create normal probability plots
     • plot() to create regular scatterplots
     • Getting the residuals and fitted values:
       • fit the regression model in the form reg <- lm(y ~ x)
       • get the residuals via reg$res (short for reg$residuals)
       • get the fitted values via reg$fit (short for reg$fitted.values)

 44. Interpretation of Results
     • Parameter estimates:
       • if the j-th predictor variable $x_j$ is increased by one unit, while all the other predictor variables are kept fixed, then the response variable y will increase by $a_j$
       • the conditional effect of each predictor, holding all others constant
       • the size of an effect (not its significance) depends on the units
     • Multiple correlation coefficient R²:
       • the ratio between the regression sum of squares (how much of the variance the regression explains) and the total sum of squares (how much variation there is altogether)
       • if it is close to 1, your fit is good - but be careful (see the sketch below)
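A sketch showing the sums-of-squares ratio directly, for a fitted model `fit`:

    y <- model.response(model.frame(fit))  # the observed response
    ss_tot <- sum((y - mean(y))^2)         # total sum of squares
    ss_res <- sum(resid(fit)^2)            # residual sum of squares
    1 - ss_res / ss_tot                    # matches summary(fit)$r.squared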

 45. Model Selection: Finding the Best k Variables
     • If noisy variables are included in the model, they can affect the overall performance.
     • Best to remove any predictors which have no effect, lest random patterns look significant.
     • How do we search over the 2^p possible models? Heuristic search over the model space:
       • forward search (greedy)
       • backward search (greedy)
       • generalizations (add or delete) - think of operators in a search space
       • branch-and-bound techniques (package 'leaps')
     • The score function has to penalize for complexity, or we can use cross-validation.
     • This type of variable selection problem is common to many data mining algorithms:
       • an outer loop that searches over variable combinations
       • an inner loop that evaluates each combination
     (See the sketch below.)
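A hedged sketch of both search styles on the prostate data (assuming, as before, lcavol as the response; regsubsets() is the branch-and-bound search in package leaps):

    library(leaps)
    subs <- regsubsets(lcavol ~ ., data = prostate)  # best subset of each size
    summary(subs)$bic                                # complexity-penalized scores

    full <- lm(lcavol ~ ., data = prostate)
    step(full, direction = "backward")               # greedy backward search by AIC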

 46. [Figure slide]

 47. Stepwise Cautions
     • Stepwise selection tends to be conservative, but can still remove good variables due to its greedy search.
     • It deals somewhat arbitrarily with multicollinearity: the interpretation of variable A changes when variable B disappears.
     • Elaborate techniques tend to overfit the data; using cross-validation can help (see the sketch below).
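A minimal k-fold cross-validation sketch in base R for comparing candidate models (the fold count and the smaller candidate formula are illustrative):

    set.seed(5)
    k <- 5
    folds <- sample(rep(1:k, length.out = nrow(prostate)))
    cv_mse <- function(form) {
      mean(sapply(1:k, function(i) {
        f <- lm(form, data = prostate[folds != i, ])          # train on k-1 folds
        pred <- predict(f, newdata = prostate[folds == i, ])  # predict held-out fold
        mean((prostate$lcavol[folds == i] - pred)^2)
      }))
    }
    cv_mse(lcavol ~ .)            # full model
    cv_mse(lcavol ~ age + lcp)    # a smaller candidate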

 48. Generalizing Linear Regression

 49. Complexity versus Goodness of Fit
     [Figure: training data, y vs. x]

 50. Complexity versus Goodness of Fit
     [Figures: training data, y vs. x, with a candidate fit - too simple?]
