November 13, 2013 • Collect model fits for 4 problems • Return reports • VIFs • Launch chapter 11
Grocery Data Assignment • X3 (holiday) and X1 (cases shipped) do the job, and X2 adds nothing • Normality and constant variance okay • No need for quadratic terms or interactions • No need at all to square X3 (two-level factor)
Some notes • State conclusion (final model) up front • Report the model from a fit of X1 and X3, rather than from a fit with X1, X2, X3 that merely drops X2 • Check assumptions for the X1, X3 model, not the model with X2 as well • Box-Cox suggests no transformation even though λ = 2 is "best" • Interaction bit
Notes on writing aspects • Avoid the imperative form of verbs ("Fit the multivariate." "Run the model." "Be good." — the implied (You) Verb … form) • Don't use contractions. It's bad form. They're considered informal. • Spell check does not catch wrong words (e.g., "blow" instead of "below," "not" instead of "note") • Writing skills are important (the benefits are considerable)
VIFs (not BFFs) • Variance Inflation Factors • (1 − R²)⁻¹, where R² is the R-squared from regressing Xi on the other X's • Available in JMP if you know where to look
x3 appears to be both a “dud” for predicting y and not very collinear with either x1 or x2
In computing VIFs, we need to regress x3 on x1 and x2 and compute 1/(1 − RSquare), which here is about 100. What gives?
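The VIF arithmetic can be sketched in Python. The data below are made up (the class variables are not reproduced here), constructed so that x3 is nearly uncorrelated with x1 pairwise yet has a VIF near 100, mirroring the "dud but collinear" situation on the slide:

```python
import numpy as np

# Synthetic data: x2 is nearly a copy of x1, and x3 is almost exactly x1 - x2,
# so x3 looks like a "dud" pairwise but is severely collinear with (x1, x2).
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = x1 - x2 + rng.normal(scale=0.01, size=n)

def vif(target, others):
    """VIF = 1 / (1 - R^2), R^2 from regressing `target` on the `others`."""
    X = np.column_stack([np.ones(len(target))] + list(others))
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    ssto = (target - target.mean()) @ (target - target.mean())
    r2 = 1 - (resid @ resid) / ssto
    return 1.0 / (1.0 - r2)

print("pairwise corr(x1, x3):", np.corrcoef(x1, x3)[0, 1])
print("VIF for x3:", vif(x3, [x1, x2]))   # large despite the small pairwise correlation
```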
Added variable plot to the rescue • Regress y on x1 and x3 on x1 and save the residuals • Note that 1.6084963 matches the earlier value • (Suggest you run the other way as well.)
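The added-variable-plot logic can be checked numerically. The data below are synthetic, not the class data, so the numbers will not reproduce 1.6084963; the point is the identity itself: the slope of the residual-vs-residual plot equals x3's coefficient in the full multiple regression.

```python
import numpy as np

# Synthetic data with a known coefficient on x3.
rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
x3 = 0.5 * x1 + rng.normal(size=n)
y = 2.0 + 1.0 * x1 + 1.6 * x3 + rng.normal(scale=0.3, size=n)

def resid(v, *xs):
    """Residuals of v regressed on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(v))] + list(xs))
    b, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ b

ey = resid(y, x1)                        # residuals of y on x1
ex3 = resid(x3, x1)                      # residuals of x3 on x1
av_slope = (ex3 @ ey) / (ex3 @ ex3)      # slope of the added-variable plot

# Full multiple regression of y on x1 and x3: b[2] is x3's coefficient.
X = np.column_stack([np.ones(n), x1, x3])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(av_slope, b[2])                    # these two agree exactly
```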
Interesting body fat example • Pairwise, it looks like very little collinearity • However, massive multicollinearity! • This is why it can be challenging at times!!! • Why we don't throw extra dud variables into the model
Steps in the analysis • Multivariate to get acquainted with the data • (Analyze > Distribution for all variables) • Looking for a decent model—parsimonious • Linear, interactions, quadratic • Stepwise if many variables • PRESS vs. root mean square error • Added variable plots according to taste • Check assumptions (lots of plots) • Check for outliers, influential observations (hats and Cook's Di)
Chapter 11: Remedial Measures We’ll cover in some detail: 11.1 Weighted Least Squares 11.2 Ridge regression 11.4 Regression trees 11.5 Bootstrapping
11.1 Weighted Least Squares Suppose that the constant variance assumption does not hold. Each residual has a different variance, but keep the zero covariances: σ²{εi} = σi², with σ{εi, εj} = 0 for i ≠ j. Least squares is out—what should we do?
Use Maximum Likelihood for inspiration! Likelihood: L(β) = ∏i (2πσi²)^(−1/2) exp[−(Yi − Xi′β)² / (2σi²)]. Now define the i-th weight to be wi = 1/σi². Then the likelihood is: L(β) = ∏i (wi/2π)^(1/2) exp[−wi(Yi − Xi′β)² / 2].
Taking logarithms, we get: the log likelihood is a constant plus −(1/2) Σi wi(Yi − Xi′β)². The criterion is the same as least squares, except each squared residual is weighted by wi -- hence the weighted least squares criterion Qw = Σi wi(Yi − Xi′β)². The coefficient vector bw that minimizes Qw is the vector of weighted least squares estimates.
Matrix Approach to WLS Let W = diag(w1, …, wn). Then: bw = (X′WX)⁻¹X′WY, with σ²{bw} = (X′WX)⁻¹.
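The matrix formula is a two-liner in numpy. A minimal sketch with made-up data, where the weights are assumed known (σi proportional to x):

```python
import numpy as np

# Synthetic heteroscedastic data: true line y = 3 + 2x, noise sd grows with x.
rng = np.random.default_rng(2)
n = 60
x = np.linspace(1, 10, n)
sigma = 0.5 * x                        # assumed-known standard deviations
y = 3.0 + 2.0 * x + rng.normal(scale=sigma)

X = np.column_stack([np.ones(n), x])
W = np.diag(1.0 / sigma**2)            # wi = 1 / sigma_i^2

# bw = (X'WX)^{-1} X'WY, solved without forming the inverse explicitly.
b_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(b_w)                             # close to (3, 2)
```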
Usual Case: Variances are Unknown Need to estimate each variance! Recall: σi² = E{εi²} (since E{εi} = 0). Give a statistic that can estimate σi²: the squared residual ei². Give a statistic that can estimate σi: the absolute residual |ei|.
Estimating a Standard Deviation Function Step 1: Do ordinary least squares; obtain the residuals. Step 2: Regress the absolute values of the residuals against Ŷ, or whatever predictor(s) seem to be associated with changes in the variances of the residuals. Step 3: Use the predicted absolute residual for case i as ŝi, the estimated standard deviation of ei. Step 4: Then wi = 1/ŝi²
Subset x and y for Table 11.1 • Fit y on x and save the residuals; compute the absolute values of the residuals • Regress these absolute residuals on x • The predicted values are the estimated standard deviations • Weights are the reciprocals of the squared standard deviations • Use these weights with WLS on the original y and x variables to get ŷ = 55.566 + .5963x
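The steps above can be sketched end to end in Python. The data here are simulated with a similar structure (the actual Table 11.1 values are not reproduced), so the fitted coefficients will differ from 55.566 and .5963:

```python
import numpy as np

# Simulated data with variance increasing in x, roughly like the textbook example.
rng = np.random.default_rng(3)
n = 54
x = rng.uniform(20, 60, size=n)
y = 55.0 + 0.6 * x + rng.normal(scale=0.1 * x)

def ols(X, v):
    return np.linalg.lstsq(X, v, rcond=None)[0]

X = np.column_stack([np.ones(n), x])
e = y - X @ ols(X, y)                  # Step 1: OLS residuals
g = ols(X, np.abs(e))                  # Step 2: regress |e| on x
s_hat = np.clip(X @ g, 1e-6, None)     # Step 3: predicted |e| estimates sigma_i
w = 1.0 / s_hat**2                     # Step 4: weights = 1 / s_hat^2

W = np.diag(w)
b_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # WLS fit with estimated weights
print(b_w)                             # intercept and slope near (55, 0.6)
```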
Notes on WLS Estimates • WLS estimates are minimum variance, unbiased. • If you use Ordinary Least Squares (OLS) when variance is not constant, the estimates are still unbiased, just not minimum variance. • If you have replicates at each unique X category, you can just use the sample standard deviation of the responses in each category to determine the weight for any response in that category. • R2 has no clear-cut meaning here. • Must use the standard deviation function value (instead of s) for confidence intervals for prediction
11.2 Ridge Regression Biased regression to reduce the effect of multicollinearity. Shrinkage estimation: reduce the variance of the parameters by shrinking them (a bit) in absolute magnitude. This will introduce some bias, but may reduce the MSE overall. Recall: MSE = bias squared plus variance: MSE{b} = (E{b} − β)² + σ²{b}.
How to Shrink? Penalized least squares! Start with the standardized regression model: Yi* = β1* Xi1* + … + βp−1* Xi,p−1* + εi. Add a "penalty" proportional to the total size of the parameters (the proportionality or biasing constant is c): QR = Σi (Yi* − Σk βk* Xik*)² + c Σk (βk*)².
Matrix Ridge Solution bR = (rXX + cI)⁻¹ rYX, where rXX is the correlation matrix of the X's and rYX is the vector of their correlations with Y. Start with small c and increase (iteratively) until the coefficients stabilize. The plot is called a "ridge trace." Here, use c about equal to .02.
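A ridge trace can be sketched in a few lines of numpy. The collinear data and the grid of c values below are illustrative, not the class example:

```python
import numpy as np

# Two highly collinear predictors; the OLS (c = 0) coefficients are unstable.
rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)
y = x1 + x2 + rng.normal(size=n)

def corr_transform(v):
    """Correlation transformation: standardized and scaled by sqrt(n - 1)."""
    return (v - v.mean()) / (v.std(ddof=1) * np.sqrt(len(v) - 1))

Xs = np.column_stack([corr_transform(x1), corr_transform(x2)])
ys = corr_transform(y)
rXX = Xs.T @ Xs          # correlation matrix of the predictors
rYX = Xs.T @ ys          # correlations of the predictors with y

# Ridge trace: bR = (rXX + cI)^{-1} rYX for increasing biasing constants c.
for c in [0.0, 0.002, 0.01, 0.02, 0.1]:
    b_ridge = np.linalg.solve(rXX + c * np.eye(2), rYX)
    print(c, b_ridge)    # coefficients shrink and stabilize as c grows
```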
Decision trees… section 11.4 • KDnuggets.com suggests… • A bit of history • Breiman et al. • Respectability • Oldie but goodie slides SA • Titanic data • Some bootstrapping stuff (probably not tonight)
Data mining and Predictive Modeling • Predictive modeling mainly involves the application of: • Regression • Logistic regression • Regression trees • Classification trees • Neural networks • to very large data sets. • The difference, technically, is that because we have so much data, we can rely on validation techniques---the use of training, validation, and test sets to assess our models. There is much less concern about: • Statistical significance (everything is significant!) • Outliers/influence (a few outliers have no effect) • Meaning of coefficients (models may have thousands of predictors) • Distributional assumptions, independence, etc.
Data mining and Predictive Modeling We will talk about some of the statistical techniques used in predictive modeling, once the data have been gathered, cleaned, organized. But data gathering usually involves merging disparate data from different sources, data warehouses, etc., and usually represents at least 80% of the work. General Rule: Your organization’s data warehouse will not have the information you need to build your predictive model. (Paraphrased, Usama Fayyad, VP data, Yahoo)
Regression Trees Idea: Can we cut up the predictor space into rectangles such that the response is roughly constant in each rectangle, but the mean changes from rectangle to rectangle? We'll just use the sample average (Ȳ) in each rectangle as our predictor! Simple, easy-to-calculate, assumption-free, nonparametric regression method. Note there is no "equation." The predictive model takes the form of a decision tree.
Steroid Data • See file ch11ta08steroidSplitTreeCalc.jmp • Overall Average of y is 17.64; SSE is 1284.8
Example: Steroid Data Predictive Model [Tree diagram: fitted region means 3.55, 8.133, 13.675, 16.95, 22.2] Example: What is Ŷ at Age = 9.5?
How do we find the regions (i.e., grow the tree)? For one predictor X, it's easy. Step 1: To find the first split point Xs, make a grid of possible split points along the X axis. Each possible split point divides the X axis into two regions, R1 and R2. Now compute SSE for the two-region regression tree: SSE = Σ_R1 (Yi − Ȳ_R1)² + Σ_R2 (Yi − Ȳ_R2)². Do this for every grid point. The point that leads to the minimum SSE is the split point. Step 2: If you now have r regions, determine the best split point for each of the r regions as you did in Step 1; choose the one that leads to the lowest SSE for the r + 1 regions. Step 3: Repeat Step 2 until SSE levels off (more later on stopping)
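Step 1's grid search is easy to sketch in Python. The (x, y) values below are made up for illustration, not the textbook's steroid data:

```python
import numpy as np

# Made-up one-predictor data with an obvious level shift.
x = np.array([2.0, 3.0, 5.0, 8.0, 10.0, 12.0, 14.0, 17.0, 20.0, 23.0])
y = np.array([1.0, 1.5, 2.0, 12.0, 14.0, 15.0, 20.0, 21.0, 5.0, 4.0])

def sse(v):
    """Sum of squared deviations about the region mean (0 for an empty region)."""
    return float(np.sum((v - v.mean()) ** 2)) if len(v) else 0.0

# Grid of candidate split points: midpoints between adjacent sorted x values.
best_split, best_sse = None, np.inf
for s in (x[:-1] + x[1:]) / 2:
    left, right = y[x <= s], y[x > s]
    total = sse(left) + sse(right)     # SSE of the two-region tree
    if total < best_sse:
        best_split, best_sse = s, total

print(best_split, best_sse)            # the split lands between x = 5 and x = 8
```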
In the JMP file, aforementioned… • Point out the calculations needed to determine the optimal first split • Easy but a bit tedious • Binary vs. multiple splits • Run it in JMP; be sure to set the minimum split size • Fit a conventional model as well…
Growing the Steroid Level Tree [Four panels showing the fitted tree after Split 1, Split 2, Split 3, and Split 4]
When do we stop growing? If you let the growth process go on forever, you'll eventually have n regions, each with just one observation. The mean of each region is the value of that observation, and R2 = 100%. (You fitted n means (parameters), so you have n – n = 0 degrees of freedom for error.) Where to stop?? We do this by data-splitting and cross-validation. After each split, use your model (tree) to predict each observation in a hold-out sample and compute MSPR or R2 for the holdout. As we saw with OLS regression, MSPR will start to increase (R2 for the holdout will decrease) when we overfit. We can rely on this because we have very large sample sizes.
What about multiple predictors? For two or more predictors, no problem. For each region, we have to determine the best predictor to split on AND the best split point for that predictor. So if we have p – 1 predictors and at stage r we have r regions, there are r(p – 1) region-predictor combinations to search. Example: Three splits for two predictors
Using JMP for Regression Trees • Analyze >> Modeling >> Partition • Exclude at least 1/3 for validation sample using: Rows >> Row Selection >> Select Randomly; then Rows >> Exclude • JMP will automatically give the predicted R2 value (1 – SSE/SSTO for the validation set) • You need to manually call for a split (doesn’t fit the tree automatically)
Split button • Note the R2 for the hold-out sample: as you grow the tree this value will peak and begin to decline! • Clicking the red triangle gives options: select "split history" to see a plot of predicted R2 vs. number of splits
Classification Trees • Regression tree equivalent of logistic regression • Response is binary 0-1; the average response in each region is now p̂, not Ȳ • For each possible split point, instead of SSE, we compute the G2 statistic for the resulting contingency table of split side by response. Split goes to the largest (most significant) value. (Can also use the negative of the log(p-value), where the p-value is adjusted in a Bonferroni-like manner. This is called the "LogWorth" statistic. Again, you want a large value.)
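The G2 (likelihood-ratio chi-square) computation for a single candidate binary split can be sketched directly; with a binary response this is a 2-by-2 table of split side by response. The counts below are invented for illustration:

```python
import math

# Hypothetical 2x2 table for one candidate split: rows are split sides,
# columns are the 0/1 response counts. A larger G2 means a better split.
table = [[30, 10],   # left of split:  30 zeros, 10 ones
         [12, 28]]   # right of split: 12 zeros, 28 ones

n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

# G2 = 2 * sum over cells of observed * ln(observed / expected).
g2 = 0.0
for i in range(2):
    for j in range(2):
        obs = table[i][j]
        exp = row_tot[i] * col_tot[j] / n
        if obs > 0:
            g2 += 2 * obs * math.log(obs / exp)

print(g2)   # compare across candidate splits; the largest G2 wins
```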