
Stat 112: Lecture 7 Notes



  1. Stat 112: Lecture 7 Notes • Homework 2: Due next Thursday • The Multiple Linear Regression model (Chapter 4.1) • Inferences from multiple regression analysis (Chapter 4.2)

  2. Interpretation of Regression Coefficients • Gas mileage regression from Car89.JMP

  3. Partial Slopes vs. Marginal Slopes • Multiple Linear Regression Model: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + e$ • The coefficient $\beta_k$ is a partial slope. It indicates the change in the mean of $y$ that is associated with a one-unit increase in $x_k$ while all other variables are held fixed. • A marginal slope is obtained when we perform a simple regression with only one $X$, ignoring all other variables; consequently, the other variables are not held fixed (see the sketch below).
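The distinction can be seen by fitting both models to the same data. Although the course uses JMP, here is a minimal Python sketch using statsmodels; the file name car89.csv and the exact column spellings (mpg, horsepower, weight) are hypothetical stand-ins for the Car89.JMP variables.

```python
# Sketch: partial vs. marginal slope (hypothetical file and column names).
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("car89.csv")  # assumed CSV export of Car89.JMP

# Partial slope: coefficient on weight with horsepower held fixed.
partial = smf.ols("mpg ~ horsepower + weight", data=cars).fit()

# Marginal slope: coefficient on weight ignoring horsepower.
marginal = smf.ols("mpg ~ weight", data=cars).fit()

print("partial slope: ", partial.params["weight"])
print("marginal slope:", marginal.params["weight"])
```

When weight and horsepower are correlated, the two slopes can differ substantially, and can even have opposite signs.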

  4. Partial vs. Marginal Slopes Example

  5. Partial Slopes vs. Marginal Slopes: Another Example • To evaluate the benefits of a proposed irrigation scheme in a certain region, suppose that the relation of yield $Y$ to rainfall $R$ is investigated over several years. • The data are in rainfall.JMP.

  6. Higher rainfall is associated with lower temperature.

  7. Rainfall is estimated to be beneficial once temperature is held fixed. Multiple regression provides a better picture of the benefits of an irrigation scheme because temperature would be held fixed in an irrigation scheme.

  8. Inferences about Regression Coefficients • Confidence intervals: a $100(1-\alpha)\%$ confidence interval for $\beta_k$ is $\hat{\beta}_k \pm t_{\alpha/2,\, n-(K+1)} \cdot SE(\hat{\beta}_k)$. Degrees of freedom for $t$ equal $n-(K+1)$. The standard error of $\hat{\beta}_k$, $SE(\hat{\beta}_k)$, is found on the JMP output. • Hypothesis test of $H_0: \beta_k = 0$ vs. $H_a: \beta_k \neq 0$: reject $H_0$ if $|t| = |\hat{\beta}_k / SE(\hat{\beta}_k)| \geq t_{\alpha/2,\, n-(K+1)}$, or equivalently if the p-value $\leq \alpha$, where the p-value for testing $H_0$ is printed in the JMP output under Prob>|t| (see the sketch below).
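As a sketch of these computations outside JMP, statsmodels reports the same standard errors, t statistics, and Prob>|t| p-values; the file and column names are again hypothetical stand-ins for the Car89.JMP variables.

```python
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("car89.csv")  # assumed CSV export of Car89.JMP
fit = smf.ols("mpg ~ horsepower + weight + cargo + seating", data=cars).fit()

# 95% confidence intervals: beta_hat_k +/- t_{.025, n-(K+1)} * SE(beta_hat_k)
print(fit.conf_int(alpha=0.05))

# t statistics and two-sided p-values (JMP's Prob>|t|)
print(fit.tvalues)
print(fit.pvalues)
```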

  9. Inference Examples • Find a 95% confidence interval for $\beta_k$. • Is seating of any help in predicting gas mileage once horsepower, weight, and cargo have been taken into account? Carry out a test at the 0.05 significance level.

  10. Checking Assumptions: Multiple Linear Regression Model • The expected value of the disturbances is zero for each $i$: $E(e_i \mid x_{i1}, \ldots, x_{iK}) = 0$. • The variance of each $e_i$ is equal to $\sigma_e^2$, i.e., $Var(e_i) = \sigma_e^2$. • The $e_i$ are normally distributed. • The $e_i$ are independent.

  11. Plots for Checking Assumptions • We can construct residual plots of each explanatory variable $X_k$ vs. the residuals. We save the residuals by clicking the red triangle next to Response after fitting the model, clicking Save Columns, and then Residuals. We then plot $X_k$ vs. the residuals using Fit Y by X (where Y = the residuals). We can plot a horizontal line at 0 by using Fit Y by X (it is a property of multiple linear regression that the least squares line for the regression of the residuals on any $X_k$ is a horizontal line). • A useful summary of the residual plots for each explanatory variable is the Residual by Predicted plot that is automatically produced after using Fit Model. The Residual by Predicted plot is a plot of the predicted values $\hat{y}_i$ vs. the residuals $e_i = y_i - \hat{y}_i$.

  12. Checking Assumptions • Linearity: Check that in the Residual by Predicted plot, the mean of the residuals for each range of the predicted values is about zero. Check that in each residual plot, the mean of the residuals for each range of the explanatory variable is about zero. • Constant variance: Check that in the Residual by Predicted plot, the spread of the residuals is about the same for each range of the predicted values. • Normality: Plot a histogram of the residuals and check that it is bell shaped. A sketch of these diagnostic plots follows.
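Here is a minimal Python sketch of these diagnostic plots, under the same hypothetical file and column names as the earlier sketches.

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

cars = pd.read_csv("car89.csv")  # assumed CSV export of Car89.JMP
fit = smf.ols("mpg ~ horsepower + weight + cargo + seating", data=cars).fit()

resid = fit.resid          # residuals e_i = y_i - yhat_i
fitted = fit.fittedvalues  # predicted values yhat_i

# Residual by Predicted plot: the mean should be about 0 in every range
# of predicted values, with roughly constant spread.
plt.scatter(fitted, resid)
plt.axhline(0)
plt.xlabel("predicted mpg")
plt.ylabel("residual")
plt.show()

# Residual plot for each explanatory variable.
for x in ["horsepower", "weight", "cargo", "seating"]:
    plt.scatter(cars[x], resid)
    plt.axhline(0)
    plt.xlabel(x)
    plt.ylabel("residual")
    plt.show()

# Normality: the histogram of the residuals should look bell shaped.
plt.hist(resid, bins=20)
plt.xlabel("residual")
plt.show()
```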

  13. The Residual by Predicted plot does not suggest nonlinearity and suggests approximately constant variance. The plot of horsepower vs. residuals suggests linearity is okay. The plot of weight vs. residuals suggests linearity is okay; one potential concern is that the highest-weight cars all have negative residuals.

  14. The plot of residuals vs. horsepower suggests linearity is okay, although the four highest-horsepower cars all have negative residuals while the next five highest-horsepower cars all have positive residuals. The plot of residuals vs. seating suggests linearity is not perfect for seating: residuals for low and high seating seem to have a mean below zero.

  15. Coefficient of Determination • The coefficient of determination $R^2$ for multiple regression is defined as for simple linear regression: $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ • It represents the percentage of variation in $y$ that is explained by the multiple regression. • $R^2$ is between 0 and 1; the closer to 1, the better the fit of the regression equation to the data (see the computation sketched below).
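A short sketch of this computation, continuing the hypothetical Car89 setup from the earlier sketches; the last two lines check the hand computation against statsmodels' own value.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("car89.csv")  # assumed CSV export of Car89.JMP
fit = smf.ols("mpg ~ horsepower + weight + cargo + seating", data=cars).fit()

sse = np.sum(fit.resid ** 2)                           # sum of squared residuals
sst = np.sum((cars["mpg"] - cars["mpg"].mean()) ** 2)  # total sum of squares
print(1 - sse / sst)  # R^2 computed by hand
print(fit.rsquared)   # should agree with statsmodels' value
```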

  16. Assessing Quality of Prediction (Chapter 3.5.3) • R squared is a measure of the fit of the regression to the sample data. It is not generally considered an adequate measure of the regression's ability to predict the responses for new observations. • One method of assessing the ability of the regression to predict the responses for new observations is data splitting. • We split the data into two groups: a training sample and a holdout sample (also called a validation sample). We fit the regression model to the training sample and then assess the quality of the model's predictions on the holdout sample.

  17. College Data in collegeclass.JMP • Training sample: first 40 observations. • Holdout sample: last 10 observations. • Mean Squared Deviation (MSD): the mean squared prediction error over the $n_2$ (= 10 here) observations in the holdout sample, $MSD = \frac{1}{n_2} \sum_{i=1}^{n_2} (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the prediction for holdout observation $i$ from the model fit to the training sample (see the sketch below).
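A minimal sketch of data splitting and the MSD computation; the file name collegeclass.csv and the column names (gpa, sat, hs_rank) are hypothetical stand-ins for the collegeclass.JMP variables, but the 40/10 split matches the slide.

```python
import pandas as pd
import statsmodels.formula.api as smf

college = pd.read_csv("collegeclass.csv")  # assumed CSV export of collegeclass.JMP

train = college.iloc[:40]    # training sample: first 40 observations
holdout = college.iloc[40:]  # holdout sample: last 10 observations

# Fit the regression to the training sample only (hypothetical variables).
model = smf.ols("gpa ~ sat + hs_rank", data=train).fit()

# Mean Squared Deviation over the n2 = 10 holdout observations.
pred = model.predict(holdout)
msd = ((holdout["gpa"] - pred) ** 2).mean()
print(msd)
```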
