1. Multiple Regression Farrokh Alemi, Ph.D.
Kashif Haqqi M.D.
2. Additional Reading For additional reading see Chapter 15 and Chapter 14 in Michael R. Middleton’s Data Analysis Using Excel, Duxbury Thompson Publishers, 2000.
The example described in this lecture is based in part on Chapter 17 and Chapter 18 of Keller and Warrack's Statistics for Management and Economics, Fifth Edition, Duxbury Thompson Learning Publisher, 2000.
3. Objectives To learn the assumptions behind and the interpretation of multiple variable regression.
To use Excel to calculate multiple regression.
To test hypotheses using multiple regression.
4. Multiple Regression Model We assume that k independent variables are potentially related to the dependent variable through the equation:
y = b0 + b1 x1 + b2 x2 + … + bk xk + e
The objective is to find estimates of b0 through bk such that the sum of squared differences between the observed y and the predicted ŷ = b0 + b1 x1 + … + bk xk is minimized.
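The fitting step above can be sketched in Python with numpy's least-squares solver standing in for Excel's regression tool. The data here are hypothetical and simulated, not the lecture's health-center data:

```python
import numpy as np

# Hypothetical simulated data: y = b0 + b1*x1 + b2*x2 + error.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones for the intercept, then the x columns.
X = np.column_stack([np.ones(n), x1, x2])

# Least squares chooses b to minimize the sum of squared residuals (y - X @ b)^2.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With little noise, the estimated coefficients land near the true values 2.0, 1.5, and -0.5.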
5. Similarity With Single Variable Regression Same method of finding best fit with data by minimizing sum of square of residuals
Same assumptions regarding Normal distribution of residuals and constant standard deviation of residuals
New issues related to finding optimal combination of variables that can predict response variable
6. Multiple Regression in Excel Arrange y and x variables as columns with each case as a row
Select tools, data analysis, regression
Enter the range for Y variable
Enter the range for all X values
Select an output range and, at a minimum, request the normal probability plot and the residual plots
7. Example Examine which variable affects the profitability of health centers. Download data
Regress profit measure (profit divided by revenue) on:
Number of visits to competitors
Maximum distance among clinics in the center
Number of employers in the area
Percent of community enrolled in college
Median income of community in thousands
Distance to downtown
8. Regression Statistics 49% of variance in Y is explained by the regression equation
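The R-squared that Excel reports can be computed by hand as 1 − SSE/SST. A minimal sketch on hypothetical simulated data (not the lecture's dataset):

```python
import numpy as np

# Hypothetical simulated data with three predictors.
rng = np.random.default_rng(1)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=2.0, size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
sse = np.sum((y - y_hat) ** 2)      # residual (unexplained) variation
sst = np.sum((y - y.mean()) ** 2)   # total variation in y
r_squared = 1 - sse / sst           # share of variance in y explained
```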
9. ANOVA for Regression Null hypothesis: MSR equals MSE
Alternative hypothesis: MSR is greater than MSE
The F statistic is 17, with a near-zero probability of being observed under the null hypothesis
The null hypothesis is rejected
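The ANOVA F statistic comes from F = MSR/MSE, with MSR = SSR/k and MSE = SSE/(n − k − 1). A sketch on hypothetical simulated data:

```python
import numpy as np

# Hypothetical simulated data: two real predictors, n = 60 cases.
rng = np.random.default_rng(2)
n, k = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by regression
sse = np.sum((y - y_hat) ** 2)         # residual variation
msr = ssr / k                          # mean square for regression
mse = sse / (n - k - 1)                # mean square error
f_stat = msr / mse
# A large F (judged against the F distribution with k and n-k-1 degrees
# of freedom) rejects the null hypothesis that MSR equals MSE.
```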
10. Analysis of Coefficients Null hypotheses: Coefficients are zero
Alternative hypothesis: coefficients are different from zero
Are P values below 0.05?
All null hypotheses are rejected except college enrollment and distance to downtown
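Each coefficient's t statistic is its estimate divided by its standard error, where the standard errors come from the diagonal of MSE·(XᵀX)⁻¹. A sketch on hypothetical simulated data in which x1 truly matters and x2 does not:

```python
import numpy as np

# Hypothetical simulated data: x1 has a real effect, x2 has none.
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
mse = np.sum(resid ** 2) / (n - X.shape[1])
se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))  # standard errors
t_stats = b / se
# |t| well above roughly 2 rejects, at about the 0.05 level, the null
# hypothesis that the coefficient is zero.
```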
11. Discussion of Direction of Coefficients One additional visit to a competitor decreases operating margin by 0.01
One more mile of distance among clinics decreases operating margin by 1.65
One more employer in the community increases operating margin by 0.02
One thousand dollars more in median income decreases operating margin by 0.41
12. Check Assumptions Does the residual have a Normal distribution?
Is the variance of residuals constant?
Are errors independent?
Are there observations that are inaccurate or do not belong to the target population?
13. Does the Residual Have a Normal Distribution? Plot the Normal Probability Plot
It looks near Normal. The assumption seems reasonable
14. Is the Variance of Residuals Constant? Residuals seem randomly distributed
Range of the residuals at particular values of visits to competitors seems similar
15. Is the Variance of Residuals Constant? Residuals seem randomly distributed
Range of the residuals at particular values of visits to competitors seems similar
16. Is the Variance of Residuals Constant?
17. Is the Variance of Residuals Constant?
18. What if assumptions are violated? Consider non-linear regression (see options under trend-line)
Transform the response variable: instead of y, use whichever of the following best corrects the problem:
Log of y
y raised to a power, e.g., y² or √y
Reciprocal of y, i.e., 1/y
19. Example of Violation of Assumption Suppose in regression of Y on X we observed the plot to the right
Variance of residual depends on values of X
20. Correcting the Violation Create a new column named “transformed Y” which is the log of y
Repeat regression
The variability in the variance at different levels of x is reduced
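The effect of the log transform can be sketched with hypothetical simulated data in which the error is multiplicative, so the raw residual spread grows with x while the log-scale spread stays roughly constant:

```python
import numpy as np

# Hypothetical simulated data with multiplicative error:
# y = exp(1 + 2x + noise), so log(y) is linear in x with constant variance.
rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 1, size=n)
y = np.exp(1.0 + 2.0 * x + rng.normal(scale=0.3, size=n))

X = np.column_stack([np.ones(n), x])

def residuals(response):
    b, *_ = np.linalg.lstsq(X, response, rcond=None)
    return response - X @ b

r_raw = residuals(y)          # regression of y on x
r_log = residuals(np.log(y))  # regression of the transformed y on x

# Compare residual spread at low x versus high x.
low, high = x < np.median(x), x >= np.median(x)
ratio_raw = r_raw[high].std() / r_raw[low].std()
ratio_log = r_log[high].std() / r_log[low].std()
```

The spread ratio shrinks toward 1 after the transform, which is what "the variability in the variance is reduced" looks like numerically.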
21. What to Do If Variables Are Non Linear? Use nonlinear regression (see trend line command)
Or use linear regression on a transformed x:
Transform the x variable and create a new column of data. Choose the transformation based on the shape of the relationship you see in the data
Use x to power of a constant
Use log of x
Use reciprocal of x
Use the transformed column of data in the linear regression
22. Relationship Among Regression Components
23. Multicollinearity Problem in interpretation of regression coefficients when independent variables are correlated
24. Sample Problem Download data
Construct a measure of severity of substance abuse to predict length of stay of patients in treatment programs. The greater the severity, the shorter the stay.
30 patients were followed, and their length of stay as well as their scores on 10 co-morbidities were recorded. A higher score indicates more of the factor is present.
25. Correlations Between Response and Independent Variables Length of stay is related to individual variables.
4 to 46 percent of variance would be explained by a single variable.
For example, regress length of stay on the teen or elderly variable. The R2 explained is significant at alpha levels lower than 0.01.
26. Multiple Regression Note that adjusted R2 measures the percent of variance in Y (length of stay) explained.
31% is explained by the linear combination of the variables.
27. ANOVA Statistics We cannot reject the null hypothesis that the variation explained by the regression equals the variation in the residuals.
The regression model is not effective.
28. Test of coefficients Null hypothesis: Coefficient is zero
Alternative hypothesis: coefficient is different from zero
None of the null hypotheses can be rejected
29. But if we look at it in single variable regressions … Regress length of stay on teen or elderly variable
Coefficient is statistically significant
30. Why would a single variable relationship disappear when looking at it in a multiple regression?
31. Explanation of Collinearity Collinearity exists when independent variables are correlated
Collinearity inflates the sampling variability (standard errors) of the estimated coefficients
Previously significant relationships no longer are significant when they enter into the equation with other collinear variables
Conceptually the percent of variance explained by colinear independent variables is shared among the independent variables
32. Detection of Collinearity Exists in almost all situation except when full factorial designs are used to set up the experiment
The key question is how much collinearity is too much
A heuristic is that correlations above 0.30 are problematic
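The screening step is just a correlation matrix over the independent variables. A sketch with hypothetical simulated data in which one predictor is nearly a copy of another:

```python
import numpy as np

# Hypothetical simulated data: x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # highly collinear with x1
x3 = rng.normal(size=n)                  # unrelated predictor

corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
# By the 0.30 heuristic, the x1-x2 pair would be flagged as problematic.
```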
33. Correlations
34. How to Correct for Collinearity? Choose independent variable with low collinearity
Use stepwise regression
A procedure in which the most correlated variable is entered into the equation first; the remaining variance in Y is then explained by the next most correlated variable, and so on. In this procedure the order of entry of the variables matters.
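The procedure above can be sketched as forward stepwise selection: at each step, add the remaining variable that most improves R². A minimal sketch on hypothetical simulated data (real stepwise procedures also apply an entry/exit significance test, which is omitted here):

```python
import numpy as np

# Hypothetical simulated data: column 0 is the strongest predictor,
# column 1 is weaker, column 2 is irrelevant.
rng = np.random.default_rng(6)
n = 200
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

def r_squared(cols):
    # R^2 of the regression of y on the listed columns (plus intercept).
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

chosen, remaining = [], [0, 1, 2]
while remaining:
    # Enter the variable that most improves R^2 given what is already in.
    best = max(remaining, key=lambda j: r_squared(chosen + [j]))
    chosen.append(best)
    remaining.remove(best)
```

The entry order reflects how much variance each variable explains given the variables already included.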
35. Non Interval Independent Variables Sometimes the independent variable is measured on an ordinal or nominal scale, e.g., gender
To use regression, assign 0 to absence of the variable and 1 to presence, and use this indicator variable in your regression analysis
If there are more than two levels, use multiple indicator variables, one for each level except the reference level
36. Example of An Indicator Variable Type of clinician includes the following levels: psychiatrist, psychologist, counselor, social worker Use 3 indicator variables:
Presence of psychiatrist
Presence of psychologist
Presence of counselor
When all three are not present then it is assumed that the clinician was a social worker
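The coding described above can be sketched directly; the clinician list here is hypothetical example data:

```python
import numpy as np

# Hypothetical example cases; "social worker" is the reference level,
# so it gets no indicator column of its own.
clinicians = ["psychiatrist", "counselor", "social worker", "psychologist"]
levels = ["psychiatrist", "psychologist", "counselor"]

# One 0/1 indicator column per non-reference level.
indicators = np.array(
    [[1 if c == level else 0 for level in levels] for c in clinicians]
)
# A row of all zeros means the clinician was the reference level
# (social worker).
```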
37. Another Example of An Indicator Variable Patients diagnoses may be any of the following levels
No MI
MI
MI with complications Use 2 indicator variables:
Presence of MI
Presence of MI with complications
When both indicators are zero, then diagnoses is no MI
38. Test for Interactions Consider two variables x1 and x2. We had assumed a first-order linear model:
y = b0 + b1 x1 + b2 x2 + e
Sometimes there are interactions in first-order linear models, so we can instead look at:
y = b0 + b1 x1 + b2 x2 + b3 x1 x2 + e
To build the interaction term, multiply the x1 column of data with the x2 column of data and put the product in a separate column
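The steps above can be sketched on hypothetical simulated data that contains a true interaction effect:

```python
import numpy as np

# Hypothetical simulated data with a real interaction between x1 and x2.
rng = np.random.default_rng(7)
n = 150
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.5 * x2 + 2.0 * x1 * x2 + rng.normal(size=n)

# Multiply the two columns and include the product as its own predictor.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
# b[3] estimates the interaction coefficient (true value 2.0 here).
```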
39. Example Angry and Abused are independent variables
A new variable is created, named angry-and-abused, which is the multiplication of these two variables
Note that the new variable is 0 when either of the two components is zero
40. Regression with Interaction Terms Include the new column in the regression
Note that being abused is not related to length of stay
Note that being angry and abused is related
41. Test for Interactions (Continued) Previous example showed interaction term between two variables
You can include interaction terms between any pair of variables
But be careful not to have too many variables in the model
Number of observations should be at least 3-4 times number of variables in a regression equation
42. Which Interactions to Include? Do not go fishing for interaction terms in the data by including all interactions until something significant is found
Look at the underlying problem and think through if conceptually an interaction term makes sense
43. Take Home Lesson Multiple regression is similar to single variable regression in concept.
Similar F test for regression.
Similar t test for coefficients.
Similar concept of R².
Test the assumptions that residuals have a Normal distribution, constant variance, and are independent.
Test for collinearity.
Test for interactions. Test for non-linear relationships.