
Statistical Inference and Regression Analysis: GB.3302.30




Presentation Transcript


  1. Statistical Inference and Regression Analysis: GB.3302.30 Professor William Greene Stern School of Business IOMS Department Department of Economics

  2. Inference and Regression Part 9 – Linear Model Topics

  3. Agenda • Variable Selection – Stepwise Regression • Partial Regression – The Meaning of Multiple Regression • Panel Data • Test of Regression Stability • Generalized Regression • Robust inference for OLS regression • Heteroscedasticity and weighted least squares • Autocorrelation and generalized least squares

  4. Inference and Regression Stepwise Regression

  5. Stepwise Regression • Start with (a) no model, or (b) the specific variables that are designated to be forced into whatever model is ultimately chosen • (A: Forward step) Add a variable: “Significant?” Include the most “significant” variable not already included. • (B: Backward step) Are variables already included in the equation now adversely affected by collinearity? If any variables have become “insignificant,” remove the least significant one. • Return to (A) • This can cycle back and forth for a while, though usually it does not. • Ultimately the procedure selects only variables that appear to be “significant.” (A sketch of the loop follows below.)
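
A minimal sketch of the loop just described, assuming ordinary least squares with t-ratio p-values as the inclusion/removal criterion and a 0.10 threshold. The function names, the dict of candidate columns, and the iteration cap are illustrative, not part of the original procedure.

```python
import numpy as np
from scipy import stats

def ols_pvalues(y, X):
    """OLS of y on a constant and the columns of X; p-values of the non-constant coefficients."""
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    e = y - Z @ b
    s2 = e @ e / (n - k - 1)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Z.T @ Z)))
    p = 2 * stats.t.sf(np.abs(b / se), n - k - 1)
    return p[1:]                      # drop the constant's p-value

def stepwise(y, candidates, alpha=0.10):
    """candidates: dict mapping variable name -> data column. Returns the selected names."""
    selected = []
    for _ in range(100):              # guard against the rare add/remove cycling noted above
        changed = False
        # (A) Forward step: add the most significant variable not already included
        best, best_p = None, alpha
        for name in candidates:
            if name not in selected:
                p = ols_pvalues(y, np.column_stack([candidates[v] for v in selected + [name]]))[-1]
                if p < best_p:
                    best, best_p = name, p
        if best is not None:
            selected.append(best)
            changed = True
        # (B) Backward step: remove the least significant variable if it has become insignificant
        if selected:
            p = ols_pvalues(y, np.column_stack([candidates[v] for v in selected]))
            worst = int(np.argmax(p))
            if p[worst] > alpha:
                selected.pop(worst)
                changed = True
        if not changed:
            break
    return selected
```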

  6. Stepwise Regression Feature

  7. Specify Procedure: all 10 predictors; optionally, a subset of predictors that must appear in the final model chosen. No need to change Methods or Options. I changed the P value for inclusion to .10.

  8. Used 0.10 as the cutoff “p-value” for inclusion or removal. All P values will be less than or equal to .10.

  9. Stepwise Regression • What’s Right with It? • Automatic – push button • Simple to use; not much thinking involved. • Relates in some way to the connections of the variables to each other – significance – not just R2 • What’s Wrong with It? • No reason to assume that the resulting model will make any sense • Test statistics are completely invalid and cannot be used for statistical inference. (They can’t be t ratios if you know in advance they will be larger than 2.)

  10. Inference and Regression Multiple Regression

  11. The Frisch-Waugh Theorem • Multiple Regression • Partialing out the effect of a variable

  12. U.S. Gasoline Market, 1953-2004

  13. Multiple Regression of logG on logPG and logY

  14. Two Side Regressions: Regress logG on a constant and logY and compute residuals RESLOGG. Regress logPG on a constant and logY and compute residuals RESLOGPG.
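
These two side regressions take only a few lines; here logG, logPG, and logY are assumed to be numpy arrays holding the logged series from the gasoline data (hypothetical names used only for this sketch).

```python
import numpy as np

def residuals_on_const_and(x, z):
    """Residuals from regressing x on a constant and z."""
    Z = np.column_stack([np.ones(len(z)), z])
    b, *_ = np.linalg.lstsq(Z, x, rcond=None)
    return x - Z @ b

res_logg  = residuals_on_const_and(logG,  logY)    # RESLOGG
res_logpg = residuals_on_const_and(logPG, logY)    # RESLOGPG

# Regression of residuals on residuals: this slope equals the coefficient on logPG
# in the multiple regression of logG on a constant, logPG and logY.
slope = (res_logg @ res_logpg) / (res_logpg @ res_logpg)
```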

  15. Interesting Plots Original regression of logG on a constant and logPg. The line slopes the wrong way. New regression of ReslogG on a constant and ReslogPg. The line slopes the right way.

  16. Regression of Residuals on Residuals

  17. NLOGIT

  18. Minitab Use CALC to compute logg=loge(g), logpg=loge(pg), logy=loge(pcincome). Regression of logg on logpg and logy. To save residuals, use Storage as above. Residuals are saved as RESI1 and RESI2 in the data area.

  19. Frisch-Waugh (1933) Theorem Context: The model contains two sets of variables: X = [ [1, time] | [other variables] ] = [X1, X2]. Regression model: y = X1β1 + X2β2 + ε (population) = X1b1 + X2b2 + e (sample). Problem: find an algebraic expression for the second set of least squares coefficients, b2.

  20. Frisch-Waugh Result “We get the same result whether we (1) detrend the other variables by using the residuals from a regression of them on a constant and a time trend and use the detrended data in the regression or (2) just include a constant and a time trend in the regression and not detrend the data” “Detrend the data” means compute the residuals from the regressions of the variables on a constant and a time trend.

  21. Partitioned Solution Method of solution (Why did F&W care? In 1933, matrix computation was not trivial!) Direct manipulation of the normal equations produces the solution shown on the next slide.

  22. Partitioned Solution Direct manipulation of the normal equations produces b2 = (X2’X2)-1X2’(y - X1b1). What is this? The regression of (y - X1b1) on X2.

  23. The Frisch-Waugh Result Continuing the algebraic manipulation, the solution for b2 is: b2 = [(X2’M1)(M1X2)]-1[(X2’M1)(M1y)], where M1 = I - X1(X1’X1)-1X1’ and M1X2 and M1y are residuals in regressions on X1. This is Frisch and Waugh’s famous result - the “double residual regression.” How do we interpret this? A regression of residuals on residuals. “We get the same result whether we (1) detrend the other variables by using the residuals from a regression of them on a constant and a time trend and use the detrended data in the regression or (2) just include a constant and a time trend in the regression and not detrend the data” “Detrend the data” means compute the residuals from the regressions of the variables on a constant and a time trend.
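
A quick numerical check of the double residual regression, using simulated data (the dimensions, seed, and coefficient values below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = np.column_stack([np.ones(n), np.arange(n)])     # constant and time trend
X2 = rng.normal(size=(n, 2))                         # the "other variables"
y = X1 @ [1.0, 0.05] + X2 @ [2.0, -1.0] + rng.normal(size=n)

# Full regression, partitioned as [b1, b2]
X = np.column_stack([X1, X2])
b_full = np.linalg.solve(X.T @ X, X.T @ y)

# Double residual regression with M1 = I - X1(X1'X1)^-1 X1'
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
b2 = np.linalg.solve((M1 @ X2).T @ (M1 @ X2), (M1 @ X2).T @ (M1 @ y))

print(np.allclose(b_full[2:], b2))                   # True: same coefficients on X2
```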

  24. Partial Regression Important terms in this context: partialing out the effect of X1; netting out the effect; “partial regression coefficients.” To continue belaboring the point: note the interpretation of partial regression as “net of the effect of …” Now, follow this through for the case in which X1 is just a constant term, a column of ones. What are the residuals in a regression on a constant? What is M1? Note that this produces the result that we can do linear regression on data in mean deviation form. “Partial regression coefficients” are the same as “multiple regression coefficients.” This follows from the Frisch-Waugh theorem.
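
For the special case where X1 is just the column of ones, M1 simply subtracts the sample means, so regressing demeaned y on demeaned x reproduces the slope from the regression with a constant. A minimal illustration with simulated data (names and values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3.0 + 2.0 * x + rng.normal(size=100)

# Slope with a constant included
b_const = np.linalg.lstsq(np.column_stack([np.ones(100), x]), y, rcond=None)[0][1]

# Slope from data in mean-deviation form (no constant needed)
b_dev = ((x - x.mean()) @ (y - y.mean())) / ((x - x.mean()) @ (x - x.mean()))

print(np.isclose(b_const, b_dev))   # True
```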

  25. Understanding Multiple Regression • In a multiple regression, the coefficient on an x is interpreted as the effect of a change in x on y, holding everything else constant. • That is, “net of the effect of everything else.” • How can y = a + b1Educ + b2Age + e? Each year of education means aging by 1 year. How is it possible to hold age constant and increase education by 1 year?

  26. Application – Health and Income German Health Care Usage Data, 7,293 individuals, varying numbers of periods. Data downloaded from the Journal of Applied Econometrics Archive. This is an unbalanced panel with 7,293 individuals and 27,326 observations in total. The number of observations ranges from 1 to 7 per family (frequencies: 1=1525, 2=2158, 3=825, 4=926, 5=1051, 6=1000, 7=987). Variables in the file are: DOCVIS = number of visits to the doctor in the observation period (the dependent variable of interest); HHNINC = household nominal monthly net income in German marks / 10000 (4 observations with income = 0 were dropped); HHKIDS = 1 if children under age 16 are in the household, 0 otherwise; EDUC = years of schooling; AGE = age in years.

  27. Multiple Regression

  28. Inference and Regression Panel Data

  29. The Fixed Effects Model: THE application of Frisch-Waugh. A regression model with a dummy variable for each individual in the sample, each observed Ti times: yi = Xiβ + diαi + εi for each individual. The dummy variables add N columns, and N may be thousands, i.e., the regression has thousands of variables (coefficients).

  30. Application – Health and Income The German Health Care Usage Data described above: 7,293 individuals and 27,326 observations, with DOCVIS as the dependent variable and HHNINC, HHKIDS, EDUC, and AGE as regressors. We also want to include a separate family effect (7,293 of them) for each family. This requires 7,293 dummy variables in addition to the four regressors.

  31. Estimating the Fixed Effects Model The FEM is a linear regression model but with many independent variables

  32. Fixed Effects Estimator (cont.)

  33. ‘Within’ Transformations

  34. Least Squares Dummy Variable Estimator • b is obtained by ‘within’ groups least squares (group mean deviations) • The normal equations for a are D’Xb + D’Da = D’y, which gives a = (D’D)-1D’(y – Xb)
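
A sketch of the ‘within’ computation described above: demean y and X within each group, obtain b by least squares on the deviations, then recover the group effects as group means of (y - Xb). The argument names and the integer group labels are assumptions of this sketch.

```python
import numpy as np

def fixed_effects(y, X, groups):
    """Within-groups (LSDV-equivalent) estimator.
    y: (n,), X: (n, k), groups: (n,) integer labels 0..G-1."""
    y, X = y.astype(float), X.astype(float)
    G = groups.max() + 1
    y_dm, X_dm = y.copy(), X.copy()
    for g in range(G):
        m = groups == g
        y_dm[m] -= y[m].mean()
        X_dm[m] -= X[m].mean(axis=0)
    # b from group-mean deviations ('within' least squares)
    b, *_ = np.linalg.lstsq(X_dm, y_dm, rcond=None)
    # a = (D'D)^-1 D'(y - Xb): the group means of y - Xb
    u = y - X @ b
    a = np.array([u[groups == g].mean() for g in range(G)])
    return b, a
```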

  35. Inference and Regression Chow Test

  36. Test of Structural Stability Two groups, cleverly labeled Group 1 and Group 2. The regression model applies to the two groups: yj = Xjβj + εj. Null hypothesis: β1 = β2. Test using an F statistic.

  37. Testing Strategy: Setup • Fit separate regressions for the two groups. • Separate coefficient vectors b1 and b2 • Each coefficient vector is bj = (Xj’Xj)-1Xj’yj • Sums of squares e1’e1 = (y1 - X1b1)’(y1 - X1b1) and e2’e2 = (y2 – X2b2)’(y2 – X2b2) • Total separate sum of squares = SS12 = e1’e1 + e2’e2 • Pooled regression b = (X1’X1 + X2’X2)-1 (X1’y1 + X2’y2) • Pooled sum of squares • SSpooled = e’e = (y1 - X1b)’(y1 - X1b) + (y2 – X2b)’(y2 – X2b) • SSpooled must be ≥ SS12. (A small numerical check of the pooled formula follows below.)
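
The pooled coefficient formula above is just least squares on the stacked data; a small numerical check with simulated groups (all names and values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X1, X2 = rng.normal(size=(60, 3)), rng.normal(size=(40, 3))
y1 = X1 @ [1.0, 2.0, 3.0] + rng.normal(size=60)
y2 = X2 @ [1.0, 2.0, 3.0] + rng.normal(size=40)

b_formula = np.linalg.solve(X1.T @ X1 + X2.T @ X2, X1.T @ y1 + X2.T @ y2)
b_stacked = np.linalg.lstsq(np.vstack([X1, X2]), np.concatenate([y1, y2]), rcond=None)[0]
print(np.allclose(b_formula, b_stacked))   # True
```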

  38. Testing Strategy • Rejection Regions • (1) b1 is very different from b2 • (2) SSpooled is much larger than SS12 • These are the same.

  39. Application • Health satisfaction depends on many factors: • Age, Income, Children, Education, Marital Status • Do these factors figure differently in a model for women compared to men? • Investigation: Multiple regression • Hypothesis: The regressions are the same. • Rejection Region: Estimated regressions that are very different.

  40. Equal Regressions • Setting: Two groups of observations (men/women, countries, two different periods, firms, etc.) • Regression Model: y = α+β1x1+β2x2 + … + ε • Hypothesis: The same model applies to both groups • Rejection region: Large values of F

  41. Procedure: Equal Regressions • There are N1 observations in Group 1 and N2 in Group 2. • There are K variables and the constant term in the model. • This test requires you to compute three regressions and retain the sum of squared residuals from each: • SS1 = sum of squares from N1 observations in group 1 • SS2 = sum of squares from N2 observations in group 2 • SSALL = sum of squares from NALL=N1+N2 observations when the two groups are pooled. • The test statistic is F = [(SSALL - SS1 - SS2)/(K+1)] / [(SS1 + SS2)/(NALL - 2K - 2)]. • The hypothesis of equal regressions is rejected if F is larger than the critical value from the F table (K+1 numerator and NALL-2K-2 denominator degrees of freedom). (A sketch of the three-regression procedure follows below.)
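
A sketch of the procedure, assuming y1, X1, y2, X2 hold the two groups’ data with X containing the K regressors but not the constant (all names are hypothetical):

```python
import numpy as np

def ss_resid(y, X):
    """Residual sum of squares from OLS of y on a constant and X."""
    Z = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    e = y - Z @ b
    return e @ e

def chow_F(y1, X1, y2, X2):
    K = X1.shape[1]                          # regressors excluding the constant
    n1, n2 = len(y1), len(y2)
    ss1 = ss_resid(y1, X1)
    ss2 = ss_resid(y2, X2)
    ss_all = ss_resid(np.concatenate([y1, y2]), np.vstack([X1, X2]))
    num = (ss_all - ss1 - ss2) / (K + 1)     # K + 1 restrictions (slopes and constant)
    den = (ss1 + ss2) / (n1 + n2 - 2 * (K + 1))
    return num / den
```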

  42. Health Satisfaction Models: Men vs. Women
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient  | Standard Error |    T   | P value| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
Women===|=[NW = 13083]================================================
Constant|  7.05393353      .16608124       42.473   .0000   1.0000000
AGE     |  -.03902304      .00205786      -18.963   .0000  44.4759612
EDUC    |   .09171404      .01004869        9.127   .0000  10.8763811
HHNINC  |   .57391631      .11685639        4.911   .0000   .34449514
HHKIDS  |   .12048802      .04732176        2.546   .0109   .39157686
MARRIED |   .09769266      .04961634        1.969   .0490   .75150959
Men=====|=[NM = 14243]================================================
Constant|  7.75524549      .12282189       63.142   .0000   1.0000000
AGE     |  -.04825978      .00186912      -25.820   .0000  42.6528119
EDUC    |   .07298478      .00785826        9.288   .0000  11.7286996
HHNINC  |   .73218094      .11046623        6.628   .0000   .35905406
HHKIDS  |   .14868970      .04313251        3.447   .0006   .41297479
MARRIED |   .06171039      .05134870        1.202   .2294   .76514779
Both====|=[NALL = 27326]==============================================
Constant|  7.43623310      .09821909       75.711   .0000   1.0000000
AGE     |  -.04440130      .00134963      -32.899   .0000  43.5256898
EDUC    |   .08405505      .00609020       13.802   .0000  11.3206310
HHNINC  |   .64217661      .08004124        8.023   .0000   .35208362
HHKIDS  |   .12315329      .03153428        3.905   .0001   .40273000
MARRIED |   .07220008      .03511670        2.056   .0398   .75861817
German survey data over 7 years, 1984 to 1991 (with a gap). 27,326 observations on Health Satisfaction and several covariates.

  43. Computing the F Statistic
+-----------------------------------------------------------------------------+
|                                  Women          Men            All          |
| LHS=HEALTH Mean              =   6.634172       6.924362       6.785662     |
|            Standard deviation=   2.329513       2.251479       2.293725     |
| Number of observs.           =   13083          14243          27326        |
| Model size Parameters        =   6              6              6            |
|            Degrees of freedom=   13077          14237          27320        |
| Residuals  Sum of squares    =   66677.66       66705.75       133585.3     |
|            Standard error of e = 2.258063       2.164574       2.211256     |
| Fit        R-squared         =   0.060762       0.076033       .070786      |
| Model test F (P value)       =   169.20(.000)   234.31(.000)   416.24(.0000)|
+-----------------------------------------------------------------------------+
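
Plugging the sums of squared residuals reported above into the F statistic (6 parameters per regression, so 6 restrictions and 27,326 - 12 = 27,314 denominator degrees of freedom):

```python
ss1, ss2, ss_all = 66677.66, 66705.75, 133585.3      # from the table above
F = ((ss_all - ss1 - ss2) / 6) / ((ss1 + ss2) / (27326 - 12))
# F is roughly 6.9, well above the 5% critical value of about 2.1 for F(6, 27314),
# so the hypothesis that the same regression applies to men and women is rejected.
```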

  44. Inference and Regression Generalized Regression

  45. Generalized Regression • “General” in that the main assumptions of the regression model are not met. • Heteroscedasticity • Serial correlation (autocorrelation)

  46. Conditional Homoscedasticity and Nonautocorrelation Disturbances provide no information about each other. • Var[εi | X] = σ2 • Cov[εi, εj | X] = 0

  47. Data on 18 OECD countries, 19 years, gasoline consumption: log(Gas/Car) = a + b1 logIncome + b2 logP + b3 log(Cars/Capita) + e

  48. Heteroscedasticity Countries are ordered by the standard deviation of their 19 residuals. Regression of log of per capita gasoline use on log of per capita income, gasoline price and number of cars per capita for 18 OECD countries for 19 years. The standard deviation varies by country.
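
One way to see the pattern described above: fit the pooled regression, then compute the standard deviation of each country’s 19 residuals. The array names (logGasCar, logIncome, logP, logCarsCap, country) are placeholders for the OECD panel, not the original variable names.

```python
import numpy as np

Z = np.column_stack([np.ones(len(logGasCar)), logIncome, logP, logCarsCap])
b, *_ = np.linalg.lstsq(Z, logGasCar, rcond=None)
e = logGasCar - Z @ b

# Standard deviation of the 19 residuals for each of the 18 countries
for c in np.unique(country):
    print(c, e[country == c].std(ddof=1))
```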

  49. Heteroscedasticity: Regression results for 6 (of 18) countries
