
Chapter 13

Chapter 13. Simple Linear Regression & Correlation Inferential Methods. Deterministic Models.

paul

Presentation Transcript


  1. Chapter 13 Simple Linear Regression & Correlation Inferential Methods © 2008 Brooks/Cole, a division of Thomson Learning, Inc.

  2. Deterministic Models • Consider the two variables x and y. A deterministic relationship is one in which the value of y (the dependent variable) is completely determined by the value of x (the independent variable) through some formula or mathematical notation such as $y = f(x)$, $y = 3 + 2x$, or $y = 5e^{-2x}$.

  3. Probabilistic Models • A description of the relation between two variables x and y that are not deterministically related can be given by specifying a probabilistic model. • The general form of an additive probabilistic model allows y to be larger or smaller than f(x) by a random amount, e. • The model equation is of the form y = (deterministic function of x) + (random deviation) = f(x) + e.

  4. Probabilistic Models [Figure: deviations from the deterministic part of a probabilistic model; one labeled deviation is e = -1.5.]

  5. Simple Linear Regression Model The simple linear regression model assumes that there is a line with vertical (y) intercept α and slope β, called the true or population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made, $y = \alpha + \beta x + e$. Without the random deviation e, all observed (x, y) points would fall exactly on the population regression line. The inclusion of e in the model equation allows points to deviate from the line by random amounts.

  6. Simple Linear Regression Model [Figure: population regression line (slope β) with vertical intercept α, showing an observation when x = x1 and an observation when x = x2 together with their deviations from the line.]

  7. Basic Assumptions of the Simple Linear Regression Model • The distribution of e at any particular x value has mean value 0 (µe = 0). • The standard deviation of e (which describes the spread of its distribution) is the same for any particular value of x. This standard deviation is denoted by σ. • The distribution of e at any particular x value is normal. • The random deviations e1, e2, …, en associated with different observations are independent of one another.
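
The model equation and the four assumptions can be illustrated with a small simulation. This is only a sketch: the function name and the parameter values (α = 3, β = 2, σ = 1.5) are hypothetical choices for illustration and do not come from the slides.

```python
import random

def simulate_observations(alpha, beta, sigma, x_values, seed=0):
    """Draw one y for each x from y = alpha + beta*x + e, with e ~ Normal(0, sigma)."""
    rng = random.Random(seed)
    observations = []
    for x in x_values:
        # Each deviation has mean 0, the same sigma at every x, a normal
        # distribution, and is drawn independently of the others.
        e = rng.gauss(0.0, sigma)
        observations.append((x, alpha + beta * x + e))
    return observations

# Hypothetical parameter values, chosen only for illustration.
points = simulate_observations(alpha=3.0, beta=2.0, sigma=1.5, x_values=[1, 2, 3, 4, 5])

# Vertical distance of each simulated point from the population line.
deviations = [y - (3.0 + 2.0 * x) for x, y in points]
```

Setting `sigma=0.0` removes the random deviation, so every point falls exactly on the population line.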

  8. More About the Simple Linear Regression Model For any fixed x value, y itself has a normal distribution, with (mean y value for fixed x) = $\alpha + \beta x$ and (standard deviation of y for fixed x) = $\sigma$.

  9. Interpretation of Terms The slope β of the population regression line is the mean (average) change in y associated with a 1-unit increase in x. The vertical intercept α is the height of the population line when x = 0. The size of σ determines the extent to which the (x, y) observations deviate from the population line. [Figure: two scatterplots about the population line, one with small σ and one with large σ.]

  10. Illustration of Assumptions

  11. Estimates for the Regression Line The point estimates of β, the slope, and α, the y intercept of the population regression line, are the slope and y intercept, respectively, of the least squares line. That is, $b = S_{xy}/S_{xx}$ is the point estimate of β and $a = \bar{y} - b\bar{x}$ is the point estimate of α, where $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$ and $S_{xx} = \sum (x - \bar{x})^2$.
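
As a sketch of how the least squares estimates are computed, the function below uses the standard formulas b = S_xy / S_xx and a = ȳ - b·x̄, where S_xy = Σ(x - x̄)(y - ȳ) and S_xx = Σ(x - x̄)². The function name and the toy data are assumptions made for illustration; they are not the data from the slides.

```python
def least_squares_line(xs, ys):
    """Return (a, b): intercept and slope of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # point estimate of the slope
    a = y_bar - b * x_bar    # point estimate of the y intercept
    return a, b

# Toy data lying exactly on y = 3 + 2x, so the estimates recover 3 and 2.
a, b = least_squares_line([1, 2, 3, 4], [5, 7, 9, 11])
```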

  12. Interpretation of ŷ = a + bx • Let x* denote a specific value of the predictor variable x. Then a + bx* has two interpretations: • a + bx* is a point estimate of the mean y value when x = x*. • a + bx* is a point prediction of an individual y value to be observed when x = x*.

  13. Example The following data were collected in a study of age and fatness in humans. One of the questions was, “What is the relationship between age and fatness?” * Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual-photon (153Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839.

  14. Example

  15. Example

  16. Example

  17. Example A point estimate for the %Fat for a human who is 45 years old is $\hat{y} = a + b(45) = 3.22 + 0.548(45) \approx 27.9$. If 45 is put into the equation for x, we have both an estimated %Fat for a 45-year-old human and an estimated average %Fat for 45-year-old humans. The two interpretations are quite different.
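
The point estimate can be reproduced from the coefficients reported in the Minitab output for this example (Constant 3.221, Age 0.5480). The helper name `predict` is hypothetical; the sketch simply evaluates a + b·x* at x* = 45.

```python
def predict(a, b, x_star):
    """Evaluate the least squares line a + b*x at x = x_star."""
    return a + b * x_star

A_HAT = 3.221    # estimated y intercept from the Minitab output
B_HAT = 0.5480   # estimated slope from the Minitab output

# Point estimate of mean %Fat at age 45, and also the point prediction
# of %Fat for a single 45-year-old.
fat_at_45 = predict(A_HAT, B_HAT, 45)
```

The same number serves both interpretations; only the uncertainty attached to it differs between estimating a mean and predicting an individual value.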

  18. Example A plot of the data points along with the least squares regression line created with Minitab is given to the right.

  19. Terminology

  20. Definition Formulae The total sum of squares, denoted by SSTo, is defined as $\text{SSTo} = \sum (y - \bar{y})^2$. The residual sum of squares, denoted by SSResid, is defined as $\text{SSResid} = \sum (y - \hat{y})^2$.

  21. Calculation Formulae Recalled SSTo and SSResid are generally found as part of the standard output from most statistical packages or can be obtained using the following computational formulas: $\text{SSTo} = \sum y^2 - \frac{(\sum y)^2}{n}$ and $\text{SSResid} = \sum y^2 - a \sum y - b \sum xy$.
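
A short sketch comparing the definition formulas, SSTo = Σ(y - ȳ)² and SSResid = Σ(y - ŷ)², with the usual computational shortcuts, SSTo = Σy² - (Σy)²/n and SSResid = Σy² - aΣy - bΣxy. The function names and the toy data are assumptions for illustration. Note that the SSResid shortcut is valid only when a and b are the least squares estimates for the same data.

```python
def sums_of_squares(xs, ys, a, b):
    """SSTo and SSResid from the definition formulas."""
    y_bar = sum(ys) / len(ys)
    ss_to = sum((y - y_bar) ** 2 for y in ys)
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return ss_to, ss_resid

def sums_of_squares_shortcut(xs, ys, a, b):
    """SSTo and SSResid from the computational formulas.

    The SSResid shortcut assumes a and b are the least squares estimates.
    """
    n = len(ys)
    sum_y = sum(ys)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    ss_to = sum_y2 - sum_y ** 2 / n
    ss_resid = sum_y2 - a * sum_y - b * sum_xy
    return ss_to, ss_resid

# Toy data; a = 0.0 and b = 1.9 are its least squares estimates.
xs, ys = [1, 2, 3, 4], [2, 4, 5, 8]
by_definition = sums_of_squares(xs, ys, a=0.0, b=1.9)
by_shortcut = sums_of_squares_shortcut(xs, ys, a=0.0, b=1.9)
```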

  22. Coefficient of Determination • The coefficient of determination, denoted by r², gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.
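
The coefficient of determination is conventionally computed as r² = 1 - SSResid/SSTo. The sketch below applies that formula to the sums of squares reported in the Minitab output for the age and %Fat example (SSResid = 529.66, SSTo = 1421.54); the function name is hypothetical.

```python
def coefficient_of_determination(ss_resid, ss_to):
    """Proportion of variation in y attributed to the linear relationship."""
    return 1 - ss_resid / ss_to

# Values taken from the Analysis of Variance table in the Minitab output.
r_squared = coefficient_of_determination(ss_resid=529.66, ss_to=1421.54)
```

The result agrees with the R-Sq value of 62.7% shown in the output.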

  23. Estimated Standard Deviation, s_e The statistic for estimating the variance σ² is $s_e^2 = \frac{\text{SSResid}}{n - 2}$, where $\text{SSResid} = \sum (y - \hat{y})^2$.

  24. Estimated Standard Deviation, s_e The estimate of σ is the estimated standard deviation $s_e = \sqrt{s_e^2}$. The number of degrees of freedom associated with estimating σ² or σ in simple linear regression is n - 2.
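
As a sketch, the estimated standard deviation is the square root of SSResid divided by its n - 2 degrees of freedom. The values below (SSResid = 529.66, n = 18) are read from the Minitab output for the age and %Fat example, where the total df of 17 implies n = 18; the function name is hypothetical.

```python
import math

def estimated_std_dev(ss_resid, n):
    """s_e = sqrt(SSResid / (n - 2)) for simple linear regression."""
    return math.sqrt(ss_resid / (n - 2))

# SSResid and n come from the Minitab Analysis of Variance table.
s_e = estimated_std_dev(ss_resid=529.66, n=18)
```

The result matches the value S = 5.754 printed by Minitab, up to rounding.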

  25. Example continued

  26. Example continued

  27. Example continued With r² = 0.627 or 62.7%, we can say that 62.7% of the observed variation in %Fat can be attributed to the probabilistic linear relationship with human age. The magnitude of a typical sample deviation from the least squares line is about 5.75 (%), which is reasonably large compared to the y values themselves. This suggests that the model is useful only in the sense of providing gross “ballpark” estimates of %Fat for humans based on age.

  28. Properties of the Sampling Distribution of b When the four basic assumptions of the simple linear regression model are satisfied, the following conditions are met: • The mean value of b is β. Specifically, $\mu_b = \beta$, and hence b is an unbiased statistic for estimating β. • The standard deviation of the statistic b is $\sigma_b = \sigma / \sqrt{S_{xx}}$, where $S_{xx} = \sum (x - \bar{x})^2$. • The statistic b has a normal distribution (a consequence of the error e being normally distributed).

  29. Estimated Standard Deviation of b The estimated standard deviation of the statistic b is $s_b = s_e / \sqrt{S_{xx}}$. When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable $t = (b - \beta) / s_b$ is the t distribution with df = n - 2.

  30. Confidence Interval for β When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form $b \pm (t \text{ critical value}) \cdot s_b$, where the t critical value is based on df = n - 2.
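
The interval b ± (t critical value)·s_b can be sketched as follows. The slope and its standard error (0.5480 and 0.1056) are taken from the Minitab output for the age and %Fat example; the critical value 2.12 is an assumption read from a standard t table for 95% confidence and df = 16, and the function name is hypothetical.

```python
def slope_confidence_interval(b, s_b, t_critical):
    """Return (lower, upper) limits of b +/- t_critical * s_b."""
    margin = t_critical * s_b
    return b - margin, b + margin

# b and s_b from the Minitab output; 2.12 is the tabled t critical value
# for 95% confidence with df = n - 2 = 16.
lower, upper = slope_confidence_interval(b=0.5480, s_b=0.1056, t_critical=2.12)
```

Rounded to three decimals, the limits are 0.324 and 0.772, the interval quoted in the example.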

  31. Example continued Recall that b = 0.5480 and s_b = 0.1056, with df = n - 2 = 16. A 95% confidence interval estimate for β is $0.5480 \pm (2.12)(0.1056)$.

  32. Example continued A 95% confidence interval estimate for β is $0.5480 \pm (2.12)(0.1056) = 0.548 \pm 0.224 = (0.324, 0.772)$. Based on sample data, we are 95% confident that the true mean increase in %Fat associated with a year of age is between 0.324% and 0.772%.

  33. Example continued Minitab output looks like

      Regression Analysis: % Fat y versus Age (x)

      The regression equation is
      % Fat y = 3.22 + 0.548 Age (x)

      Predictor    Coef     SE Coef   T      P
      Constant     3.221    5.076     0.63   0.535
      Age (x)      0.5480   0.1056    5.19   0.000

      S = 5.754   R-Sq = 62.7%   R-Sq(adj) = 60.4%

      Analysis of Variance
      Source           DF   SS        MS       F       P
      Regression        1    891.87   891.87   26.94   0.000
      Residual Error   16    529.66    33.10
      Total            17   1421.54

      Callouts on the slide: regression line (the regression equation); estimated y intercept a = 3.221; estimated slope b = 0.5480; residual df = n - 2 = 16; SSResid = 529.66; SSTo = 1421.54.
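
The entries of the Minitab output can be cross-checked against one another. This sketch uses only numbers printed in the output; the variable names are hypothetical. It relies on two standard facts for simple linear regression: the regression and residual sums of squares add up to SSTo, and the F statistic equals the square of the slope's t ratio (both only up to the rounding in the printed values).

```python
# Values copied from the Minitab Analysis of Variance and coefficient tables.
ss_regr, ss_resid, ss_to = 891.87, 529.66, 1421.54
df_resid = 16
b, s_b = 0.5480, 0.1056

ms_resid = ss_resid / df_resid        # mean square for residual error
f_stat = (ss_regr / 1) / ms_resid     # regression has 1 degree of freedom
t_ratio = b / s_b                     # t ratio for the slope

# Gaps that should be near zero, apart from rounding in the printed output.
ss_gap = abs(ss_regr + ss_resid - ss_to)
f_vs_t_gap = abs(f_stat - t_ratio ** 2)
```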

  34. Hypothesis Tests Concerning β Null hypothesis: H0: β = hypothesized value

  35. Hypothesis Tests Concerning β Alternate hypothesis and finding the P-value: • Ha: β > hypothesized value. P-value = area under the t curve with n - 2 degrees of freedom to the right of the calculated t. • Ha: β < hypothesized value. P-value = area under the t curve with n - 2 degrees of freedom to the left of the calculated t.

  36. Hypothesis Tests Concerning β • Ha: β ≠ hypothesized value • If t is positive, P-value = 2 (area under the t curve with n - 2 degrees of freedom to the right of the calculated t). • If t is negative, P-value = 2 (area under the t curve with n - 2 degrees of freedom to the left of the calculated t).
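
The three P-value rules can be sketched without a statistics library by integrating the t density numerically. This is an illustrative approach (Simpson's rule on the standard t density), not the method used by Minitab or the slides, and all function names are hypothetical.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def upper_tail_area(t, df, steps=2000):
    """Area under the t curve to the right of t (steps must be even)."""
    t_abs = abs(t)
    h = t_abs / steps
    total = t_pdf(0.0, df) + t_pdf(t_abs, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3          # Simpson's rule: area between 0 and |t|
    tail = 0.5 - central             # area beyond |t| on one side
    return tail if t >= 0 else 1 - tail

def p_value(t, df, alternative):
    """P-value for Ha 'greater', 'less', or 'two-sided'."""
    if alternative == "greater":
        return upper_tail_area(t, df)
    if alternative == "less":
        return 1 - upper_tail_area(t, df)
    if alternative == "two-sided":
        return 2 * upper_tail_area(abs(t), df)
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")

# The t ratio 5.19 with df = 16 comes from the age and %Fat Minitab output.
p_fat = p_value(5.19, 16, "two-sided")
```

For the age and %Fat example the two-sided P-value is far below 0.001, consistent with the 0.000 that Minitab prints.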

  37. Hypothesis Tests Concerning β Assumptions: • The distribution of e at any particular x value has mean value 0 (µe = 0). • The standard deviation of e is σ, which does not depend on x. • The distribution of e at any particular x value is normal. • The random deviations e1, e2, …, en associated with different observations are independent of one another.

  38. Hypothesis Tests Concerning β Quite often the test is performed with the hypotheses H0: β = 0 vs. Ha: β ≠ 0. This particular form of the test is called the model utility test for simple linear regression. The test statistic simplifies to $t = b / s_b$ and is called the t ratio. The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y.
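
A minimal sketch of the test statistic: in general t = (b - hypothesized value)/s_b, which reduces to the t ratio b/s_b when the hypothesized value is 0. The function name is hypothetical; the slope estimates and standard errors are the ones printed in the two Minitab outputs in this chapter.

```python
def t_ratio(b, s_b, hypothesized=0.0):
    """Test statistic for H0: beta = hypothesized; the t ratio when it is 0."""
    return (b - hypothesized) / s_b

# Age and %Fat example: Coef 0.5480, SE Coef 0.1056.
t_fat = t_ratio(0.5480, 0.1056)      # rounds to 5.19

# Unemployment and suicide rate example: Coef 59.05, SE Coef 14.24.
t_suicide = t_ratio(59.05, 14.24)    # rounds to 4.15
```

Both values agree with the T column that Minitab reports for the slope.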

  39. Example Consider the following data on percentage unemployment and suicide rates. * Smith, D. (1977) Patterns in Human Geography, Canada: Douglas David and Charles Ltd., 158.

  40. Example The plot of the data points produced by Minitab follows.

  41. Example

  42. Example Some basic summary statistics

  43. Example Continuing with the calculations

  44. Example Continuing with the calculations

  45. Example

  46. Example - Model Utility Test • β = the true average change in suicide rate associated with an increase in the unemployment rate of 1 percentage point • H0: β = 0 • Ha: β ≠ 0 • α has not been preselected. We shall interpret the observed level of significance (P-value). • Test statistic: $t = b / s_b$

  47. Example - Model Utility Test • Assumptions: The following plot (Minitab) of the data shows a linear pattern and the variability of points does not appear to be changing with x. Assuming that the distribution of errors (residuals) at any given x value is approximately normal, the assumptions of the simple linear regression model are appropriate.

  48. Example - Model Utility Test • Calculation: using the slope estimate and its standard error from the Minitab output, $t = 59.05 / 14.24 = 4.15$. • P-value: The table of tail areas for t distributions only has t values ≤ 4, so we can see that the corresponding tail area is < 0.002. Since this is a two-tail test, the P-value < 0.004. (Actual calculation gives a P-value = 0.002.)

  49. Example - Model Utility Test • Conclusion: Even though no specific significance level was chosen for the test, with the P-value being so small (< 0.004) one would generally reject the null hypothesis that β = 0 and conclude that there is a useful linear relationship between the % unemployed and the suicide rate.

  50. Example - Minitab Output

      Regression Analysis: Suicide Rate (y) versus Percentage Unemployed (x)

      The regression equation is
      Suicide Rate (y) = - 93.9 + 59.1 Percentage Unemployed (x)

      Predictor   Coef     SE Coef   T       P
      Constant    -93.86   51.25     -1.83   0.100
      Percenta     59.05   14.24      4.15   0.002

      S = 36.06   R-Sq = 65.7%   R-Sq(adj) = 61.8%

      Callouts on the slide: the T value (4.15) and P-value (0.002) in the Percenta row are those for the model utility test of H0: β = 0 vs. Ha: β ≠ 0.
