Regression: (2) Multiple Linear Regression and Path Analysis
Hal Whitehead, BIOL4062/5062
Multiple Linear Regression and Path Analysis
• Multiple linear regression
  • assumptions
  • parameter estimation
  • hypothesis tests
  • selecting independent variables
  • collinearity
  • polynomial regression
• Path analysis
Regression
• One dependent variable: Y
• Independent variables: X(1), X(2), X(3), ...
Purposes of Regression
1. Relationship between Y and the X's
2. Quantitative prediction of Y
3. Relationship between Y and X controlling for C
4. Which of the X's are most important?
5. Best mathematical model
6. Compare regression relationships: Y1 on X, Y2 on X
7. Assess interactive effects of the X's
• Simple regression: one X
• Multiple regression: two or more X's
Y = ß0 + ß1⋅X(1) + ß2⋅X(2) + ß3⋅X(3) + ... + ßk⋅X(k) + E
Multiple linear regression: assumptions (1)
• For any specific combination of X's, Y is a (univariate) random variable with a certain probability distribution having finite mean and variance (Existence)
• Y values are statistically independent of one another (Independence)
• The mean value of Y given the X's is a linear function of the X's (Linearity)
Multiple linear regression: assumptions (2)
• The variance of Y is the same for any fixed combination of X's (Homoscedasticity)
• For any fixed combination of X's, Y has a normal distribution (Normality)
• There are no measurement errors in the X's (X's measured without error)
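These assumptions can be checked informally from the residuals of a fitted model. Below is a minimal sketch (not from the slides), assuming Python with numpy, statsmodels and scipy available; the data are made up for illustration.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # placeholder predictors (three X's, 50 cases)
y = 1.0 + X @ np.array([0.5, -0.2, 0.8]) + rng.normal(scale=0.3, size=50)

fit = sm.OLS(y, sm.add_constant(X)).fit()
resid = fit.resid

# Normality of residuals (Shapiro-Wilk test)
print("Shapiro-Wilk p =", stats.shapiro(resid).pvalue)

# Homoscedasticity: residual spread should not trend with fitted values
print("corr(|resid|, fitted) =", np.corrcoef(np.abs(resid), fit.fittedvalues)[0, 1])
```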
Multiple linear regression: parameter estimation
Y = ß0 + ß1⋅X(1) + ß2⋅X(2) + ß3⋅X(3) + ... + ßk⋅X(k) + E
• The ß's are estimated by least squares
• The sizes of the coefficients are not good indicators of the importance of the X variables
• Number of data points in multiple regression:
  • at least one more than the number of X's
  • preferably at least 5 times the number of X's
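As a concrete illustration of least-squares estimation, the ß's can be computed directly from a design matrix with numpy.linalg.lstsq. This is a minimal sketch with made-up data, not the brain-size example analysed below.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 2
X = rng.normal(size=(n, k))
y = 2.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for ß0
design = np.column_stack([np.ones(n), X])

# Least-squares estimates of ß0, ß1, ..., ßk
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print(beta_hat)   # approximately [2.0, 1.5, -0.7]
```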
Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
Multiple regression of Y [Log(CNS)] on:

  X            ß      SE(ß)
  Log(Mass)   -0.49   (0.70)
  Log(Fat)    -0.07   (0.10)
  Log(Muscle)  1.03   (0.54)
  Log(Heart)   0.42   (0.22)
  Log(Bone)   -0.07   (0.30)

N = 39
Multiple linear regression: hypothesis tests
Usually test:
H0: Y = ß0 + ß1⋅X(1) + ß2⋅X(2) + ... + ßj⋅X(j) + E
H1: Y = ß0 + ß1⋅X(1) + ß2⋅X(2) + ... + ßj⋅X(j) + ... + ßk⋅X(k) + E
F-test with k-j and n-k-1 degrees of freedom ("partial F-test")
H0: variables X(j+1), ..., X(k) do not help explain variability in Y
Multiple linear regression: hypothesis tests
e.g. test the significance of the overall multiple regression:
H0: Y = ß0 + E
H1: Y = ß0 + ß1⋅X(1) + ß2⋅X(2) + ... + ßk⋅X(k) + E
• Test the significance of:
  • adding an independent variable
  • deleting an independent variable
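A partial F-test can be computed from the residual sums of squares of the restricted and full models. A minimal sketch (made-up data; statsmodels and scipy assumed available):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n = 60
X = rng.normal(size=(n, 3))                      # X(1), X(2), X(3)
y = 1.0 + 0.8 * X[:, 0] + rng.normal(size=n)     # only X(1) really matters

full = sm.OLS(y, sm.add_constant(X)).fit()               # k = 3 predictors
restricted = sm.OLS(y, sm.add_constant(X[:, :1])).fit()  # j = 1 predictor

k, j = 3, 1
num = (restricted.ssr - full.ssr) / (k - j)      # extra sum of squares per added term
den = full.ssr / (n - k - 1)                     # full-model mean square error
F = num / den
p = stats.f.sf(F, k - j, n - k - 1)
print(F, p)   # H0: X(2) and X(3) do not help explain variability in Y
```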
Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
Multiple regression of Y [Log(CNS)] on:

  X            ß      SE(ß)   P
  Log(Mass)   -0.49   (0.70)  0.49
  Log(Fat)    -0.07   (0.10)  0.52
  Log(Muscle)  1.03   (0.54)  0.07
  Log(Heart)   0.42   (0.22)  0.06
  Log(Bone)   -0.07   (0.30)  0.83

Each P tests whether removal of the variable reduces the fit.
Multiple linear regression: selecting independent variables
Reasons for selecting a subset of the independent variables (X's):
• cost (financial and other)
• simplicity
• improved prediction
• improved explanation
Multiple linear regression: selecting independent variables
• Partial F-test
  • predetermined forward selection (e.g. Mass, Bone, Heart, Muscle, Fat)
  • forward selection based upon improvement in fit
  • backward selection based upon improvement in fit
  • stepwise (backward/forward)
• Mallows' C(p)
• AIC
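A rough sketch of forward selection in Python (hypothetical variable names; statsmodels and pandas assumed available), adding at each step the variable that most improves the fit, judged here by AIC rather than a partial F-test:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 50
data = pd.DataFrame(rng.normal(size=(n, 4)), columns=["Mass", "Muscle", "Heart", "Fat"])
y = 0.2 + 1.2 * data["Muscle"] + 0.4 * data["Heart"] + rng.normal(scale=0.3, size=n)

selected, remaining = [], list(data.columns)
best_aic = sm.OLS(y, np.ones(n)).fit().aic            # constant-only model

while remaining:
    # AIC of each candidate model formed by adding one more variable
    aics = {v: sm.OLS(y, sm.add_constant(data[selected + [v]])).fit().aic
            for v in remaining}
    candidate = min(aics, key=aics.get)
    if aics[candidate] >= best_aic:                    # no improvement: stop
        break
    best_aic = aics[candidate]
    selected.append(candidate)
    remaining.remove(candidate)

print(selected, best_aic)
```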
Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
• Complete model (r² = 0.97)
• Forward stepwise (α-to-enter = 0.15; α-to-remove = 0.15):
  1. Constant (r² = 0.00)
  2. Constant + Muscle (r² = 0.97)
  3. Constant + Muscle + Heart (r² = 0.97)
  4. Constant + Muscle + Heart + Mass (r² = 0.97)
• Selected model: -0.18 - 0.82⋅Mass + 1.24⋅Muscle + 0.39⋅Heart
Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
• Complete model (r² = 0.97)
• Backward stepwise (α-to-enter = 0.15; α-to-remove = 0.15):
  1. All variables (r² = 0.97)
  2. Remove Bone (r² = 0.97)
  3. Remove Fat (r² = 0.97)
• Selected model: -0.18 - 0.82⋅Mass + 1.24⋅Muscle + 0.39⋅Heart
Comparing models
• Mallows' C(p)
  • C(p) = (k-p)⋅F(p) + (2p-k+1)
  • k parameters in the full model; p parameters in the restricted model
  • F(p) is the F value comparing the fit of the restricted model with that of the full model
  • the model with the lowest C(p) is best
• Akaike Information Criterion (AIC)
  • AIC = n⋅Log(σ²) + 2p
  • the model with the lowest AIC is best
  • can compare models that are not nested within one another
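A small sketch of this form of AIC in Python (made-up data; it uses the n⋅Log(σ²) + 2p version given above, with σ² taken as the mean squared residual, so only differences between models are meaningful):

```python
import numpy as np

def aic(y, design):
    """AIC = n*log(sigma^2) + 2p, with sigma^2 = SSE/n and p = number of parameters."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    n, p = design.shape
    return n * np.log(resid @ resid / n) + 2 * p

rng = np.random.default_rng(4)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.9 * x1 + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
print(aic(y, np.column_stack([ones, x1])))        # model with X(1) only
print(aic(y, np.column_stack([ones, x1, x2])))    # adding X(2): expect a higher AIC
```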
Collinearity
• If two (or more) X's are linearly related, they are collinear and the regression problem is indeterminate
  e.g. X(3) = 5⋅X(2) + 16, or X(2) = 4⋅X(1) + 16⋅X(4)
• If they are nearly linearly related (near collinearity), coefficients and tests are very inaccurate
What to do about collinearity?
• Centering (mean = 0)
• Scaling (SD = 1)
• Regression on the first few principal components
• Ridge regression
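Near collinearity can be screened for before choosing a remedy; one standard diagnostic (not mentioned on the slide) is the variance inflation factor. A rough sketch, assuming only numpy:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X: 1/(1 - R^2) from regressing
    that column on all the other columns (values well above ~10 suggest trouble)."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = 4 * x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
print(vif(np.column_stack([x1, x2, x3])))       # large VIFs for x1 and x2
```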
Curvilinear (Polynomial) Regression
Y = ß0 + ß1⋅X + ß2⋅X² + ß3⋅X³ + ... + ßk⋅Xᵏ + E
• Used to fit fairly complex curves to data
• ß's estimated using least squares
• Use sequential partial F-tests, or AIC, to decide how many terms to include
• k > 3 is rare in biology
• Better to transform the data and use simple linear regression, when possible
Curvilinear (Polynomial) Regression (example from Sokal and Rohlf)
Y = 0.066 + 0.00727⋅X
Y = 0.117 + 0.00085⋅X + 0.00009⋅X²
Y = 0.201 - 0.01371⋅X + 0.00061⋅X² - 0.000005⋅X³
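Polynomial fits of increasing order can be obtained by least squares on successive power terms (numpy.polyfit does this directly). A minimal sketch with made-up data, not the Sokal and Rohlf example:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 40)
y = 0.1 + 0.02 * x + 0.005 * x**2 + rng.normal(scale=0.05, size=x.size)

# Fit polynomials of order 1, 2 and 3 by least squares and compare residual fit
for k in (1, 2, 3):
    coefs = np.polyfit(x, y, deg=k)        # highest power first
    fitted = np.polyval(coefs, x)
    sse = np.sum((y - fitted) ** 2)
    print(k, np.round(coefs[::-1], 5), "SSE =", round(sse, 4))
```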
Path Analysis
[path diagram with variables A, B, C, D, E]
• Models with causal structure
• Represented by a path diagram
• All variables quantitative
• All path relationships assumed linear (transformations may help)
Path Analysis
[path diagram with variables A, B, C, D, E and residual U]
• All paths are one-way (A => C or C => A, not both)
• No loops
• Some variables may not be directly observed:
  • residual variables (U)
  • variables not observed but known to exist: latent variables (D)
Path Analysis
• Path coefficients and other statistics are calculated using multiple regressions
• Variables are:
  • centered (mean = 0), so there are no constants in the regressions
  • often standardized (SD = 1), so path coefficients usually lie between -1 and +1
• Paths with coefficients not significantly different from zero may be eliminated
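As a rough sketch (a hypothetical three-variable causal structure, not the diagram from the slides), path coefficients can be obtained from least-squares regressions on standardized variables:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
A = rng.normal(size=n)
B = 0.6 * A + rng.normal(scale=0.8, size=n)              # A -> B
C = 0.5 * B + 0.3 * A + rng.normal(scale=0.7, size=n)    # A -> C and B -> C

def standardize(v):
    return (v - v.mean()) / v.std(ddof=1)

A, B, C = map(standardize, (A, B, C))   # centered and scaled: no constants needed

# Path coefficients: regress each endogenous variable on its direct causes
p_AB, *_ = np.linalg.lstsq(A.reshape(-1, 1), B, rcond=None)          # A -> B
p_C, *_ = np.linalg.lstsq(np.column_stack([A, B]), C, rcond=None)    # A -> C, B -> C
print("A->B:", p_AB[0], " A->C:", p_C[0], " B->C:", p_C[1])
```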
Path Analysis: an example • Isaak and Hubert. 2001. “Production of stream habitat gradients by montane watersheds: hypothesis tests based on spatially explicit path analyses” Can. J. Fish. Aquat. Sci.
[Path diagram from Isaak and Hubert (2001): dashed lines = predicted negative interaction; solid lines = predicted positive interaction]