Sociology 601 Class 21: November 10, 2009

• Review
  • formulas for b and se(b)
  • Stata regression commands & output
• Violations of Model Assumptions, and their effects (9.6)
• Causality (10)

Formulas for b, a, r, and se(b)
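The formulas themselves did not survive on this slide. As a hedged reconstruction, the standard least-squares expressions are given below; they agree with the Stata numbers on the following slides. The notation is an assumption rather than the slide's own: s_xy is the sample covariance, s_x and s_y are the sample standard deviations, and s is the root mean squared error.

```latex
\begin{align*}
b &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
   = \frac{s_{xy}}{s_x^2} \\
a &= \bar{y} - b\,\bar{x} \\
r &= b\,\frac{s_x}{s_y} = \frac{s_{xy}}{s_x\, s_y} \\
se(b) &= \frac{s}{\sqrt{\sum_i (x_i - \bar{x})^2}},
\qquad s = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2}}
\end{align*}
```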

Stata Example of Inference about a Slope
. summarize murder poverty
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      murder |        51    8.727451    10.71758        1.6       78.5
     poverty |        51    14.25882    4.584242          8       26.4
. regress murder poverty
      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  1,    49) =   23.08
       Model |  1839.06931     1  1839.06931           Prob > F      =  0.0000
    Residual |  3904.25223    49  79.6786169           R-squared     =  0.3202
-------------+------------------------------           Adj R-squared =  0.3063
       Total |  5743.32154    50  114.866431           Root MSE      =  8.9263
------------------------------------------------------------------------------
      murder |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |    1.32296   .2753711     4.80   0.000     .7695805    1.876339
       _cons |   -10.1364   4.120616    -2.46   0.017    -18.41708   -1.855707
------------------------------------------------------------------------------

Stata Example of Inference about a Slope
. correlate murder poverty
(obs=51)
             |   murder  poverty
-------------+------------------
      murder |   1.0000
     poverty |   0.5659   1.0000
. correlate murder poverty, covariance
(obs=51)
             |   murder  poverty
-------------+------------------
      murder |  114.866
     poverty |  27.8024  21.0153
sqrt(114.866) = 10.72 = sd(y); sqrt(21.0153) = 4.58 = sd(x)

Alternative Formula for b
b = cov(x, y) / var(x) = 27.8024 / 21.0153 = 1.323
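As a quick check, the pieces of the regression output can be rebuilt from the summary statistics printed on the earlier slides. The sketch below is in Python rather than the Stata used in class, and it assumes the standard formulas b = cov(x, y) / var(x), a = mean(y) - b * mean(x), r = cov(x, y) / (sd(x) * sd(y)), and se(b) = Root MSE / sqrt((n - 1) * var(x)). The variable names are illustrative only.

```python
import math

# Figures copied from the Stata output on the earlier slides.
n = 51
mean_y, mean_x = 8.727451, 14.25882   # means of murder (y) and poverty (x)
var_y, var_x = 114.866, 21.0153       # variances from the covariance matrix
cov_xy = 27.8024                      # covariance of murder and poverty
root_mse = 8.9263                     # Root MSE from the regress output

b = cov_xy / var_x                            # slope
a = mean_y - b * mean_x                       # intercept
r = cov_xy / math.sqrt(var_x * var_y)         # correlation
se_b = root_mse / math.sqrt((n - 1) * var_x)  # standard error of the slope
t = b / se_b                                  # t statistic for H0: slope = 0

print(f"b = {b:.4f}, a = {a:.4f}, r = {r:.4f}, se(b) = {se_b:.4f}, t = {t:.2f}")
```

The results should match the regress and correlate output to roughly three decimals; small differences come from rounding in the printed inputs.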

Stata Example of Inference about a Slope scatter murder poverty || lfit murder poverty

Stata Example of Inference about a Slope
. regress murder poverty if state!="DC"
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =   31.36
       Model |  307.342297     1  307.342297           Prob > F      =  0.0000
    Residual |  470.406476    48  9.80013492           R-squared     =  0.3952
-------------+------------------------------           Adj R-squared =  0.3826
       Total |  777.748773    49  15.8724239           Root MSE      =  3.1305
------------------------------------------------------------------------------
      murder |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |   .5842405    .104327     5.60   0.000     .3744771    .7940039
       _cons |  -.8567153   1.527798    -0.56   0.578     -3.92856    2.215129
------------------------------------------------------------------------------

Assumptions Needed to Make Population Inferences for Slopes
• The sample is selected randomly.
• X and Y are interval scale variables.
• The mean of Y is related to X by the linear equation E{Y} = α + βX.
• The conditional standard deviation of Y is identical at each X value (no heteroscedasticity).
• The conditional distribution of Y at each value of X is normal.
• There is no error in the measurement of X.

Common Ways to Violate These Assumptions
• The sample is selected randomly.
  • Cluster sampling (e.g., census tracts / neighborhoods) makes observations within a cluster more similar to each other than to observations outside the cluster.
  • Autocorrelation (spatial and temporal)
  • Two or more siblings in the same family
  • Sample = population (e.g., states in the U.S.)
• X and Y are interval scale variables.
  • Ordinal scale attitude measures
  • Nominal scale categories (e.g., race/ethnicity, religion)

Common Ways to Violate These Assumptions (2)
• The mean of Y is related to X by the linear equation E{Y} = α + βX.
  • U-shape: e.g., Kuznets inverted-U curve (inequality <- GDP/capita)
  • Thresholds:
  • Logarithmic (e.g., earnings <- education)
• The conditional standard deviation of Y is identical at each X value (no heteroscedasticity).
  • earnings <- education
  • hours worked <- years
  • adult child occupational status <- parental occupational status

Common Ways to Violate These Assumptions (3)
• The conditional distribution of Y at each value of X is normal.
  • earnings (skewed) <- education
  • Y is binary
  • Y is a percentage
• There is no error in the measurement of X.
  • almost everything is measured with some error
  • what is the effect of measurement error in x on b?
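One way to explore the last question is a small simulation. The sketch below (Python, not the Stata used in class) assumes classical measurement error: the observed x equals the true x plus random noise that is unrelated to both the true x and y. Under that assumption the estimated slope is expected to shrink toward zero by the factor var(true x) / (var(true x) + var(noise)). All numbers here are made up for illustration.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(601)   # fixed seed so the run is repeatable
n = 5000

# Hypothetical population: true slope of 2, error sd of 1.
true_x = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * x + rng.gauss(0, 1) for x in true_x]

# Observed x = true x + noise with the same variance as true x,
# so the expected shrinkage factor is 1 / (1 + 1) = 0.5.
noisy_x = [x + rng.gauss(0, 1) for x in true_x]

b_true = ols_slope(true_x, y)
b_noisy = ols_slope(noisy_x, y)
print(f"slope using true x:  {b_true:.3f}")
print(f"slope using noisy x: {b_noisy:.3f}")
```

With these settings the slope on the noisy x should land near 1 rather than 2, which is the attenuation the classical model predicts.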

Things to watch out for: extrapolation
• Extrapolation beyond observed values of X is dangerous.
  • The pattern may be nonlinear.
  • Even if the pattern is linear, the standard errors of predictions grow as X moves farther from its mean.
• Be especially careful interpreting the Y-intercept: it may lie outside the observed data.
  • e.g., year zero
  • e.g., zero education in the U.S.
  • e.g., zero parity
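To see how quickly precision degrades, the sketch below (Python, not the Stata used in class) applies the standard formula for the standard error of the estimated mean of Y at a given x0, se = s * sqrt(1/n + (x0 - mean(x))^2 / Sxx), to the murder and poverty figures from the earlier slides. The value 40 is a hypothetical poverty rate chosen only because it lies well beyond the observed maximum of 26.4.

```python
import math

# Figures copied from the Stata output on the earlier slides.
n = 51
mean_x = 14.25882          # mean poverty rate
var_x = 21.0153            # variance of poverty
root_mse = 8.9263          # Root MSE from regress murder poverty
sxx = (n - 1) * var_x      # sum of squared deviations of x

def se_fitted_mean(x0):
    """Standard error of the estimated mean of Y at X = x0."""
    return root_mse * math.sqrt(1 / n + (x0 - mean_x) ** 2 / sxx)

se_at_mean = se_fitted_mean(mean_x)   # smallest possible value
se_in_range = se_fitted_mean(26.4)    # largest observed poverty rate
se_far = se_fitted_mean(40.0)         # hypothetical value beyond the data

print(f"{se_at_mean:.2f}  {se_in_range:.2f}  {se_far:.2f}")
```

The standard error is smallest at the mean of X and grows in both directions, so intervals far outside the data are wide even when the linear form is correct.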

Things to watch out for: outliers
• Influential observations and outliers may unduly influence the fit of the model.
• The slope and standard error of the slope may be affected by influential observations.
• This is an inherent weakness of least squares regression.
• You may wish to evaluate two models: one with and one without the influential observations.
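The "two models" comparison can be sketched with a toy data set. The example below is in Python rather than Stata, and every data point is invented for illustration: six points lie exactly on a line with slope 0.5, and a seventh point is extreme on both X and Y.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: y = 0.5 * x - 1 exactly for these six points.
x = [8, 10, 12, 14, 16, 18]
y = [3, 4, 5, 6, 7, 8]

# One made-up observation that is extreme on both x and y.
outlier_x, outlier_y = 26, 78

slope_without = ols_slope(x, y)
slope_with = ols_slope(x + [outlier_x], y + [outlier_y])

print(f"slope without the extreme point: {slope_without:.3f}")
print(f"slope with the extreme point:    {slope_with:.3f}")
```

A single point far from the rest on X has high leverage, which is why it can move the fitted slope this much.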

Things to watch out for: truncated samples
• Truncated samples cause the opposite problem from influential observations and outliers.
• Truncation on the X axis reduces the correlation coefficient for the remaining data.
• Truncation on the Y axis is a worse problem, because it violates the assumption of normally distributed errors.
• Examples: top-coded income data; health as measured by number of days spent in a hospital in a year.
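The effect of truncation on X can be illustrated with simulated data. The sketch below is in Python rather than Stata and uses made-up values: y equals x plus noise, and the truncated sample keeps only cases with x within half a standard deviation of its mean. The cutoff of 0.5 is an arbitrary assumption.

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation between xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(601)   # fixed seed so the run is repeatable

# Hypothetical population: slope of 1, error sd of 1.
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [xi + rng.gauss(0, 1) for xi in x]

# Truncate on X: keep only cases in a narrow band around the mean.
kept = [(xi, yi) for xi, yi in zip(x, y) if abs(xi) < 0.5]
kept_x = [p[0] for p in kept]
kept_y = [p[1] for p in kept]

r_full = correlation(x, y)
r_trunc = correlation(kept_x, kept_y)
print(f"r in full sample:      {r_full:.3f}")
print(f"r in truncated sample: {r_trunc:.3f}")
```

The underlying relationship is unchanged, but restricting the range of X leaves less variation in X relative to the noise, so the correlation falls.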

Causality
• We never prove that x causes y
• Research and theory make it increasingly likely
• Criteria:
  • association
  • time order
  • no alternative explanations
    • is the relationship spurious?


Alternative Explanations
• Example: Neighborhood poverty -> Low Test Scores
• Possible solutions:
  • multivariate models
    • e.g., control for parents’ education, income
    • controls for other measurable differences
  • fixed effects models
    • e.g., changes in poverty -> changes in test scores
    • controls for constant, unmeasured differences
  • instrumental variables
    • find an instrument that affects x1 but has no direct effect on y
  • experiments
    • e.g., Moving to Opportunity
    • randomize increases in $


Alternative Explanations
• Example: Fertility -> Lower Mothers’ LFP
• Possible solutions:
  • multivariate models
    • e.g., control for gender attitudes
    • controls for other measurable differences
  • fixed effects models
    • e.g., changes in # children -> dropping out
    • controls for constant, unmeasured differences
  • instrumental variables
    • find an instrument that affects x1 but has no direct effect on y
    • e.g., mothers of two same-sex children
  • experiments
    • not feasible (or ethical)

Types of 3-variable Causal Models
• Spurious
  • x2 causes both x1 and y
  • e.g., religion causes fertility and women’s LFP
• Intervening
  • x1 causes x2 which causes y
  • e.g., fertility raises time spent on children which lowers time in the labor force
• What is the statistical difference between these?

Another type of 3-variable relationship: Statistical Interaction Effects
• Example: Fertility -> Lower Mothers’ LFP
• The relationship between x1 and y depends on the value of another variable, x2
• e.g., marital status -> earnings depends on gender
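One common way to express an interaction, sketched here under the assumption of a linear model with a product term, is shown below. The coefficient names are illustrative and do not come from the slide.

```latex
\begin{align*}
E(y) &= \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 \\
\text{slope of } x_1 \text{ at a given } x_2 &= \beta_1 + \beta_3 x_2
\end{align*}
```

If the product-term coefficient is zero, the slope of x1 is the same at every value of x2 and there is no interaction. In the marital status example, with x2 coded 0 and 1 for the two gender categories, the two groups would have slopes of β1 and β1 + β3.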
