Stats 330: Lecture 30

Stats 330: Lecture 30 Association Reversal and Simpson's Paradox

Plan of the day In today’s lecture we continue our discussion of contingency tables. Topics • Simpson’s Paradox • Collapsing tables • Adequacy of the chi-square approximation See also Tutorial 10 (can download)

Example: Florida murders If we “collapse” the table over victim, we get the 2x 2 table of defendant by death penalty: Odds of being black for “no DP group” are 149/141 Odds of being black for “DP group” is 17/19 OR = (149/141)/(17/19) = (149*19)/(141*17) = 1.181 The odds of being black are 18% higher for the no DP group than for the DP group i.e. blacks favoured

Example: Florida murders The tables conditional on victim are Victim=black, OR =0.79 (add 0.5 to each cell) Victim=white, OR =0.67 Now odds of being black are greater in the DP group!

Association Reversal In our regression lectures, we saw that regression coefficients could change sign when new variables were added to the model. i.e. the coefficient of x1 in the model y ~ x1 can have the opposite sign to the coefficient of x1 in the model y ~ x1 + x2 • See e.g. Lecture 5, Slides 12 and 13.

The reason • The coefficient of x1 in the model y~x1 is the slope of the fitted line in the scatter plot, summarising the marginal relationship between y and x1 • The coefficient of x1 in the model y~x1+x2 is the slope of the line in the coplot, which summarises the relationship between y and x1,conditional on x2

Marginal and conditional association • In the first case, we are looking at association between y and x1 • In the second case, we are looking at the association between y and x1, with x2 held fixed (conditional on x2) • We can do the same sort of thing in contingency tables

Association in contingency tables • In 2 x 2 tables, we measure association using the odds ratio. • If we have 3 factors A, B and C, each at 2 levels, we can look at the marginal odds ratio of A and B, using the AB table, ignoring C. • Or, the 2 conditional odds ratios, one the AB table corresponding to C=1, and the other corresponding to C=2.

When will these be the same? • In ordinary regression, the coefficient of x1 in the model y~x1 will be the same as the coefficient of x1 in the model y~x1+x2 if and only if x1 and x2 are uncorrelated. • In contingency tables, the marginal and conditional population OR’s for the AB tables will be the same if • A and Care independent, given B, or, • B and Care independent, given A. • Thus, we can collapse the table over C if • The ABC interaction is zero, and either the AC interaction is zero, or the BC interaction is zero. • If this is not the case, association reversal ( Simpson’s paradox) may occur. The more associated A and/or B is with C, the more likely it is to happen.

Florida data: • Very strong association between DP and race of victim: if victim is black, small chance of DP • Very strong association between race of victim and race of defendant • Since blacks murder blacks, (and whites murder whites), not so many DPs for black defendants overall

When can we collapse? • Thus, we can collapse the table over C if • The ABC interaction is zero, and • Either the AC interaction is zero, or • The BC interaction is zero. If these conditions are not met, then the marginal and conditional tables may have different (even reversed) degrees of association

Florida data Recall that there are significant interactions between Race of victim and race of defendant Race of victim and death penalty Thus, can’t collapse over race of victim Call: glm(formula = counts ~ defendant * dp * victim, family = poisson, data = murder.df) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 5.27300 0.07161 73.633 < 2e-16 *** defendantw -2.32856 0.24033 -9.689 < 2e-16 *** dpy -2.70805 0.28645 -9.454 < 2e-16 *** victimw -0.61904 0.12105 -5.114 3.15e-07 *** defendantw:dpy -0.23639 1.06521 -0.222 0.82438 defendantw:victimw 3.25433 0.26657 12.208 < 2e-16 *** dpy:victimw 1.18958 0.36750 3.237 0.00121 ** defendantw:dpy:victimw -0.16131 1.10322 -0.146 0.88375

In graphical form B Can collapse over B or C but not A A C

Example: Berkeley admissions • Admission data from UC Berkeley grad school: • OR’s are (Odds admission male/odds admission female)

Collapse over departments • OR = 1.84 (odds admission male/ odds admission female) • Seems to be strong evidence of bias against women • Reason: females apply for programs that have low admission rates, males for programs that have high admission rates, so strong dependence between dept and the other 2 factors

Anova Table Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 23 2650.10 Dept 5 159.52 18 2490.57 1.251e-32 Gender 1 162.87 17 2327.70 2.665e-37 Admit 1 230.03 16 2097.67 5.879e-52 Dept:Gender 5 1220.61 11 877.06 1.006e-261 Dept:Admit 5 855.32 6 21.74 1.242e-182 Gender:Admit 1 1.53 5 20.20 0.22 Dept:Gender:Admit 5 20.20 0 7.239e-14 1.144e-03

Thus • 3 factor interaction is significant, so • Dept is not conditionally independent of admission, given gender • Dept is not conditionally independent of gender, given admission • Collapse over Dept at your peril!

Moral: Collapsing tables • Dangerous to collapse tables when factor being collapsed over (eg victim’s race, Berkeley department) is strongly associated with the other factors • Marginal (collapsed) tables may give a very misleading picture, need to look at conditional tables

How good is chi-square? • The bigger the cell counts, the better the chi-square approximation to the deviance • The smaller the number of cells, the better the approximation • But approximation is often OK even if some cells counts are small: our next example illustrates this.

The Dayton Study The data in this case study were gathered by researchers investigating teenage substance abuse in a community near Dayton, Ohio in 1992. • The researchers asked 2276 high school students if they had ever used alcohol, cigarettes or marijuana.

The data A 3-dimensional contingency table:

Making the data frame: > counts = c(911,3,44,2,538,43,456,279) > ACM.df = data.frame(counts, expand.grid(A=c("Yes","No"), C=c("Yes","No"), M=c("Yes","No"))) > ACM.df counts A C M 1 911 Yes Yes Yes 2 3 No Yes Yes 3 44 Yes No Yes 4 2 No No Yes 5 538 Yes Yes No 6 43 No Yes No 7 456 Yes No No 8 279 No No No

Fitting models ham.glm = glm(counts~A*C*M, family=poisson, data=ACM.df) > anova(ACM.glm, test="Chisq") Analysis of Deviance Table Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 7 2851.46 A 1 1281.71 6 1569.75 1.064e-280 C 1 227.81 5 1341.93 1.787e-51 M 1 55.91 4 1286.02 7.575e-14 A:C 1 442.19 3 843.83 3.607e-98 A:M 1 346.46 2 497.37 2.504e-77 C:M 1 497.00 1 0.37 4.283e-110 A:C:M 1 0.37 0 2.509e-14 0.54

Try the indicated homogeneous association model > ham.glm = glm(counts~A*C+A*M+C*M, family=poisson, data=ACM.df) > summary(ham.glm) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 6.81387 0.03313 205.699 < 2e-16 *** ANo -5.52827 0.45221 -12.225 < 2e-16 *** CNo -3.01575 0.15162 -19.891 < 2e-16 *** MNo -0.52486 0.05428 -9.669 < 2e-16 *** ANo:CNo 2.05453 0.17406 11.803 < 2e-16 *** ANo:MNo 2.98601 0.46468 6.426 1.31e-10 *** CNo:MNo 2.84789 0.16384 17.382 < 2e-16 *** Null deviance: 2851.46098 on 7 degrees of freedom Residual deviance: 0.37399 on 1 degrees of freedom

Summary • The model counts~ A*C + C*M + A*M seems to fit well • This is the homogeneous association model, can write counts~ (C+A+M)^2 Or counts~ A*C*M – A:C:M with all possible 2-way interactions but no 3-way interaction

Consider the 3-dimensional table: it had two small counts We fitted a model to these data having all 2 factor interactions but no 3-factor interactions. We can estimate the Poisson means using predict Is Chi-square adequate? > means = predict(ham.glm, type="response") > means 1 2 3 4 5 6 7 8 910.383170 3.616830 44.616830 1.383170 538.616830 42.383170 455.383170 279.616830

Is Chisquare adequate (2) • Assuming these estimated means are the true ones, we can simulate a table by generating 8 new table entries. • Thus, for example, if cell 1 has assumed mean 910.383, we generate an entry for cell 1 by selecting a value at random from a Poisson distribution with mean 910.38 > rpois(1, 910.38) [1] 969 • Repeat for every cell, get a simulated table • Calculate the deviance of the model for the simulated table • Repeat for 10000 tables, record deviance each time.

Results > 1-pchisq(0.37399,1) [1] 0.5408374 > mean(result > 0.37399) [1] 0.6006 Not too bad! Deviance = 0.37399

Code means = predict(ham.glm, type="response") Nsim=10000 # generate 10000 tables data<-matrix(rpois(Nsim*8, means), 8, Nsim) result<-numeric(Nsim) for(i in 1:Nsim)result[i]<-deviance(glm(data[,i]~ A*C + C*M + A*M , family=poisson, data=ACM.df)) # draw a picture of the results hist(result, nclass=30, freq=F, xlab="deviance",main="Histogam of 10000 simulated deviances") # add chisquare 1 density xx<-seq(0,14,length=100) lines(xx,dchisq(xx,1), lwd=2, col="red")

Stats 330: Lecture 30