210 likes | 432 Views
Problems with infinite solutions in logistic regression. Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006. Introduction. I teach logistic regression for the analysis of case-control studies to Epidemiology Master’s students, using Stata
E N D
Problems with infinite solutions in logistic regression Ian White MRC Biostatistics Unit, Cambridge UK Stata Users’ Group London, 12th September 2006
Introduction • I teach logistic regression for the analysis of case-control studies to Epidemiology Master’s students, using Stata • I stress how to work out degrees of freedom • e.g. if E has 2 levels and M has 4 levels then you get 3 d.f. for testing the E*M interaction • Our practical uses data on 244 cases of leprosy and 1027 controls • previous BCG vaccination is the exposure of interest • level of schooling is a possible effect modifier • in what follows I’m ignoring other confounders
Leprosy data -> tabulation of d outcome 0=control, | 1=case | Freq. Percent Cum. ------------+----------------------------------- 0 | 1,027 80.80 80.80 1 | 244 19.20 100.00 ------------+----------------------------------- Total | 1,271 100.00 -> tabulation of bcg exposure BCG scar | Freq. Percent Cum. ------------+----------------------------------- Absent | 743 58.46 58.46 Present | 528 41.54 100.00 ------------+----------------------------------- Total | 1,271 100.00 -> tabulation of school possible effect modifier Schooling | Freq. Percent Cum. ------------+----------------------------------- 0 | 282 22.19 22.19 1 | 606 47.68 69.87 2 | 350 27.54 97.40 3 | 33 2.60 100.00 ------------+----------------------------------- Total | 1,271 100.00
Main effects model . xi: logistic d i.bcg i.school i.bcg _Ibcg_0-1 (naturally coded; _Ibcg_0 omitted) i.school _Ischool_0-3 (naturally coded; _Ischool_0 omitted) Logistic regression Number of obs = 1271 LR chi2(4) = 97.50 Prob > chi2 = 0.0000 Log likelihood = -572.86093 Pseudo R2 = 0.0784 ------------------------------------------------------------------------------ d | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _Ibcg_1 | .2908624 .0523636 -6.86 0.000 .204384 .4139314 _Ischool_1 | .7035071 .1197049 -2.07 0.039 .5040026 .9819836 _Ischool_2 | .4029998 .0888644 -4.12 0.000 .2615825 .6208704 _Ischool_3 | .09077 .0933769 -2.33 0.020 .0120863 .6816944 ------------------------------------------------------------------------------ . estimates store main
Interaction model . xi: logistic d i.bcg*i.school i.bcg _Ibcg_0-1 (naturally coded; _Ibcg_0 omitted) i.school _Ischool_0-3 (naturally coded; _Ischool_0 omitted) i.bcg*i.school _IbcgXsch_#_# (coded as above) Logistic regression Number of obs = 1271 LR chi2(7) = 101.43 Prob > chi2 = 0.0000 Log likelihood = -570.90012 Pseudo R2 = 0.0816 ------------------------------------------------------------------------------ d | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _Ibcg_1 | .2248804 .0955358 -3.51 0.000 .0977993 .5170913 _Ischool_1 | .6626409 .1234771 -2.21 0.027 .4599012 .9547549 _Ischool_2 | .4116581 .1027612 -3.56 0.000 .2523791 .6714598 _Ischool_3 | 1.28e-08 1.42e-08 -16.41 0.000 1.46e-09 1.12e-07 _IbcgXsch_~1 | 1.448862 .7046411 0.76 0.446 .5585377 3.758385 _IbcgXsch_~2 | 1.086848 .6226504 0.15 0.884 .3536056 3.340553 _IbcgXsch_~3 | 4.25e+07 . . . . . ------------------------------------------------------------------------------ Note: 17 failures and 0 successes completely determined. . estimates store inter
The problem . table bcg school, by(d) ---------------------------------- 0=control | , 1=case | and BCG | Schooling scar | 0 1 2 3 ----------+----------------------- 0 | Absent | 141 257 129 17 Present | 57 229 182 15 ----------+----------------------- 1 | Absent | 77 93 29 Present | 7 27 10 1 ----------------------------------
LR test . xi: logistic d i.bcg i.school LR chi2(4) = 97.50 Log likelihood = -572.86093 . estimates store main . xi: logistic d i.bcg*i.school LR chi2(7) = 101.43 Log likelihood = -570.90012 . estimates store inter . lrtest main inter Likelihood-ratio test LR chi2(2) = 3.92 (Assumption: main nested in inter) Prob > chi2 = 0.1407
What is Stata doing? (guess) • Recognises the information matrix is singular • Hence reduces model df by 1 • In other situations Stata drops observations • if a single variable perfectly predicts success/failure • this happens if the problematic cell doesn’t occur in a reference category • then Stata refuses to perform lrtest, but we can force it to do so • Stata still gets df=2; can use df(3) option
. gen bcgrev=1-bcg . xi: logistic d i.bcgrev*i.school i.bcgrev _Ibcgrev_0-1 (naturally coded; _Ibcgrev_0 omitted) i.school _Ischool_0-3 (naturally coded; _Ischool_0 omitted) i.bcg~v*i.sch~l _IbcgXsch_#_# (coded as above) note: _IbcgXsch_1_3 != 0 predicts failure perfectly _IbcgXsch_1_3 dropped and 17 obs not used Logistic regression Number of obs = 1254 LR chi2(6) = 94.12 Prob > chi2 = 0.0000 Log likelihood = -570.90012 Pseudo R2 = 0.0762 ------------------------------------------------------------------------------ d | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _Ibcgrev_1 | 4.446809 1.889136 3.51 0.000 1.933895 10.22502 _Ischool_1 | .9600749 .4312915 -0.09 0.928 .3980361 2.315729 _Ischool_2 | .4474097 .2307071 -1.56 0.119 .1628482 1.229215 _Ischool_3 | .5428571 .6013396 -0.55 0.581 .0619132 4.75979 _IbcgXsch_~1 | .6901971 .3356713 -0.76 0.446 .2660717 1.79039 _IbcgXsch_~2 | .920092 .5271167 -0.15 0.884 .2993516 2.82801 ------------------------------------------------------------------------------ . est store interrev . lrtest interrev main observations differ: 1254 vs. 1271 r(498); . lrtest interrev main, force Likelihood-ratio test LR chi2(2) = 3.92 (Assumption: main nested in interrev) Prob > chi2 = 0.1407
What’s right? • Zero cell suggests small sample so asymptotic c2 distribution may be inappropriate for LRT • true in this case: have a bcg*school category with only 1 observation • but I’m going to demonstrate the same problem in hypothetical example with expected cell counts > 3 but a zero observed cell count • Could combine or drop cells to get rid of zeroes • but the cell with zeroes may carry information • Problems with testing boundary values are well known • e.g. LRT for testing zero variance component isn’t c21 • but here the point estimate, not the null value, is on the boundary
Example to explain why LRT makes some sense . tab x y, chi2 exact | y x | 0 1 | Total -----------+----------------------+---------- 0 | 10 20 | 30 1 | 0 10 | 10 -----------+----------------------+---------- Total | 10 30 | 40 Pearson chi2(1) = 4.4444 Pr = 0.035 Fisher's exact = 0.043 1-sided Fisher's exact = 0.035
Model: logit P(y=1|x) = a + bx Difference in log lik = 3.4 LRT = 6.8 on 0 df?
Example to explore correct df using Pearson / Fisher as gold standard . tab x y, chi2 exact | y x | 0 1 | Total -----------+----------------------+---------- 1 | 6 0 | 6 2 | 3 6 | 9 3 | 3 6 | 9 -----------+----------------------+---------- Total | 12 12 | 24 Pearson chi2(2) = 8.0000 Pr = 0.018 Fisher's exact = 0.029 • All expected counts ≥3 • Don’t want to drop or merge category 1 - contains the evidence for association!
. xi: logistic y i.x i.x _Ix_1-3 (naturally coded; _Ix_1 omitted) Logistic regression Number of obs = 24 LR chi2(2) = 10.36 Prob > chi2 = 0.0056 Log likelihood = -11.457255 Pseudo R2 = 0.3113 ------------------------------------------------------------------------------ y | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _Ix_2 | 1.61e+08 1.61e+08 18.90 0.000 2.27e+07 1.14e+09 _Ix_3 | 1.61e+08 . . . . . ------------------------------------------------------------------------------ Note: 6 failures and 0 successes completely determined. . est store x . xi: logistic y Logistic regression Number of obs = 24 LR chi2(0) = 0.00 Prob > chi2 = . Log likelihood = -16.635532 Pseudo R2 = 0.0000 ------------------------------------------------------------------------------ y | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- ------------------------------------------------------------------------------ . est store null
LRT . xi: logistic y i.x Log likelihood = -11.457255 . est store x . xi: logistic y Log likelihood = -16.635532 . est store null . lrtest x null Likelihood-ratio test LR chi2(1) = 10.36 (Assumption: null nested in x) Prob > chi2 = 0.0013
Comparison of tests | y x | 0 1 | Total -----------+----------------------+---------- 1 | 6 0 | 6 2 | 3 6 | 9 3 | 3 6 | 9 -----------+----------------------+---------- Total | 12 12 | 24 Pearson chi2(2) = 8.0000 P = 0.018 Fisher's exact = P = 0.029 LR chi2(1) = 10.36 P = 0.0013 (using 2df: P = 0.0056) Clearly LRT isn’t great. But 1df is even worse than 2df
Note • In this example, we could use Pearson / Fisher as gold standard. • Can’t do this in more complex examples (e.g. adjust for several covariates).
My proposal for Stata • lrtest appears to adjust df for infinite parameter estimates: it should not • Model df should be incremented to allow for any variables dropped because they perfectly predict success/failure • Don’t need to increment log lik as it is 0 for the cases dropped • Can the ad hoc handling of zeroes by xi:logistic be improved?
Conclusions for statisticians • Must remember the c2 approximation is still poor for these LRTs • typically anti-conservative? (Kuss, 2002) • Performance of LRT can be improved by using penalised likelihood (Firth, 1993; Bull, 2006) - like a mildly informative prior • worth using routinely? • Gold standard: Bayes or exact logistic regression (logXact)?
Output for example with 2-level x . logit y x Log likelihood = -19.095425 ------------------------------------------------------------------------------ y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _cons | .6931472 .3872983 1.79 0.074 -.0659436 1.452238 ------------------------------------------------------------------------------ . estimates store x . logit y Log likelihood = -22.493406 ------------------------------------------------------------------------------ y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _cons | 1.098612 .3651484 3.01 0.003 .3829346 1.81429 ------------------------------------------------------------------------------ . estimates store null . lrtest x null df(unrestricted) = df(restricted) = 1 r(498); . lrtest x null, force df(1) Likelihood-ratio test LR chi2(1) = 6.80 (Assumption: null nested in x) Prob > chi2 = 0.0091