
The Cox model in R



Presentation Transcript


  1. The Cox model in R
  Gardar Sveinbjörnsson, Jongkil Kim, Yongsheng Wang
  Department of Mathematics, ETHZ, 18 April 2011

  2. OUTLINE
  • Recidivism data
  • Cox PH Model for Time-Independent Variables in R
  • Model Selection
  • Model Diagnostics
  • Cox PH Model for Time-Dependent Variables in R
  • Summary

  3. Recidivism data
  The data is from an experimental study of recidivism of 432 male prisoners, who were observed for a year after being released from prison. Half of the prisoners were randomly given financial aid when they were released.

  4. Variables in Recidivism Data
  • week: week of first arrest after release, or censoring time
  • arrest: the event indicator, 1 = arrested, 0 = not
  • fin: 1 = received financial aid, 0 = not
  • age: in years at the time of release
  • race: 1 = black, 0 = others
  • wexp: 1 = had full-time work experience, 0 = not
  • mar: 1 = married, 0 = not
  • paro: 1 = released on parole, 0 = not
  • prio: number of prior convictions
  • educ: codes 2 (grade 6 or less), 3 (grades 6 through 9), 4 (grades 10 and 11), 5 (grade 12), or 6 (some post-secondary)
  • emp1-emp52: 1 = employed in the corresponding week, 0 = not

  5. Recidivism Data
  > Rossi <- read.table('Rossi.txt', header=T)
  > Rossi[1:5, 1:10]   ## omitting the variables emp1-emp52
    week arrest fin age race wexp mar paro prio educ
  1   20      1   0  27    1    0   0    1    3    3
  2   17      1   0  18    1    0   0    1    8    4
  3   25      1   0  19    0    1   0    1   13    3
  4   52      0   1  23    1    1   1    1    1    5
  5   52      0   0  19    0    1   0    1    3    3

  6. Cox PH Model for Time-Independent Variables in R

  7. Cox PH Model for Time-Independent Variables in R
  • Surv and coxph functions in R
  • Cox regression
  • Adjusted survival curve

  8. Surv function in R
  > Surv(time, time2, event, type=c('right', 'left', 'interval', 'counting'), origin=0)
  Right-censored data:
  • Surv(time, event)
  • time: survival or censoring time
  • event: the status indicator (0 = censored, 1 = observed)
  Left-truncated and right-censored data:
  • Surv(time, time2, event)
  • time: left-truncation time
  • time2: survival or censoring time
  • event: the status indicator (0 = censored, 1 = observed)
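As a small illustration (not from the original slides), toy calls to Surv() for the two data types might look like this; the times and status values below are made up:

library(survival)
# Right-censored data: observed time plus a 0/1 event indicator
Surv(time = c(9, 13, 13, 18), event = c(1, 1, 0, 1))
# Left-truncated and right-censored data: entry time, exit time, event indicator
Surv(time = c(0, 2, 5), time2 = c(9, 13, 18), event = c(1, 0, 1))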

  9. coxph function in R
  > coxph(formula, data=, weights, subset, na.action, init, control, method=c("efron", "breslow", "exact"), singular.ok=TRUE, robust=FALSE, model=FALSE, x=FALSE, y=TRUE, ...)
  • Most of the arguments are similar to lm.

  10. coxph function in R
  • Formula
  • The right-hand side: the same as for a linear model
  • The left-hand side: a survival object
  • Method: the method for handling tied event times. If there are no tied survival times, all the methods are equivalent.
  • Breslow: the traditional default in many other packages
  • Efron: the default in coxph and generally more accurate than Breslow when there are tied survival times
  • Exact: computes the exact partial likelihood
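For concreteness, a minimal sketch of fitting the same model under different tie-handling methods (using variables from the Rossi data introduced above; the covariate choice is arbitrary):

library(survival)
# Efron (coxph's default) versus Breslow approximations for tied event times
coxph(Surv(week, arrest) ~ fin + age + prio, data = Rossi, method = "efron")
coxph(Surv(week, arrest) ~ fin + age + prio, data = Rossi, method = "breslow")
# method = "exact" would compute the exact partial likelihood (slower)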

  11. Cox regression
  > mod.allison <- coxph(Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio + as.factor(educ), data=Rossi)
  > mod.allison

  12. Cox regression
  Call: coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio + as.factor(educ), data = Rossi)
                       coef exp(coef) se(coef)      z      p
  fin               -0.4027     0.669   0.1930 -2.087 0.0370
  age               -0.0514     0.950   0.0222 -2.316 0.0210
  race               0.3615     1.435   0.3122  1.158 0.2500
  wexp              -0.1200     0.887   0.2135 -0.562 0.5700
  mar               -0.4236     0.655   0.3822 -1.108 0.2700
  paro              -0.0982     0.906   0.1959 -0.501 0.6200
  prio               0.0794     1.083   0.0293  2.707 0.0068
  as.factor(educ)3   0.5934     1.810   0.5196  1.142 0.2500
  as.factor(educ)4   0.3284     1.389   0.5437  0.604 0.5500
  as.factor(educ)5  -0.1210     0.886   0.6752 -0.179 0.8600
  as.factor(educ)6  -0.4070     0.666   1.1233 -0.362 0.7200
  Likelihood ratio test = 38.7 on 11 df, p = 6.01e-05
  n = 432, number of events = 114

  13. Adjusted survival curve
  > plot(survfit(mod.allison), ylim=c(.7, 1), xlab='Weeks', ylab='Proportion Not Rearrested')

  14. Adjusted survival curve
  • We may wish to display how estimated survival depends upon the value of a covariate. Because the principal purpose of the recidivism study was to assess the impact of financial aid on rearrest, let us focus on this covariate.
  • We construct a new data frame with two rows, one for each value of fin; the other covariates are fixed to their medians.

  15. Adjusted survival curve
  > Rossi.fin <- with(Rossi, data.frame(fin=c(0,1), age=rep(median(age),2),
        race=rep(median(race),2), wexp=rep(median(wexp),2), mar=rep(median(mar),2),
        paro=rep(median(paro),2), prio=rep(median(prio),2),
        educ=as.factor(rep(median(educ),2))))
  > plot(survfit(mod.allison, newdata=Rossi.fin), conf.int=T, lty=c(1,2),
        col=c('red', 'blue'), ylim=c(.5, 1), xlab='Weeks',
        ylab='Proportion Not Rearrested')

  16. Model Selection

  17. Model Selection
  • Why variable selection?
  • Purposeful selection
  • Stepwise selection
  • Best subset selection of covariates

  18. Why variable selection?
  • We generally want to explain the data in the simplest way.
  • Unnecessary predictors in a model will affect the estimation of other quantities; that is to say, degrees of freedom are wasted.
  • If the model is to be used for prediction, we save effort, time and/or money if we do not have to collect data for redundant predictors.

  19. Why variable selection?
  • We must decide on a method to select a subset of variables:
  • Purposeful selection
  • Stepwise selection (using p-values or AIC)
  • Best subset selection

  20. Purposeful selection
  • We fit a multivariable model containing all variables that were significant in a univariable analysis at the 20-25% level.
  • We use the p-values from the Wald statistic to remove variables from our model, and confirm the non-significance by a likelihood ratio test (see the sketch after this list).
  • We check whether the removal has produced an "important" change in the coefficients of the other variables.
  • We check again all the variables that we removed.
  • We check for nonlinearity.
  • We look for interactions.
  • We check assumptions.
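A minimal sketch of one such removal step, assuming the Rossi variables from the earlier slides (the choice to drop wexp is purely illustrative):

library(survival)
mod.full <- coxph(Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio,
                  data = Rossi)
summary(mod.full)$coefficients              # Wald z statistics and p-values
mod.red <- update(mod.full, . ~ . - wexp)   # drop a candidate variable
anova(mod.red, mod.full)                    # confirm non-significance with a likelihood ratio test
round(coef(mod.full)[names(coef(mod.red))] - coef(mod.red), 3)  # change in the other coefficients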

  21. Stepwise selection
  • Stepwise selection is a mix between forward and backward selection.
  • We can either start with an empty model or a full model and add/remove predictors according to some criterion.
  • At each step we reconsider terms that were added or removed earlier.
  → Often applied in practice
  → Implemented in the step() function in R
  → In practice often based on AIC/BIC

  22. Stepwise selection
  • The AIC is a measure of the relative goodness of fit of a statistical model.
  • It does not only reward goodness of fit, but also includes a penalty that is an increasing function of the number of parameters.
  • AIC = 2k - 2 max(log-likelihood), where k is the number of parameters in the model.
  • That is, the smaller the AIC, the better. A sketch of an AIC-based call to step() follows below.
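A hedged sketch of how such an AIC-based search could be run with step(); at each step it prints a table in the style of the next slide:

library(survival)
mod.full <- coxph(Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio,
                  data = Rossi)
mod.step <- step(mod.full, direction = "both")   # add/remove terms to reduce the AIC
mod.step                                         # the model retained by the search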

  23. Stepwise selection using our data
  Step:  AIC=1327.35
  Surv(week, arrest) ~ fin + age + mar + prio
         Df    AIC
  <none>     1327.3
  - mar    1 1327.7
  - fin    1 1329.0
  - age    1 1335.4
  - prio   1 1336.2

  24. Best Subset Selection
  • Stepwise selection only considers a small number of all the possible models.
  • Best subset selection provides a way to check all the possible models.
  • As in linear regression, we need a criterion to judge the models.
  • Idea: not only based on goodness-of-fit, but also penalizing for the model size.

  25. Best Subset Selection
  • Mallow's C: C = W + (p - 2q); smaller C is better.
  • p: number of variables under consideration
  • q: number of variables not included in the subset model
  • W = W(p) - W(p-q), where W(p) is the Wald statistic for the model containing all p variables and W(p-q) is the Wald statistic for the subset model.
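As a rough sketch of the calculation, one plausible reading (not necessarily the authors' exact procedure) takes the Wald statistics from separately fitted full and subset models, via the wald.test component of a coxph fit:

library(survival)
full <- coxph(Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio,
              data = Rossi)
sub  <- coxph(Surv(week, arrest) ~ fin + age + prio + mar, data = Rossi)  # one candidate subset
p <- length(coef(full))              # number of variables under consideration
q <- p - length(coef(sub))           # number of variables not in the subset model
W <- full$wald.test - sub$wald.test  # W = W(p) - W(p-q)
C <- W + (p - 2 * q)                 # smaller C is better
C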

  26. Best Subset Selection of Covariates
  • Check the model

  27. Model Diagnostics

  28. Model Diagnostics
  • Analyze PH assumption with residuals
  • Influential observations
  • Checking nonlinearity

  29. Analyze PH assumption with residuals
  • There is strong evidence against the PH assumption for age.
  • Plotting the cox.zph object gives plots of the scaled Schoenfeld residuals.
  (mod.allison.4 refers to the reduced model with fin, age, prio and mar from the model-selection step.)
  > cox.zph(mod.allison.4)
               rho    chisq       p
  fin    -0.000159 2.99e-06 0.99862
  age    -0.221020 7.38e+00 0.00659
  prio   -0.077930 7.32e-01 0.39237
  mar     0.131485 2.08e+00 0.14937
  GLOBAL        NA 8.88e+00 0.06406

  30. Analyze PH assumption with residuals
  > plot(cox.zph(mod.allison.4))

  31. Analyze PH assumption with residuals
  • For the variable age, the plot of residuals changes over time. There are two possible solutions:
  • The effect of age differs between time intervals: use age as a stratification (strata) variable.
  • The effect of age declines over time: include an interaction between age and time.

  32. Analyze PH assumption (Strata)
  • The variable age is used as a "strata" variable.
  • We separate observations into 4 groups by age: 19 or younger, 20-25, 26-30, and 31 or older.
  > library(car)
  > Rossi$age.cat <- recode(Rossi$age, "lo:19=1;20:25=2;26:30=3;31:hi=4")
  > table(Rossi$age.cat)
    1   2   3   4
   66 236  66  64

  33. Analyze PH assumption (Strata)
  • Use the separated age groups as strata and check the PH assumption again.
  > mod.allison.6 <- coxph(Surv(week, arrest) ~ fin + prio + strata(age.cat) + mar, data=Rossi)
  > cox.zph(mod.allison.6)
            rho  chisq     p
  fin   -0.0164 0.0315 0.859
  prio  -0.0721 0.5946 0.441
  mar    0.1337 2.0830 0.149
  GLOBAL     NA 2.7478 0.432

  34. Analyze PH assumption (Interaction)
  • To analyze the interaction between age and time, we first transform the data to one row per person-week.
  • Example: for the 1st observation (week = 20), the single interval (0,20] becomes (0,1+], (1,2+], (2,3+], ..., (19,20] (the '+' marks weeks in which no arrest occurred).
  > Rossi[1, 1:10]
    week arrest fin age race wexp mar paro prio educ
  1   20      1   0  27    1    0   0    1    3    3
  > Rossi.2[1:20, 1:10]
       start stop arrest.time week arrest fin age ... prio educ
  1.1      0    1           0   20      1   0  27 ...    3    3
  . . .
  1.19    18   19           0   20      1   0  27 ...    3    3
  1.20    19   20           1   20      1   0  27 ...    3    3
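A rough base-R sketch of how such a person-week (counting-process) data set could be built; this is not the authors' code, and the helper column id is an addition made here:

# One row per observed week, with (start, stop] intervals and a week-specific event indicator
Rossi.2 <- do.call(rbind, lapply(seq_len(nrow(Rossi)), function(i) {
  weeks <- seq_len(Rossi$week[i])        # weeks during which subject i was under observation
  covs  <- Rossi[i, c("week", "arrest", "fin", "age", "race",
                      "wexp", "mar", "paro", "prio", "educ")]
  data.frame(id = i,
             start = weeks - 1,
             stop  = weeks,
             arrest.time = as.numeric(weeks == Rossi$week[i] & Rossi$arrest[i] == 1),
             covs[rep(1, length(weeks)), ],
             row.names = NULL)
}))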

  35. Analyze PH assumption (Interaction)
  • There is an interaction between time and age:
  coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + age:stop + prio + mar, data = Rossi.2)
                coef exp(coef) se(coef)     z       p
  fin       -0.35971     0.698  0.19042 -1.89 0.05900
  age        0.03552     1.036  0.03899  0.91 0.36000
  prio       0.09868     1.104  0.02721  3.63 0.00029
  mar       -0.50512     0.603  0.37302 -1.35 0.18000
  age:stop  -0.00371     0.996  0.00145 -2.56 0.01100
  Likelihood ratio test=38.1 on 5 df, p=3.55e-07  n= 19809
  • Age has a positive partial effect on the hazard, but this effect gets smaller with time, even becoming negative after about 10 weeks.

  36. Influential observations
  • For each covariate we look at how much the regression coefficients change if we remove one observation.
  • In R, the argument type="dfbeta" to the residuals() function produces a matrix of estimated changes in the regression coefficients upon deleting each observation in turn.
  • We then plot these changes.
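A minimal sketch of this check, assuming mod.allison.4 is the reduced model (fin, age, prio, mar) from the model-selection step:

dfbeta <- residuals(mod.allison.4, type = "dfbeta")   # one column per coefficient
par(mfrow = c(2, 2))
for (j in 1:ncol(dfbeta)) {
  plot(dfbeta[, j], xlab = "Observation index",
       ylab = names(coef(mod.allison.4))[j])   # estimated change in coefficient j when an observation is dropped
  abline(h = 0, lty = 2)
}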

  37. [Figure: dfbeta plots of the estimated changes in each regression coefficient as observations are deleted in turn.]

  38. Influential observations (just for fun)
  • Let's see what happens if we change some observations.
  • We take the first age observation and change it, first to age=60 and then to age=110.
  • See R.
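A rough sketch of the experiment described above (the reduced model mod.allison.4 and its covariates are assumed from the earlier slides):

Rossi.mod <- Rossi
Rossi.mod$age[1] <- 60          # then repeat with 110
mod.pert <- coxph(Surv(week, arrest) ~ fin + age + prio + mar, data = Rossi.mod)
coef(mod.pert) - coef(mod.allison.4)[names(coef(mod.pert))]   # shift in the coefficients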

  39. Checking nonlinearity
  • Nonlinearity is a problem in Cox regression as it is in linear and generalized linear models.
  • To detect nonlinearity we plot the Martingale residuals against covariates.
  • We add a smooth produced by local linear regression using the loess function and try to detect deviations from zero.
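A hedged sketch for the covariate age, again assuming the reduced model mod.allison.4; lowess() is used here for the smooth, playing the same role as loess:

mart <- residuals(mod.allison.4, type = "martingale")
plot(Rossi$age, mart, xlab = "Age", ylab = "Martingale residual")
lines(lowess(Rossi$age, mart), col = "red")   # systematic deviation from 0 suggests nonlinearity
abline(h = 0, lty = 2)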

  40. Martingale residuals
  • The Martingale residual for individual i at time ti is Mi = δi - Ĥi(ti), where
  • δi is the event indicator,
  • Ĥi(ti) is the estimated cumulative hazard function for individual i, and
  • ti is the time at the end of follow-up for individual i.

  41. [Figure: Martingale residual plots against the covariates, with fitted smooths.]

  42. In case of nonlinearity
  • We try to transform the covariate that is not linear.
  • We can try several transformations, e.g. log or sqrt.
  • We can also include higher-order terms in our model and compare with the original model using a likelihood ratio test.
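For example, a minimal sketch comparing the working model with one that adds a quadratic term in age:

mod.lin  <- coxph(Surv(week, arrest) ~ fin + age + prio + mar, data = Rossi)
mod.quad <- coxph(Surv(week, arrest) ~ fin + age + I(age^2) + prio + mar, data = Rossi)
anova(mod.lin, mod.quad)   # likelihood ratio test for the added higher-order term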

  43. Cox PH Model for Time-Dependent Variables in R

  44. Time-Dependent Variables
  • Cox PH Model for Time-Dependent Variables
  • Data Transformation
  • Model with Time-Dependent Variables
  • Model with Lagged Time-Dependent Variable

  45. Cox PH Model for Time-Dependent Variables
  • Recall the Cox PH model for time-dependent variables: h(t, X(t)) = h0(t) exp[ Σ βi Xi + Σ δi Xi gi(t) ]
  • gi(t), which depends on time t, can be:
  • 0 (time-independent)
  • t, ln(t), etc.
  • One variable at a time: gi(t) = 1 if t = t0, t1, t2, ...; 0 otherwise
  • Heaviside function: gi(t) = 1 if t ≥ t0; 0 if t < t0
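As an illustration only (not an analysis from the slides), a Heaviside-type term for fin could be coded on the counting-process data Rossi.2; the cut-off at week 26 and the column name fin.late are arbitrary choices made here:

Rossi.2$fin.late <- Rossi.2$fin * as.numeric(Rossi.2$stop >= 26)   # g(t) = 1 when t >= 26, else 0
coxph(Surv(start, stop, arrest.time) ~ fin + fin.late + age + prio + mar,
      data = Rossi.2)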

  46. Data Transformation
  • Now we want to assess the effect of weekly employment on the rearrest hazard.
  • Weekly employment indicators appear as a single row in 52 columns => weekly employment indicators in rows.
  > Rossi[1,]
    week arrest fin age race wexp mar paro prio educ emp1 emp2
  1   20      1   0  27    1    0   0    1    3    3    0    0
    emp3 emp4 emp5 emp6 emp7 emp8 emp9 emp10 emp11 emp12 emp13
  1    0    0    0    0    0    0    0     0     0     0     0
    emp14 emp15 emp16 emp17 emp18 emp19 emp20 emp21 emp22 emp23
  1     0     0     0     0     0     0     0    NA    NA    NA
    emp24 emp25 emp26 emp27 emp28 emp29 emp30 emp31 emp32 emp33
  1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
    emp34 emp35 emp36 emp37 emp38 emp39 emp40 emp41 emp42 emp43
  1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
    emp44 emp45 emp46 emp47 emp48 emp49 emp50 emp51 emp52
  1    NA    NA    NA    NA    NA    NA    NA    NA    NA

  47. Data Transformation
  • Transformed data (weekly employment indicators in rows):
  > Rossi.2[1:50,]
       start stop arrest.time week arrest fin age race wexp mar paro prio educ employed
  1.1      0    1           0   20      1   0  27    1    0   0    1    3    3        0
  . . .
  1.19    18   19           0   20      1   0  27    1    0   0    1    3    3        0
  1.20    19   20           1   20      1   0  27    1    0   0    1    3    3        0
  2.1      0    1           0   17      1   0  18    1    0   0    1    8    4        0
  2.2      1    2           0   17      1   0  18    1    0   0    1    8    4        0
  . . .
  2.16    15   16           0   17      1   0  18    1    0   0    1    8    4        0
  2.17    16   17           1   17      1   0  18    1    0   0    1    8    4        0
  . . .
  7.1      0    1           0   23      1   0  19    1    1   1    1    0    4        1
  7.2      1    2           0   23      1   0  19    1    1   1    1    0    4        1
  . . .
  7.22    11   12           0   23      1   0  19    1    1   1    1    0    4        1
  7.23    12   13           1   23      1   0  19    1    1   1    1    0    4        0
  . . .
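A hedged sketch of how the week-specific employment indicator could be attached to the person-week rows, assuming the id and stop columns from the rough counting-process sketch given earlier:

emp <- as.matrix(Rossi[, paste0("emp", 1:52)])
Rossi.2$employed <- emp[cbind(Rossi.2$id, Rossi.2$stop)]   # employment status of subject id in week stop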

  48. Model with Time-Dependent Variables
  • We treat weekly employment as a time-dependent predictor of time to rearrest.
  • Suggested model: h(t) = h0(t) exp[ β1 fin + β2 age + β3 mar + β4 prio + δemployed Xemployed(t) ]
  • Xemployed(t) indicates whether the person is employed at week t (0 or 1).
  • We estimate the coefficients βi and δemployed.

  49. Model with Time-Dependent Variable
  • The weekly employment variable has an apparently large effect.
  • The hazard of rearrest is smaller by a factor of exp(-1.3289) = 0.265 (a decline of 73.5%) when a person is employed.
  coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + mar + prio + employed, data = Rossi.2)
  n= 19809
                coef exp(coef) se(coef)      z Pr(>|z|)
  fin       -0.33898   0.71250  0.19037 -1.781  0.07498 .
  age       -0.04598   0.95507  0.02059 -2.233  0.02552 *
  mar       -0.36119   0.69684  0.37334 -0.967  0.33331
  prio       0.08419   1.08784  0.02775  3.034  0.00241 **
  employed  -1.32897   0.26475  0.24979 -5.320 1.04e-07 ***

  50. Model with Lagged Time-Dependent Variables
  • Claim: the direction of causality is not clear, because a person cannot work while he is in jail.
  • Arrest at time t vs. weekly employment at time t: ambiguous causality.
  • Remedy: relate arrest at time t to weekly employment at time t-1 (a lagged variable).
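A rough sketch of fitting such a lagged model, building on the hypothetical id and employed columns from the sketches above (not the authors' exact code):

# Lag employment by one week within each subject; week-1 rows get NA and are
# dropped by coxph under the default na.action
Rossi.2 <- Rossi.2[order(Rossi.2$id, Rossi.2$stop), ]
Rossi.2$employed.lag1 <- ave(Rossi.2$employed, Rossi.2$id,
                             FUN = function(x) c(NA, x[-length(x)]))
coxph(Surv(start, stop, arrest.time) ~ fin + age + prio + mar + employed.lag1,
      data = Rossi.2)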
