Lecture 9: Hypothesis Testing

One-sample tests; comparing two or more samples

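Hypothesis Testing for One Sample
• Standard set-up: $H_0: h(t) = h_0(t)$ for all $t \le \tau$ versus $H_A: h(t) \ne h_0(t)$ for some $t \le \tau$, where $h_0(t)$ is a completely specified hazard
• What is θ?
• Common approach
• Assume the distribution is exponential
• Test that the distribution is exponential with θ = θ0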

Pretty Stringent
• Actually, the null distribution does not have to be exponential
• As long as the hazard h0(t) is completely specified over the range of t, tests can be performed
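Any completely specified null hazard works, for example one read off a population life table. A minimal base-R sketch, assuming a piecewise-constant null hazard with made-up cut points and rates (not taken from any real table), computes the cumulative hazard H0(t) and S0(t) = exp(-H0(t)) that the one-sample tests need:

```r
# Hypothetical, completely specified null hazard: piecewise constant.
# The cut points and rates are illustration values only.
cuts  <- c(0, 5, 10)          # left endpoints of [0,5), [5,10), [10,Inf)
rates <- c(0.01, 0.02, 0.05)  # hazard h0(t) on each interval

# Cumulative hazard H0(t): sum over intervals of rate * time spent there by t
H0_pw <- function(t, cuts, rates) {
  upper <- c(cuts[-1], Inf)   # right endpoints of the intervals
  sapply(t, function(s) sum(rates * pmax(0, pmin(s, upper) - cuts)))
}

# Null survival function S0(t) = exp(-H0(t))
S0_pw <- function(t, cuts, rates) exp(-H0_pw(t, cuts, rates))

H0_pw(c(3, 7, 12), cuts, rates)   # 0.03 0.09 0.25
```

Replacing `cuts` and `rates` with life-table intervals and rates would give the kind of null hazard used when comparing a cohort to a general population.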

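General Form of Test
• Test statistic: $Z(\tau) = O(\tau) - E(\tau) = \sum_{i=1}^{D} W(t_i)\,\frac{d_i}{Y(t_i)} - \int_0^{\tau} W(s)\,h_0(s)\,ds$, where $d_i$ is the number of events at $t_i$ and $W(t)$ is a weight function with $W(t) = 0$ whenever $Y(t) = 0$
• When H0 is true: $E[Z(\tau)] = 0$ and $V[Z(\tau)] = \int_0^{\tau} W^2(s)\,\frac{h_0(s)}{Y(s)}\,ds$
• When H0 is true, assuming large N: $Z(\tau)/\sqrt{V[Z(\tau)]} \approx N(0,1)$
• Note: this is a one-sided test of h(t) > h0(t), rejecting for large positive values; a two-sided test compares $Z^2(\tau)/V[Z(\tau)]$ with a $\chi^2_1$ distribution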

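Log-Rank
• W(ti) = Y(ti) (the most popular choice of weight function)
• Y(ti) is the number of individuals in the risk set at time ti
• With this weight, O(τ) is the observed number of events, $E(\tau) = \sum_{j=1}^{n} H_0(T_j)$ is the expected number of events under H0 (with $T_j$ the observed event or censoring time of subject j), and $V[Z(\tau)] = E(\tau)$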
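With W(t) = Y(t) the whole test reduces to three sums, so it can be sketched in a few lines of base R. The data and the exponential null h0(t) = 0.1 below are hypothetical; `one_sample_logrank` is an illustrative helper, not a library function:

```r
# One-sample log-rank test, W(t) = Y(t), right-censored data.
# obs_time: follow-up times; status: 1 = event, 0 = censored;
# H0: function giving the (fully specified) null cumulative hazard.
one_sample_logrank <- function(obs_time, status, H0) {
  O <- sum(status)        # observed number of events
  E <- sum(H0(obs_time))  # expected number of events under H0
  V <- E                  # null variance of O - E for this weight
  Z <- (O - E) / sqrt(V)
  list(O = O, E = E, Z = Z,
       p_one_sided = pnorm(Z, lower.tail = FALSE),            # HA: h(t) > h0(t)
       p_two_sided = pchisq(Z^2, df = 1, lower.tail = FALSE)) # HA: h(t) != h0(t)
}

# Hypothetical data tested against a hypothetical exponential null, h0(t) = 0.1
obs_time <- c(2, 3, 5, 7, 8, 11, 12, 15)
status   <- c(1, 1, 0, 1, 1, 0, 1, 0)
res <- one_sample_logrank(obs_time, status, H0 = function(t) 0.1 * t)
```

Here O = 5 and E = 0.1 × 63 = 6.3, so Z is negative and the one-sided test of h(t) > h0(t) gives no evidence against H0.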

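Accounting for Left-Truncation
• Choice of weights is still W(t) = Y(t), where Y(t) now counts only subjects who have entered the study (at time L_j) and are still under observation at t
• The expected number of events becomes $E(\tau) = \sum_{j=1}^{n} [H_0(T_j) - H_0(L_j)]$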
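With delayed entry, each subject only accumulates null hazard between entry L_j and exit T_j. A minimal base-R sketch, assuming hypothetical entry and exit ages and a hypothetical null H0(a) = 0.02a (the helper name is illustrative):

```r
# One-sample log-rank test with delayed entry (left truncation).
# entry: time L_j at which subject j enters the risk set; exit: T_j;
# status: 1 = event at T_j, 0 = censored; H0: null cumulative hazard.
one_sample_logrank_lt <- function(entry, exit, status, H0) {
  stopifnot(all(entry < exit))
  O <- sum(status)
  E <- sum(H0(exit) - H0(entry))   # hazard accumulated only while at risk
  Z <- (O - E) / sqrt(E)           # V = E, as without truncation
  list(O = O, E = E, Z = Z,
       p_two_sided = pchisq(Z^2, df = 1, lower.tail = FALSE))
}

# Hypothetical ages at entry and exit, null H0(a) = 0.02 * a
entry  <- c(50, 55, 60, 62)
exit   <- c(58, 70, 61, 75)
status <- c(1, 0, 1, 1)
H0_lin <- function(a) 0.02 * a
res_lt    <- one_sample_logrank_lt(entry, exit, status, H0_lin)
res_naive <- sum(H0_lin(exit))   # E if the delayed entry were ignored
```

Accounting for entry gives E = 0.02 × 37 = 0.74, whereas ignoring it gives 5.28, because each subject would then be treated as at risk from age 0.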

Other Options
• Harrington and Fleming
• $W_{HF}(t) = Y(t)\,S_0(t)^p\,[1 - S_0(t)]^q$, where $p, q \ge 0$ and $S_0(t) = \exp(-H_0(t))$
• Allows the user flexibility in weighting
• Can choose early (p >> q) or late (p << q) departures, or departures in the mid-range (p = q > 0), from the null hypothesis to be more influential
• Special case: log-rank test, p = q = 0
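Because W_HF(t)/Y(t) = w(S0(t)) with w(u) = u^p (1 - u)^q, every term of the general statistic becomes a sum over subjects of a one-dimensional integral. A sketch under that rewriting, using `stats::integrate` and a hypothetical exponential null with rate 0.1; `hf_one_sample` is an illustrative name:

```r
# One-sample test with Harrington-Fleming weights W(t) = Y(t) S0(t)^p [1 - S0(t)]^q.
#   O = sum over events of w(S0(T_j))
#   E = sum_j integral_0^T_j w(S0(s)) h0(s) ds
#   V = sum_j integral_0^T_j w(S0(s))^2 h0(s) ds
hf_one_sample <- function(obs_time, status, h0, S0, p = 0, q = 0) {
  w <- function(s) S0(s)^p * (1 - S0(s))^q
  O <- sum(w(obs_time[status == 1]))
  E <- sum(sapply(obs_time, function(tj)
    integrate(function(s) w(s) * h0(s), 0, tj)$value))
  V <- sum(sapply(obs_time, function(tj)
    integrate(function(s) w(s)^2 * h0(s), 0, tj)$value))
  Z <- (O - E) / sqrt(V)
  list(O = O, E = E, V = V, Z = Z,
       p_one_sided = pnorm(Z, lower.tail = FALSE))  # HA: h(t) > h0(t)
}

# Hypothetical exponential null; h0 must be vectorized for integrate()
lam <- 0.1
h0  <- function(s) rep(lam, length(s))
S0  <- function(s) exp(-lam * s)
obs_time <- c(2, 3, 5, 7, 8, 11, 12, 15)
status   <- c(1, 1, 0, 1, 1, 0, 1, 0)
lr   <- hf_one_sample(obs_time, status, h0, S0, p = 0, q = 0)  # log-rank
hf10 <- hf_one_sample(obs_time, status, h0, S0, p = 1, q = 0)  # early departures
```

With p = q = 0 this reproduces the log-rank quantities (E = V = sum of H0(T_j)); with p = 1, q = 0 and an exponential null the integrals have closed forms, which makes the numerical version easy to check.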

Notes
• An estimator of the variance, V, can be the empirical estimate, $\hat V = \sum_i W^2(t_i)\,d_i / Y^2(t_i)$, rather than the value computed under the hypothesized hazard
• When the alternative h(t) > h0(t) is true, this variance estimator is expected to be larger and the test less powerful
• If h(t) < h0(t), this variance will be smaller and the test more powerful
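For the log-rank weight W = Y the empirical estimate collapses to the observed count, so the two variance choices are E and O. A small sketch with hypothetical data and a hypothetical null H0(t) = 0.05t chosen so that O > E:

```r
# Log-rank one-sample statistic (W = Y) with two variance choices:
#   hypothesized: V = E (computed under H0)
#   empirical:    V = sum W(t_i)^2 d_i / Y(t_i)^2 = sum d_i = O
obs_time <- c(2, 3, 5, 7, 8, 11, 12, 15)   # hypothetical data
status   <- c(1, 1, 0, 1, 1, 0, 1, 0)
H0 <- function(t) 0.05 * t                 # hypothetical exponential null

O <- sum(status)
E <- sum(H0(obs_time))
Z_hyp <- (O - E) / sqrt(E)   # variance evaluated under the null
Z_emp <- (O - E) / sqrt(O)   # empirical variance estimate
c(O = O, E = E, Z_hyp = Z_hyp, Z_emp = Z_emp)
```

Here O = 5 exceeds E = 3.15, as it would tend to when h(t) > h0(t), so the empirical variance is larger and Z_emp is smaller than Z_hyp, in line with the power remark in the notes.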

Example: Rheumatoid Arthritis
• 10 white males with RA followed for up to 18 years
• Objective: determine whether men with RA are at greater risk of mortality

Test statistics =

Bone Marrow Transplant for Leukemia (example 1.3 in the book)
• Patients undergoing bone marrow transplant (BMT) for acute leukemia
• Three types of leukemia: ALL, AML low risk, AML high risk
• What if we are interested in the overall incidence rate (i.e., either relapse or death) across all three leukemia types?

[Figure: Estimated Kaplan-Meier survival probability for all incidence, i.e., relapse or death]

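BMT Example
• Want to test whether or not survival in BMT patients follows an exponential distribution
• What does this mean we are asking?
• Can estimate λ from the data (recall the MLE for an exponential distribution with right-censored data: $\hat\lambda = \sum_j \delta_j / \sum_j t_j$, the number of events divided by the total follow-up time)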
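The closed-form MLE can be checked against a numerical maximizer of the censored exponential log-likelihood. A minimal sketch on hypothetical data, using `stats::optimize`:

```r
# Exponential log-likelihood with right censoring:
#   l(lambda) = d * log(lambda) - lambda * sum(t_j),  d = number of events
obs_time <- c(2, 3, 5, 7, 8, 11, 12, 15)   # hypothetical data
status   <- c(1, 1, 0, 1, 1, 0, 1, 0)

loglik <- function(lambda) sum(status) * log(lambda) - lambda * sum(obs_time)

lambda_closed <- sum(status) / sum(obs_time)   # closed-form MLE: events / total time
lambda_num <- optimize(loglik, interval = c(1e-6, 10), maximum = TRUE)$maximum
c(closed = lambda_closed, numeric = lambda_num)
```

The two agree to within the optimizer's tolerance; `lambda.hat` in the BMT code is the same closed-form estimate.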

R Code
### BMT example
library(survival)
data<-read.csv("H:\\public_html\\BMTRY722_Summer2019\\Data\\BMT_1_3.csv")
# time to relapse for relapsed or event-free patients (NA otherwise)
failtime<-ifelse(data$Relapse==0 & data$Death==0 | data$Relapse==1, data$TTR, NA)
# use the time of death if death occurred at or before TTR
failtime<-ifelse(data$Death==1 & data$TTR>=data$TTD, data$TTD, failtime)
# event indicator: relapse or death
event<-ifelse(data$Relapse==1 | data$Death==1, 1, 0)
st<-Surv(failtime, event)
fit<-survfit(st~1)
# empirical survival function
plot(fit, xlab="Time", ylab="S(t)", lwd=2)
# Calculating lambda hat for estimated hazard rate
lambda.hat<-sum(event)/sum(failtime)

“survdiff” Function
Description: Tests if there is a difference between two or more survival curves using the G-rho family of tests, or for a single curve against a known alternative
Usage: survdiff(formula, data, subset, na.action, rho=0)
Arguments:
formula: a formula expression as for other survival models, of the form Surv(time, status)~predictors. For a one-sample test, the predictors must consist of a single offset(sp) term, where sp is a vector giving the survival probability for each subject

“survdiff” Function
Method: This function implements the G-rho family of Harrington and Fleming (1982), with weights on each death of S(t)^rho, where S is the Kaplan-Meier estimate of survival. With rho = 0 this is the log-rank or Mantel-Haenszel test, and with rho = 1 it is equivalent to the Peto & Peto modification of the Gehan-Wilcoxon test. If the right-hand side of the formula consists only of an offset term, then a one-sample test is done. To cause missing values in the predictors to be treated as a separate group, rather than being omitted, use the factor function with its exclude argument.

R code
# Estimating lambda
> lambda.hat<-sum(event)/sum(failtime)
# Expected S(t) = exp(-lambda.hat*t)
> S.exp<-exp(-lambda.hat*failtime)
> one.sample.test<-survdiff(st~offset(S.exp)) # default rho is 0, i.e. log-rank test
> one.sample.test
 Observed Expected Z p
       83       83 0 1
> one.sample.test2<-survdiff(st~offset(S.exp), rho=1)
> one.sample.test2
 Observed Expected Z p
       83       83 0 0.00521
# Comparing hypothesized dist'n to empirical dist'n
> plot(fit, conf.int=FALSE, lwd=2)
> lines(sort(failtime), rev(sort(S.exp)), col=2, lwd=2, type="s")
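The log-rank output Observed = Expected = 83 with p = 1 is forced by the estimator. With W(t) = Y(t) the expected count is the sum of H0(T_j) = λ̂T_j, and plugging in the MLE λ̂ = (number of events)/(total time) makes that sum equal the number of events exactly. A quick base-R check on simulated data (the seed, rates, and censoring scheme are arbitrary assumptions):

```r
set.seed(722)
n        <- 50
t_event  <- rexp(n, rate = 0.2)    # simulated event times
t_cens   <- runif(n, 0, 10)        # simulated censoring times
obs_time <- pmin(t_event, t_cens)
status   <- as.numeric(t_event <= t_cens)

lambda_hat <- sum(status) / sum(obs_time)   # exponential MLE
O <- sum(status)                            # observed events
E <- sum(lambda_hat * obs_time)             # expected events, H0(t) = lambda_hat * t
c(O = O, E = E)                             # identical up to rounding
```

So when λ is estimated by maximum likelihood from the same data, the rho = 0 one-sample statistic is zero for any data set; the p = 1 reflects the estimator, not evidence that the exponential model fits.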

R code
# Estimating lambda for failure times < 800
> fail2<-failtime[which(failtime<800)]
> event2<-event[which(failtime<800)]
> lambda.hat2<-sum(event2)/sum(fail2)
# Expected S(t) = exp(-.004*t)
> S.exp2<-exp(-lambda.hat2*fail2)
> st2<-Surv(fail2, event2); fit2<-survfit(st2~1)
> one.sample.testa<-survdiff(st2~offset(S.exp2))
> one.sample.testa
 Observed Expected Z p
       80       80 0 1
> one.sample.testb<-survdiff(st2~offset(S.exp2), rho=1)
> one.sample.testb
 Observed Expected     Z     p
       80       80 0.000 0.477

R code
# Estimating lambda for failure times >= 800
> fail3<-failtime[which(failtime>=800)]
> event3<-event[which(failtime>=800)]
> lambda.hat3<-sum(event3)/sum(fail3)
# Expected S(t) = exp(-lambda.hat3*t)
> S.exp3<-exp(-lambda.hat3*fail3)
> st3<-Surv(fail3, event3); fit3<-survfit(st3~1)
> one.sample.testc<-survdiff(st3~offset(S.exp3))
> one.sample.testc
 Observed Expected         Z p
        3        3 -2.56e-16 1
> one.sample.testd<-survdiff(st3~offset(S.exp3), rho=1)
> one.sample.testd
 Observed Expected      Z      p
        3        3 -0.035 0.9730

Conclusions • So what can we conclude about our original hypothesis?

Relevance
• Becoming more common, e.g., Phase II cancer studies with time-to-event (TTE) outcomes instead of response
• But investigators are often more interested in median or 1-year survival
• Yet very important for sample size considerations: sample size calculations most often assume the study data will follow an exponential distribution
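To illustrate how the exponential assumption enters sample size work, one common large-sample approximation (assumed here; not necessarily the method used in this course) treats log(λ̂) as normal with variance 1/d, where d is the number of events. For a one-sided level-α test of λ = λ0 against λ1 with power 1 - β this gives d = (z_{1-α} + z_{1-β})² / [log(λ0/λ1)]², and under exponentiality λ = log(2)/median. The medians below are hypothetical and `events_needed` is an illustrative helper:

```r
# Events needed for a one-sample exponential test (large-sample approximation).
events_needed <- function(median0, median1, alpha = 0.05, power = 0.80) {
  lambda0 <- log(2) / median0   # exponential: hazard = log(2) / median
  lambda1 <- log(2) / median1
  z <- qnorm(1 - alpha) + qnorm(power)
  ceiling(z^2 / log(lambda0 / lambda1)^2)
}

# Hypothetical Phase II design: median 6 months under H0, 9 months hoped for
events_needed(median0 = 6, median1 = 9)
```

This controls the number of events, not patients; converting to a number of patients additionally requires accrual and follow-up assumptions.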

On to something more interesting… comparing two or more samples

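Comparing two or more samples
• ANOVA-type approach: $H_0: h_1(t) = h_2(t) = \cdots = h_K(t)$ for all $t \le \tau$ versus $H_A$: at least one $h_j(t)$ differs for some $t \le \tau$
• Where τ is the largest time for which all groups have at least one subject at risk
• Data can be right-censored (and left-truncated) for the tests we will discuss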

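Notation
• Let $t_1 < t_2 < \cdots < t_D$ be the distinct death times in all samples being compared
• At time $t_i$, let $d_{ij}$ be the number of events in group j out of $Y_{ij}$ individuals at risk (j = 1, 2, …, K)
• Define $d_i = \sum_{j=1}^{K} d_{ij}$ and $Y_i = \sum_{j=1}^{K} Y_{ij}$, the number of events and the number at risk at $t_i$ across all groups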

Rationale
• Weighted comparisons of the estimated hazard of the jth population under the null hypothesis and under the alternative hypothesis
• Based on the Nelson-Aalen estimator
• If the null is true, the pooled estimate of h(t), with increments $d_i / Y_i$, should be an estimator for $h_j(t)$

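Applying the Test
• Let $W_j(t)$ be a positive weight function such that $W_j(t_i) = 0$ whenever $Y_{ij} = 0$, and compute $Z_j(\tau) = \sum_{i=1}^{D} W_j(t_i)\left[\frac{d_{ij}}{Y_{ij}} - \frac{d_i}{Y_i}\right]$, j = 1, …, K
• If all $Z_j(\tau)$'s are close to zero, there is little evidence to reject the null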

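Common Form for Weight Functions
• All commonly used tests choose weight functions such that $W_j(t_i) = Y_{ij}\,W(t_i)$
• Note that the weight $W(t_i)$ is common across all j
• Can redefine Z: $Z_j(\tau) = \sum_{i=1}^{D} W(t_i)\left[d_{ij} - Y_{ij}\,\frac{d_i}{Y_i}\right]$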

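Test Statistic
• Variance and covariance of $Z_j(\tau)$ (K&M p. 207): $\hat\sigma_{jj} = \sum_{i=1}^{D} W(t_i)^2\,\frac{Y_{ij}}{Y_i}\left(1 - \frac{Y_{ij}}{Y_i}\right)\frac{Y_i - d_i}{Y_i - 1}\,d_i$ and $\hat\sigma_{jg} = -\sum_{i=1}^{D} W(t_i)^2\,\frac{Y_{ij}}{Y_i}\,\frac{Y_{ig}}{Y_i}\,\frac{Y_i - d_i}{Y_i - 1}\,d_i$ for $g \ne j$
• $Z_1(\tau), Z_2(\tau), \ldots, Z_K(\tau)$ are linearly dependent because their sum is 0
• For the test statistic, choose K – 1 components
• Chi-square test with K – 1 d.f.: $\chi^2 = (Z_1(\tau), \ldots, Z_{K-1}(\tau))\,\Sigma^{-1}\,(Z_1(\tau), \ldots, Z_{K-1}(\tau))^{T}$, where Σ is the (K – 1) × (K – 1) variance-covariance matrix of the retained components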
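The quadratic-form step can be sketched in base R. Because the K components sum to zero, the full K × K covariance matrix is singular, so one component is dropped before inverting. For illustration only, the inputs are the rounded obs, exp, and var values that survdiff prints for the toy example later in this lecture:

```r
# Observed, expected, and covariance for K = 2 groups (rounded toy values)
obs   <- c(4, 3)
expct <- c(2.566667, 4.433333)
Sigma <- matrix(c( 1.267778, -1.267778,
                  -1.267778,  1.267778), nrow = 2, byrow = TRUE)

Z    <- obs - expct          # Z_j(tau) = O_j - E_j, sums to zero
K    <- length(Z)
keep <- 1:(K - 1)            # any K - 1 components give the same statistic
chisq <- drop(t(Z[keep]) %*% solve(Sigma[keep, keep, drop = FALSE]) %*% Z[keep])
p_value <- pchisq(chisq, df = K - 1, lower.tail = FALSE)
c(chisq = chisq, p = p_value)
```

For K > 2 the same code applies unchanged with longer `obs`/`expct` vectors and a K × K `Sigma`.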

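Log-Rank Test for 2 Groups
• For the log-rank test, W(ti) = 1
• Have 2 groups and want to test if survival is the same in the groups
• We want to develop a nonparametric test of $H_0: S_1(t) = S_2(t)$ for all $t \le \tau$ versus $H_A: S_1(t) \ne S_2(t)$ for some $t \le \tau$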

Log-Rank Test for 2 Groups
• If $S_1(t)$ and $S_2(t)$ follow some parametric distribution and are in the same family, this is easy
• For example assume
• But we need a test whose validity doesn’t depend on parametric assumptions
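To make the parametric case concrete, suppose (our assumption, one possible choice) that both groups are exponential with rates λ1 and λ2, so H0 reduces to λ1 = λ2. The likelihood-ratio statistic then needs only each group's event count and total follow-up time. A base-R sketch on the toy data used in the R example later in this lecture:

```r
time <- c(3, 6, 9, 9, 11, 16, 8, 9, 10, 12, 19, 23)
cens <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0)   # 1 = event, 0 = censored
grp  <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)

# Maximized exponential log-likelihood: d * log(d / T) - d, T = total time
max_loglik <- function(d, T_total) d * log(d / T_total) - d

d  <- tapply(cens, grp, sum)    # events per group
Tt <- tapply(time, grp, sum)    # total follow-up per group
lrt <- 2 * (sum(max_loglik(d, Tt)) - max_loglik(sum(d), sum(Tt)))
p_value <- pchisq(lrt, df = 1, lower.tail = FALSE)
c(lrt = lrt, p = p_value)
```

The validity of this p-value rests entirely on the exponential assumption, which is exactly what the log-rank test avoids.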

Constructing the Log-Rank Test
• Recall our notation
• $t_1 < t_2 < \cdots < t_D$ are the D distinct ordered event times
• $Y_{ij}$ = number of people in group j at risk at $t_i$
• $Y_i$ = number of people at risk across groups at $t_i$
• $d_{ij}$ = number of people in group j who fail at $t_i$
• $d_i$ = number of people across groups who fail at $t_i$

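Constructing the Log-Rank Test
• We can summarize the information at time ti in a 2x2 table:
          Event at ti   No event at ti   At risk
Group 1   d_i1          Y_i1 - d_i1      Y_i1
Group 2   d_i2          Y_i2 - d_i2      Y_i2
Total     d_i           Y_i - d_i        Y_i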

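Constructing the Log-Rank Test
• Conditional on the margins of the table at $t_i$, $d_{i1}$ is hypergeometric with mean $E_{i1} = Y_{i1} d_i / Y_i$ and variance $V_{i1} = \frac{Y_{i1} Y_{i2} d_i (Y_i - d_i)}{Y_i^2 (Y_i - 1)}$
• Summing over the D event times: $\chi^2 = \frac{\left[\sum_{i=1}^{D} (d_{i1} - E_{i1})\right]^2}{\sum_{i=1}^{D} V_{i1}}$, compared with a $\chi^2_1$ distribution under H0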
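The table-by-table computation is easy to reproduce by hand. A base-R sketch (no survival package needed) on the toy data from the R example in this lecture: at each distinct event time it forms the group-1 margin of the 2x2 table, the hypergeometric mean Y_i1 d_i / Y_i and variance Y_i1 Y_i2 d_i (Y_i - d_i) / [Y_i² (Y_i - 1)], and then sums over event times:

```r
time <- c(3, 6, 9, 9, 11, 16, 8, 9, 10, 12, 19, 23)
cens <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0)   # 1 = event, 0 = censored
grp  <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)

event_times <- sort(unique(time[cens == 1]))   # t_1 < ... < t_D

# One row per event time: the 2x2 table margins and the moments of d_i1
tab <- t(sapply(event_times, function(tt) {
  Y1 <- sum(time >= tt & grp == 1)              # at risk in group 1
  Y  <- sum(time >= tt)                         # at risk overall
  d1 <- sum(time == tt & cens == 1 & grp == 1)  # events in group 1
  d  <- sum(time == tt & cens == 1)             # events overall
  E1 <- Y1 * d / Y                              # hypergeometric mean of d1
  V1 <- if (Y > 1) Y1 * (Y - Y1) * d * (Y - d) / (Y^2 * (Y - 1)) else 0
  c(t = tt, Y1 = Y1, Y = Y, d1 = d1, d = d, E1 = E1, V1 = V1)
}))

O1 <- sum(tab[, "d1"]); E1 <- sum(tab[, "E1"]); V1 <- sum(tab[, "V1"])
chisq <- (O1 - E1)^2 / V1
c(O1 = O1, E1 = E1, V1 = V1, chisq = chisq)
```

The totals (O1 = 4, E1 ≈ 2.567, V1 ≈ 1.268, χ² ≈ 1.62) agree with the survdiff output for the same data.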

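Toy Example
• Say we have the following data on two groups (+ denotes a censored time):
• Group 1: 3, 6+, 9, 9, 11+, 16
• Group 2: 8, 9, 10+, 12+, 19, 23+
• We want to test the hypothesis $H_0: S_1(t) = S_2(t)$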

Toy Example

Same Test in R
> library(survival)
> time<-c(3,6,9,9,11,16,8,9,10,12,19,23)
> cens<-c(1,0,1,1,0,1,1,1,0,0,1,0)   # 1 = event, 0 = censored
> grp<-c(1,1,1,1,1,1,2,2,2,2,2,2)
> grp<-as.factor(grp)
> sdat<-Surv(time, cens)
> toy<-survdiff(sdat~grp)
> toy
Call:
survdiff(formula = sdat ~ grp)
      N Observed Expected (O-E)^2/E (O-E)^2/V
grp=1 6        4     2.57     0.800      1.62
grp=2 6        3     4.43     0.463      1.62
 Chisq= 1.6  on 1 degrees of freedom, p= 0.203

Same Test in R
> names(toy)
[1] "n"     "obs"   "exp"   "var"   "chisq" "call"
> toy$obs
[1] 4 3
> toy$exp
[1] 2.566667 4.433333
> toy$var
          [,1]      [,2]
[1,]  1.267778 -1.267778
[2,] -1.267778  1.267778
> toy$chisq
[1] 1.620508

UMP Tests

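More general: 2 samples
• We can change the weight function
• For K = 2, can use the Z-score, $Z_1(\tau)/\sqrt{\widehat{\mathrm{Var}}[Z_1(\tau)]}$, or its square, a $\chi^2_1$ statistic
• The factor $(Y_i - d_i)/(Y_i - 1)$ in the variance corrects for ties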

Choice for Weight Functions
• W(t) = 1
• Log-rank test
• Optimal power for detecting differences when hazards are proportional
• W(ti) = Yi
• Gehan test
• Generalization of the 2-sample Mann-Whitney-Wilcoxon test
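Switching between these two weights changes only one line of the computation. A base-R sketch of the weighted two-sample statistic Z = Σ W(t_i)[d_i1 - Y_i1 d_i / Y_i] and its variance on the toy data; `weighted_2sample` is an illustrative helper, not a package function:

```r
# weight = "logrank" uses W = 1, "gehan" uses W = Y_i (number at risk)
weighted_2sample <- function(time, status, grp, weight = c("logrank", "gehan")) {
  weight <- match.arg(weight)
  Z <- 0; V <- 0
  for (tt in sort(unique(time[status == 1]))) {
    Y1 <- sum(time >= tt & grp == 1); Y <- sum(time >= tt)
    d1 <- sum(time == tt & status == 1 & grp == 1); d <- sum(time == tt & status == 1)
    W  <- if (weight == "logrank") 1 else Y
    Z  <- Z + W * (d1 - Y1 * d / Y)
    if (Y > 1) V <- V + W^2 * Y1 * (Y - Y1) * d * (Y - d) / (Y^2 * (Y - 1))
  }
  c(Z = Z, V = V, chisq = Z^2 / V,
    p = pchisq(Z^2 / V, df = 1, lower.tail = FALSE))
}

time <- c(3, 6, 9, 9, 11, 16, 8, 9, 10, 12, 19, 23)   # toy data from this lecture
cens <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0)
grp  <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
lr <- weighted_2sample(time, cens, grp, "logrank")
gh <- weighted_2sample(time, cens, grp, "gehan")
rbind(logrank = lr, gehan = gh)
```

On these data the Gehan weights give Z = 10 and V = 107 (χ² ≈ 0.93), against χ² ≈ 1.62 for the log-rank test; weighting by the number at risk emphasizes the early event times.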

Choices for Weight Functions
• Fleming-Harrington
• General case: $W(t_i) = \hat S(t_{i-1})^p\,[1 - \hat S(t_{i-1})]^q$, $p, q \ge 0$, where $\hat S$ is the pooled Kaplan-Meier estimate
• Special cases
• Log-rank: p = q = 0
• Mann-Whitney-Wilcoxon: p = 1, q = 0
• q = 0, p > 0: gives greater weight to early departures
• p = 0, q > 0: gives greater weight to late departures
• Allows specific choice of influence (for better or worse!)
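A base-R sketch of the Fleming-Harrington statistic, assuming the convention that the pooled Kaplan-Meier estimate is evaluated just before each event time; `fh_2sample` is an illustrative helper run on the toy data:

```r
# W(t_i) = S(t_{i-1})^p [1 - S(t_{i-1})]^q, S = pooled Kaplan-Meier estimate
fh_2sample <- function(time, status, grp, p = 0, q = 0) {
  Z <- 0; V <- 0
  S_prev <- 1                                   # pooled KM before the first event time
  for (tt in sort(unique(time[status == 1]))) {
    Y1 <- sum(time >= tt & grp == 1); Y <- sum(time >= tt)
    d1 <- sum(time == tt & status == 1 & grp == 1); d <- sum(time == tt & status == 1)
    W  <- S_prev^p * (1 - S_prev)^q             # note 0^0 = 1 in R
    Z  <- Z + W * (d1 - Y1 * d / Y)
    if (Y > 1) V <- V + W^2 * Y1 * (Y - Y1) * d * (Y - d) / (Y^2 * (Y - 1))
    S_prev <- S_prev * (1 - d / Y)              # update pooled KM after t_i
  }
  c(Z = Z, V = V, chisq = Z^2 / V,
    p = pchisq(Z^2 / V, df = 1, lower.tail = FALSE))
}

time <- c(3, 6, 9, 9, 11, 16, 8, 9, 10, 12, 19, 23)   # toy data from this lecture
cens <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0)
grp  <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
fh00 <- fh_2sample(time, cens, grp, p = 0, q = 0)   # log-rank
fh10 <- fh_2sample(time, cens, grp, p = 1, q = 0)   # early departures weighted
fh01 <- fh_2sample(time, cens, grp, p = 0, q = 1)   # late departures weighted
rbind(fh00, fh10, fh01)
```

With q = 1 the first event time gets zero weight (S = 1 there), which shows concretely how the choice of p and q decides which part of follow-up drives the test.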

Others?
• Many
• Not all are available in all software (e.g., the Gehan test is not available in R's survdiff)
• Worth trying a few in each situation to compare inferences

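Caveat
• Note we are interested in the average difference (consider log-rank specifically)
• What if hazards cross?
• Could have a significant difference prior to some t, and another significant difference after t: but what if the direction differs?
• Early and late differences of opposite sign can cancel in the weighted sum, so the overall test may show little or no difference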
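One informal way to see where the overall statistic comes from is to list the per-event-time contributions d_i1 - E_i1 to the log-rank numerator together with their running sum. A base-R sketch on the toy data (any other data set could be substituted):

```r
# Per-event-time log-rank contributions and their running sum. Early positive
# and late negative contributions (or the reverse) can cancel in the total.
time <- c(3, 6, 9, 9, 11, 16, 8, 9, 10, 12, 19, 23)   # toy data from this lecture
cens <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0)
grp  <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)

event_times <- sort(unique(time[cens == 1]))
contrib <- sapply(event_times, function(tt) {
  Y1 <- sum(time >= tt & grp == 1); Y <- sum(time >= tt)
  d1 <- sum(time == tt & cens == 1 & grp == 1); d <- sum(time == tt & cens == 1)
  d1 - Y1 * d / Y                                # d_i1 - E_i1
})
data.frame(t = event_times, O_minus_E = contrib, running = cumsum(contrib))
```

The final running sum is O1 - E1 for the whole sample; a running sum that climbs and then falls back is the pattern to watch for when hazards cross.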

Next time
• More on different weight functions
• Tests for trends
