Statistical Issues in the Design of a Trial, Part 2

Karen Pieper, MS, Duke Clinical Research Institute

Primary vs. Secondary Hypotheses In most studies, we determine one comparison (maybe 2 or 3) that will be our primary comparison of interest. All other comparisons are considered secondary.

Stating the Primary vs. Secondary Hypotheses What does this mean? • We are saying that we are designing our study to answer question X. Whatever results are achieved, the study is valid, and so the results are conclusive (either positive or negative). However, all other endpoints evaluated are not necessarily conclusive but rather are important for generating new hypotheses.

The Multiple Comparisons Problem Why do we need to do this?

Multiple Comparisons Flip a coin. What is your chance of getting a head (H)? 50%. Now flip a coin 2 times. What is the chance that you get at least one head? The possible outcomes are HH, HT, TH, and TT; 3 of the 4 include a head, so the chance is ¾, or 75%.

Multiple Comparisons Note, the formula for this is: 1 - (probability of a tail)^(number of tosses) = 1 - (0.5)^2 = 1 - 0.25 = 0.75

Multiple Comparisons When we perform a test for a clinical study, we prespecify that results will be considered significant only if there is no more than a 5% chance that an effect will be found when, in fact, there really is no effect. (Type I error)

Multiple Comparisons If we perform 2 independent tests when no true effect exists, then the probability of erroneously declaring at least one of them statistically significant is: 1 - 0.95^2 = 1 - 0.9025 = 0.0975. If we perform 10 such tests, then the probability of erroneously declaring at least one of them statistically significant is: 1 - 0.95^10 = 1 - 0.5987 = 0.4013.
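The same arithmetic extends to any number of tests. A minimal Python sketch, assuming the tests are independent and that no true effect exists (the function name `familywise_error` is illustrative and does not come from the slides):

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among n independent tests.

    Assumes every test is run at the same Type I error rate `alpha`
    and that there is truly no effect in any of them.
    """
    return 1 - (1 - alpha) ** n_tests


# The coin example uses the same formula: at least one head in 2 tosses.
print("coin, 2 tosses:", familywise_error(0.5, 2))

# Type I error inflation as the number of tests grows.
for n in (1, 2, 5, 10):
    print(f"{n} tests at alpha = 0.05: {familywise_error(0.05, n):.4f}")
```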

Multiple Comparisons To avoid this problem, we can: Only perform one test • In this case, we specify the primary question that we want to answer. Then we specify a list of secondary questions. • In doing this, we are claiming that the primary question or hypothesis is the only one about which we will make conclusive statements. • The other questions (or secondary hypotheses) are of interest but are not conclusive. We often refer to these secondary questions as “hypothesis generating.”

Multiple Comparisons Or, we can: Decrease the probability that we are willing to accept for a Type I error • Say we only declare significance if our test probability is < 0.01. In this case, if we do 5 tests, then we have: 1 - (0.99)^5 = 1 - 0.951 = 0.049, or about 0.05, as the probability of making at least one Type I error in the study
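A stricter per-test threshold can be chosen from the number of planned tests. The sketch below shows two common rules, under the assumption of independent tests: dividing the overall error rate by the number of tests, and solving the formula above exactly. The names "Bonferroni" and "Šidák" are the standard labels for these rules; the slides do not use them, and the function names are illustrative.

```python
def bonferroni_threshold(overall_alpha: float, n_tests: int) -> float:
    """Per-test threshold: the overall Type I error rate split evenly."""
    return overall_alpha / n_tests


def sidak_threshold(overall_alpha: float, n_tests: int) -> float:
    """Per-test threshold solving 1 - (1 - a)^n = overall_alpha for a."""
    return 1 - (1 - overall_alpha) ** (1 / n_tests)


n_tests = 5
for rule in (bonferroni_threshold, sidak_threshold):
    per_test = rule(0.05, n_tests)
    # Overall chance of at least one false positive with this threshold
    overall = 1 - (1 - per_test) ** n_tests
    print(f"{rule.__name__}: per-test {per_test:.5f}, overall {overall:.4f}")
```

For 5 tests the first rule gives the 0.01 threshold used on the slide; the second is slightly less strict because it uses the full 0.05 exactly.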

Calculating Sample Size: Method A • Choose a number of dollars • Calculate the number of dollars required per patient enrolled • Divide one number into the other

Calculating Sample Size: Method B • Develop consensus among providers, patients, and payers on the difference in outcomes needed to change clinical practice (a.k.a. minimally important difference or MID) • Estimate standard therapy event rate • Calculate number of patients required to demonstrate MID with low probability for missing a real difference

Sample Size Estimation Sample size calculations depend on: • Type I error rate • Type II error rate • Endpoint to be analyzed • Statistical method to be used in analyzing the endpoint • Estimated value for the endpoint one expects to see in the control arm • Estimated improvement one expects to see in the treatment arm • Amount of variation in the endpoint measured

Sample Size Estimation: Type I and Type II Errors
                         Test Result: No Tx Effect   Test Result: Tx Has Effect
Truth: No Tx Effect      (correct conclusion)        Type I Error
Truth: Tx Has Effect     Type II Error               Power

Sample Size Estimation Sample size calculations depend on: • Endpoint to be analyzed • Yes/no responses • Continuous responses • Questionnaire data • Survival from event following a long period of time • Repeated measures of an outcome over several weeks • Etc.

Sample Size Estimation Sample size calculations depend on: • The technique used to analyze the endpoint • The formula used to calculate a sample size is based on the statistical test one plans to use in the final analysis. • Different tests often involve different assumptions. • Each formula for calculating sample size will give a somewhat different answer.

Sample Size Estimation Sample size calculations depend on: • The estimated value for the endpoint one expects to see in the control arm • The rarity of the endpoint. The rarer the endpoint, the more patients it takes to detect a difference. Estimates of control rates come from: • Previous studies in the literature • Pilot studies • “Best clinical guess”

Sample Size Estimation Sample size calculations depend on: • The estimated improvement one expects to see in the treatment arm • The greater the improvement one expects to see, the fewer the patients required. • The amount of variation in the endpoint measure • If your measure is a continuous measure (for example, volume measures, blood pressures, ejection fractions, percentages, weight), then less variation in the measure means that fewer patients are needed to detect a difference.

Sample Size: The Effect of Properly Estimating Treatment Effects
Example: IMPACT II Primary Endpoint in Treated Patients
Changing Low-dose Arm   Low Dose (n = 1300)   Placebo (n = 1285)   p-value
Actual results          118 (9.1%)            149 (11.6%)          0.035
Adding 1 event          119 (9.2%)            149 (11.6%)          0.042
Adding 2 events         120 (9.2%)            149 (11.6%)          0.049
Adding 3 events         121 (9.3%)            149 (11.6%)          0.057
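The sensitivity of the p-value to a handful of events can be explored with a short calculation. The slides do not say which test produced the p-values in this example, so the sketch below assumes an uncorrected two-proportion z-test with a pooled variance (equivalent to a Pearson chi-square test); under that assumption the results should land close to the values shown. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test (assumed test)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


LOW_DOSE_N, PLACEBO_EVENTS, PLACEBO_N = 1300, 149, 1285

# Add 0 to 3 events to the low-dose arm and recompute the p-value each time.
for extra in range(4):
    events = 118 + extra
    p = two_proportion_p_value(events, LOW_DOSE_N, PLACEBO_EVENTS, PLACEBO_N)
    print(f"low-dose events = {events}: p = {p:.3f}")
```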

Sample Size: Summary You can decrease the sample size needed by: • Allowing for a bigger Type I error • Allowing for a bigger Type II error • Increasing the level of improvement one expects to achieve • Choosing a more powerful way of testing • For a binary endpoint, choosing the one that is closest to 50% in likelihood of being observed in the control arm • For survival endpoints, extending the length of follow-up • For continuous measures, decreasing the variation in the outcome
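Several of these levers can be seen in one commonly used approximation for comparing two event rates. This is a sketch, not the formula from any particular trial: it assumes equal-sized arms, a two-sided test, and the normal approximation, and every event rate in it is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def patients_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate patients per arm to compare two event rates.

    alpha is the two-sided Type I error rate; power is 1 - Type II error.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2)


# Hypothetical rates: a 25% relative reduction from a 10% control rate.
print("baseline:          ", patients_per_arm(0.10, 0.075))
print("bigger Type I:     ", patients_per_arm(0.10, 0.075, alpha=0.10))
print("bigger Type II:    ", patients_per_arm(0.10, 0.075, power=0.70))
print("bigger improvement:", patients_per_arm(0.10, 0.05))
# Same 25% relative reduction, but with a control rate closer to 50%.
print("commoner endpoint: ", patients_per_arm(0.40, 0.30))
```

Each variation should print a smaller number than the baseline, matching the direction of the corresponding bullet.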

Statistical Terms

Point Estimate The statistic one calculates to estimate the result of interest • Examples: • Percent of patients with the event • Mean of the outcome • Kaplan-Meier rate of survival • Ratio of percentages • Differences in percentages

Odds Ratio Example: PURSUIT trial: ACS patients; primary endpoint of CEC-adjudicated MI or death at 30 days. Eptifibatide (Integrilin): 4722 total patients, 672 events. Placebo: 4739 total patients, 745 events. Odds in the Integrilin group: 672 / 4050 = 0.166. Odds in the placebo group: 745 / 3994 = 0.187. Odds ratio: 0.166 / 0.187 = 0.89

Risk Ratio Example: PURSUIT trial: ACS patients; primary endpoint of CEC-adjudicated MI or death at 30 days. Eptifibatide (Integrilin): 4722 total patients, 672 events. Placebo: 4739 total patients, 745 events. Risk in the Integrilin group: 672 / 4722 = 14.2%. Risk in the placebo group: 745 / 4739 = 15.7%. Risk ratio: 14.2 / 15.7 = 0.905

Risk Difference = Risk in Group A - Risk in Group B Risk in the Integrilin group: 672 / 4722 = 14.2% Risk in the placebo group: 745 / 4739 = 15.7% Risk difference = 15.7% - 14.2% = 1.5% The treatment has saved 1.5% of those treated from having the event.

Number Needed to Treat Number of patients who need to be treated to prevent one bad outcome • Formula is 1 / absolute risk reduction • Risk reduction in the previous slide was 1.5%, so the number needed to treat to prevent one 30-day death or MI is 1 / 0.015 = 67 patients.

Percent Change = (Risk in Treated - Risk in Control) / Risk in Control • Risk in the Integrilin group: 672 / 4722 = 14.2% • Risk in the placebo group: 745 / 4739 = 15.7% • Percent change = (14.2% - 15.7%) / 15.7% = -1.5% / 15.7% = -0.096, a 9.6% relative reduction in risk
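The effect measures from the PURSUIT slides (odds ratio, risk ratio, risk difference, number needed to treat, and percent change) can all be derived from the same four counts. A minimal sketch; the function and key names are illustrative:

```python
def effect_measures(events_tx, n_tx, events_ctl, n_ctl):
    """Common effect measures from event counts in two arms."""
    risk_tx = events_tx / n_tx
    risk_ctl = events_ctl / n_ctl
    odds_tx = events_tx / (n_tx - events_tx)
    odds_ctl = events_ctl / (n_ctl - events_ctl)
    risk_difference = risk_ctl - risk_tx  # absolute risk reduction
    return {
        "odds_ratio": odds_tx / odds_ctl,
        "risk_ratio": risk_tx / risk_ctl,
        "risk_difference": risk_difference,
        "number_needed_to_treat": 1 / risk_difference,
        "percent_change": (risk_tx - risk_ctl) / risk_ctl,
    }


# PURSUIT counts from the slides: eptifibatide vs. placebo
pursuit = effect_measures(672, 4722, 745, 4739)
for name, value in pursuit.items():
    print(f"{name}: {value:.3f}")
```

Because the function works from unrounded risks, its output can differ in the last digit from the slide arithmetic, which starts from the rounded 14.2% and 15.7%.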

Mean The average across a group of patients. Median The 50th percentile. This is the value such that half the group falls below it and half the group falls above it.

Variance A measure of how far the data fall from the mean. To calculate, take each patient’s value and subtract it from the mean. Square the difference to get rid of the negative signs. Add up all of these squared deviations and divide the total by N to obtain the average squared deviation.
Standard Deviation The square root of the variance. Variance is on a squared scale; taking the square root puts it back on the same scale as the original values.
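The recipe for the variance can be followed step by step and checked against Python's standard library. The measurements below are hypothetical. Dividing by N, as described above, corresponds to `statistics.pvariance`; `statistics.variance` divides by N - 1 instead.

```python
from math import sqrt
from statistics import mean, median, pstdev, pvariance

# Hypothetical ejection fractions (%) for seven patients
values = [35, 40, 45, 50, 55, 60, 65]

average = mean(values)
middle = median(values)

# Subtract each value from the mean and square the difference
squared_deviations = [(average - x) ** 2 for x in values]

# Average squared deviation (divide by N), then return to the original scale
variance = sum(squared_deviations) / len(values)
standard_deviation = sqrt(variance)

print("mean:", average, "median:", middle)
print("variance:", variance, "standard deviation:", standard_deviation)

# The step-by-step results agree with the library functions
assert abs(variance - pvariance(values)) < 1e-9
assert abs(standard_deviation - pstdev(values)) < 1e-9
```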

Example

Box-and-whisker Chart [Figure: box-and-whisker plots at five time points: Baseline, Post PTCA, 2 Hours, 4 Hours, and 18–24 Hours]

P-values: Just how magical is 0.05?

P-values A p-value is a probability. It is the probability of obtaining the observed results, or even more extreme results, if there is truly no effect and the observed difference is due to random chance alone.

P-values • For example, in PURSUIT, the primary results for 30-day death/MI were: eptifibatide 14.2%, placebo 15.7%, p-value 0.042. • This p-value indicates that a difference of at least this size would occur in about 42 out of 1000 similar experiments if eptifibatide had no effect on death or MI out to 30 days.

P-values • For the primary hypothesis, we usually use a critical value of 0.05. • According to this rule, if we complete the study and: • Get a p-value of 0.051, then we cannot declare a statistically significant difference between the two groups • Get a p-value of 0.049, then we can declare a statistically significant difference between the two groups

P-values Summary • Many statisticians involved in research consider p-values that are close to 0.05 (on either side) to be “borderline” in significance, or they say that there is a “trend” towards a significant difference. • The further a p-value falls below 0.05, the more one believes that the effect is real. The further it rises above 0.05, the weaker the evidence of a difference, although a large p-value alone does not prove that the groups are the same.

Confidence Intervals Definition of a 95% confidence interval (C.I.): If you were to repeat the study many times and calculate a 95% C.I. each time, then about 95% of those intervals would contain the true effect.

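Ratio Plot (“Blobogram”) [Figure: a single estimate on a ratio axis running from 0.5 to 1.5. The marker is the point estimate of the effect (size and number of patients) and the bar is the 95% C.I. Axis labels: Tx B Better, Tx A Better]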

Ratio Plot (“Blobogram”) [Figure: two example intervals on a ratio axis running from 0.5 to 1.5, labeled “Tx A better than Tx B” and “Uncertain (p > 0.05)”. Axis labels: Tx B Better, Tx A Better]

Ratio Plot (“Blobogram”) [Figure: five example intervals on a ratio axis running from 0.5 to 1.5, labeled: • Tx B (new Tx) better than Tx A • Tx B probably better than Tx A, but may be equivalent • Tx B may be worse than Tx A, but may be statistically and clinically equivalent • Tx B may be worse than Tx A or may be equivalent (or better!) • Tx B worse than Tx A. Axis labels: Tx B Better, Tx A Better]

Confidence Intervals Death or MI at 30 Days [Figure: plot of five estimates with 95% C.I.s: 0.89 (0.79, 0.99); 0.75 (0.63, 0.91); 0.92 (0.77, 1.11); 1.03 (0.60, 1.76); 1.09 (0.85, 1.39). Group sizes shown: n = 4358, n = 4243, n = 585, n = 1762]

The Effect of Fewer 30-day Death/MIs on P-values and Confidence Intervals in the Eptifibatide Arm

Relationship Between P-values and C.I. 1. Decide how you want to look at the treatment effect (the point estimate). • Difference: 15.7 - 14.2 = change of 1.5 percentage points in death/MI rates • Percent Change: 1.5 / 15.7 = 9.6% decrease in rates • Odds Ratio: 0.89 for death/MI with Integrilin vs. placebo

Relationship Between P-values and C.I. • The p-value and 95% confidence interval are calculated from the same quantities in the data: usually the same measure of effect size and the same measure of how much variation existed in the outcome in the study. • Each describes how easily the results could be attributed to random chance alone.
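To make the shared ingredients concrete, the sketch below derives both a 95% C.I. and a p-value for the PURSUIT odds ratio from one point estimate and one standard error. It assumes the standard large-sample (Wald) method on the log odds ratio; the slides do not say how the published interval was computed, so small differences from the reported 0.89 (0.79, 0.99) are possible. The function name is illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_summary(events_tx, n_tx, events_ctl, n_ctl, level=0.95):
    """Odds ratio with a Wald C.I. and two-sided p-value (assumed method)."""
    a, b = events_tx, n_tx - events_tx      # treated: events, non-events
    c, d = events_ctl, n_ctl - events_ctl   # control: events, non-events
    log_or = log((a / b) / (c / d))         # measure of effect size
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # measure of variation
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    interval = (exp(log_or - z_crit * se), exp(log_or + z_crit * se))
    p_value = 2 * (1 - NormalDist().cdf(abs(log_or / se)))
    return exp(log_or), interval, p_value


odds_ratio, (low, high), p = odds_ratio_summary(672, 4722, 745, 4739)
print(f"OR = {odds_ratio:.2f}, 95% C.I. = ({low:.2f}, {high:.2f}), p = {p:.3f}")

# With this method, the interval excludes 1 exactly when p < 0.05
print("interval excludes 1:", high < 1 or low > 1, "| p < 0.05:", p < 0.05)
```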

Superiority Trials These trials test for statistically significant and clinically meaningful improvements (or harm!) from the use of the experimental treatment over the results obtained through the use of standard care.

Superiority Trial Results [Figure: results for Studies A, B, C, and D on an axis running from 0 to 2, with the MID (minimally important difference) marked. Axis labels: Experimental Tx Better, Control Tx Better]

Equivalence • Equivalence studies are designed to evaluate whether the difference in outcomes for the new treatment compared to those obtained with standard care falls within the boundaries of the minimally important difference (MID). • MID is the largest difference one will accept between the outcomes of 2 groups and still consider them clinically similar.

Equivalence Trial Results [Figure: results for Studies A, B, C, and D on an axis running from 0 to 2, with an MID boundary marked on each side. Axis labels: Experimental Tx Better, Control Tx Better]

Equivalency in Cardiovascular Drug Development • As mortality decreases, larger sample sizes are needed to demonstrate a relative risk reduction • New therapies that may be similar to or slightly better than existing therapies may be important to: • Improve ease of use • Reduce cost • Make small advances • Underpowered studies failing to show a difference with a new therapy may miss a “true” worse outcome

Equivalency in Cardiovascular Drug Development • Failing to demonstrate a difference is not the same as proving that no difference exists. • Absolute equivalence is impossible to prove; there is always some degree of uncertainty. • The goal is to refute the hypothesis that the treatments lead to different outcomes by at least the margin of the MID.
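One way to turn this goal into a decision rule is to compare the confidence interval for the difference in outcomes with the MID boundaries. The sketch below is illustrative only: it assumes the difference is expressed as experimental minus control, that the MID is symmetric around zero, and that all numbers are hypothetical.

```python
def classify_equivalence(ci_low: float, ci_high: float, mid: float) -> str:
    """Classify a trial from the C.I. of the outcome difference and the MID."""
    if -mid < ci_low and ci_high < mid:
        # Differences as large as the MID are ruled out in both directions
        return "equivalent"
    if ci_low >= mid or ci_high <= -mid:
        # The whole interval lies beyond an MID boundary
        return "not equivalent"
    # The interval crosses a boundary, so a difference of MID size is possible
    return "inconclusive"


mid = 2.0  # hypothetical MID, in percentage points
hypothetical_trials = {
    "narrow interval around zero": (-1.2, 0.9),
    "wide interval around zero": (-3.5, 3.0),
    "interval beyond the boundary": (2.4, 5.1),
}
for name, (low, high) in hypothetical_trials.items():
    print(f"{name}: {classify_equivalence(low, high, mid)}")
```

The second hypothetical trial shows the point made above: its interval includes zero, so no difference was demonstrated, yet it cannot rule out a difference larger than the MID.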
