290 likes | 428 Views
Statistical significance for genomewide studies. John D. Storey and Robert Tibshirani Saurabh Paliwal Topics in Bioinformatics class presentation 11/14/06. Outline of presentation. (Multiple) Hypothesis testing Hypothesis testing: terminology Type I and II errors
E N D
Statistical significance for genomewide studies John D. Storey and Robert Tibshirani Saurabh Paliwal Topics in Bioinformatics class presentation 11/14/06
Outline of presentation • (Multiple) Hypothesis testing • Hypothesis testing: terminology • Type I and II errors • Problem of multiple hypothesis testing • Choice of threshold • Methodology of paper : • Motivating biological examples • False discovery rates, q values • Results • Discussion
Hypothesis testing Usual steps in statistical hypothesis testing: • Formulate the Null Hypothesis (H0) and the Alternate hypothesis (H1): • Null hypothesis: Statistical hypothesis that is tested for possible rejection under the assumption that it is true • Alternate hypothesis: The observations are the result of a real effect. (Hypothesis that the data is distributed different from the distribution of the null hypothesis) • Identify a test statistic that can be used to assess the truth of the null hypothesis • Compute the p-value: Probability that a test statistic at least as extreme as the one observed would be obtained assuming that the null hypothesis is true • Compare the p-value to an acceptable significance level . If then the observed effect is called statistically significant, the null hypothesis is rejected in favor of the alternative hypothesis
Type I and II errors, other terms • Type I error / α error / false positive : the error of rejecting a null hypothesis i.e. the test statistic is sufficiently extreme, when it is in fact true • Type II error / β error / false negative : the error of not rejecting the null hypothesis when the alternative hypothesis is true • Power of a test / Sensitivity = 1-β i.e. the probability of not committing a type II error • Specificity = 1- α i.e. the probability of not making a type I error • Critical region/ rejection region: A set of values (greater than threshold) of the test statistic for which the null hypothesis is rejected
Some common pitfalls • If the test statistic is outside the critical region (less than threshold), the only conclusion is that ‘there is not enough evidence to reject the null hypothesis’. This is NOT the same as evidence in favor of the null hypothesis. In other words, failing to find evidence that there is a difference does not constitute evidence that there is no difference. • The p-value is NOT the probability that a finding is “merely a fluke” • The p-value is NOT the probability that the null hypothesis is true, or (1-(p-value)) is NOT the probability of the alternative hypothesis being true • The significance level of the test is NOT determined by the p-value. It is decided upon before any of the data are collected • A statistically significant result is not always of practical significance / does not necessarily demonstrate a large effect in the population. Given a sufficiently large sample, extremely small and non-notable differences can be found to be statistically significant
some more terms… • One-sided / one-tailed test: Values for which we can reject the null hypothesis are located entirely in one-tail of the probability distribution • Two-sided / two-tailed test: Values … are located in both tails of the probability distribution
Multiple hypothesis testing • m hypothesis tests • H1 = 0 vs H1 = 1 • H2 = 0 vs H2 = 1 • …. Hm = 0 vs Hm = 1 • Want to make simultaneous inference • Each test has possible Type I and Type II errors and there are many possible ways to combine them
The multiple testing problem • Assume all the test statistics are null. As the number of independent applications of the hypothesis testing criterion grows, it begins to outweigh the high unlikelihood associated with each individual test. It becomes increasingly likely that that one will observe data that satisfies the rejection criterion by chance alone (even if the null hypothesis is true in all cases). Thus, there is an increase in the number of ‘false positives’. • Example of flipping coins: A coin is declared biased if in 10 flips, it lands on heads at least 9 times. • If the null hypothesis is that the coin is fair, the likelihood that a fair coin would come up heads at least 9 times out of 10 is 11 / 210 = 0.0107 : pretty unlikely! • Multiple comparison: want to test the fairness of many coins, say 100, by this method (flip each 10 times). The likelihood that all 100 fair coins are identified as fair is (1 – 0.0107)100 = 0.34 : pretty likely (0.66) that some will be identified as biased! (note that the probability of seeing a pre-selected coin do this is still 0.0107) • In n independent comparisons are performed, the probability that one or more false positives will be detected is given by It increases as the number of comparisons increase (in fact, it will go to 1 eventually).
All quantities except m and S are unobservable • m0 = truly null, m1 = truly alternative • Regardless of whether the p-value threshold is fixed or data-dependent, the quantities F, T and S are random variables. Hence, it is common statistical practice to write the overall error in terms of an expected value, E[·] • Specificity = m0 - F / m0 • Sensitivity = T / m1
How does one choose a threshold? • Control the Per-Comparison Type I error (PCER) • i.e. “uncorrected testing”, too many false positives • Gives P(Fi = 1) ≤ α marginally for all 1 ≤ i ≤ m • Control the Familywise Type I error (FWER) • E.g. Bonferroni correction: can guarantee that P(F ≥ 1) ≤ α by setting individual test p-values ≤ α/m. • Follows from Boole’s inequality : • May be appropriate if very few features are expected to be truly alternative (e.g. linkage analysis) • However, typically it is much too conservative for a number of applications e.g. genomewide studies involving differentially expressed genes, fMRI studies etc • Control the False Discovery Rate (FDR) • Guarantees that the FDR = E [F / F+T] = E [F/S] ≤ α • Sensible balance between the number of false positive features, F and the number of true positive features, T • More later…
Biomedical significance of the multiple testing problem • Example 1 : Detecting differentially expressed genes : Hedenfalk et.al., N. Engl.J.Med. 2001 • Detect differential expression of genes (features in this case) between BRCA1 and BRCA2 mutation-positive tumors • Computed a modified F statistic, which was used to assign a p value to each gene • p-value of 0.001 was selected to find 51 differentially expressed (DE) genes out of 3226 (~ 3 false positives expected) • More conservative threshold of 0.0001 yielded 9-11 DE genes • Example 2 :Identifying Exonic Splicing Enhancers : Fairbrother et.al., Science, 2002 • Exonic splice enhancers are short oligonucleotide sequences that enhance pre-mRNA splicing when present in exons • They analyzed human genomic DNA to predict exonic splice enhancers based on the statistical analysis of exon-intron and splice site composition • Used a p-value associated with 4096 possible hexamer sequences. Cutoff of 10-4 results in an expected value of < 1 false positive. • 238 significant hexamers were subsequently biologically verified
Example 3: Genetic dissection of transcriptional regulation, Brem et.al., Science, 2002 • Statistically significant linkage between a gene’s expression level and a marker indicates that a regulator for that gene is located in the region of the marker • Tested each of 6215 genes for linkage to at least one locus, resulted in 6215 p values • p-value cutoff of 8.5x 10-3 was used, and 507 genes showed linkage to at least one locus, where 53 are expected by chance • p-value cutoff of 1.6x 10-4 : 205 genes (where 1 is expected by chance) • Example 4: Finding binding sites of transcriptional regulators: Lee et.al., Science, 2002 • Transcriptional regulatory proteins bind to specific promoter sequences to participate in the regulation of gene expression • Binding of 106 transcriptional factors was analyzed all over the genome. At each genomic location, a p value was calculated under the null hypothesis that no binding occirs, resulting in thousands of p values • At a p-value of 0.001, they estimate 3985 interactions, ~6-10% are false positives
fMRI applications • Compare two sets of conditions and using statistical methods, analyze the difference in ‘brain activity’ in particular parts of the brain • 100,000 voxels in fMRI • Low signal to noise ratio in images • High number of features will lead to enhanced number of false positives, and a huge difficulty in recognizing the actual area of interest • Bonferroni correction is far too conservative
p values and q values • The p value is a measure of significance in terms of the false positive rate i.e. the rate that truly null features are called significant • The q value is a measure in terms of the false discovery rate i.e. the rate that significant features are truly null • Note the difference between FPR and FDR • Hence, a p-value cutoff says little about the content of the features actually called significant
(Positive) False Discovery Rate (p)FDR • False discovery rate: • There is the possibility that S = 0, in which case F/S is undefined. Hence, define the positive False discovery rate (3 possible formulations discussed in Benjamini and Hochberg, 1995 : R is the same as S, and V is the same as F): • For the purposes of the paper, first concentrate on the assumption of S > 0 (S = 0 case will be discussed at the end) • In terms of specificity and sensitivity , one can write the FDR as: Commonly referred to as FDR Commonly referred to as pFDR
The FDR is a measure of the overall accuracy of the set of features declared to be significant • The q value is a measure that reflects the significance that can be attached to each individual feature. • The q value of a certain feature can be described as the expected proportion of false positives among all features as or more extreme than the observed one. (Similar to the p-value definition as the probability of a null feature being as or more extreme than the observed one)
Methodology • The authors first calculated a p-value for each of the 3170 genes of Example 1 (Using a two sample t statistic*) • Plotted a density histogram of the 3170 p values • Order the p-values in increasing order of magnitude. For some threshold t, where 0<t<1, all the features with p values less than t are called significant. Let these m p values be
Estimate FDR(t) as Since m is very large, this can be approximated as (proved later on) • Simple estimate of E[S(t)] is the observed S(t) i.e. number of observed p values ≤ t • The probability a null p value is ≤ t is simply t. Hence, E[F(t)] = m0 . t However, since m0 is unknown, it has to be estimated. • Define the ratio of features that are truly null to total features = m0 / m = π0 Need to specify the distribution of the truly alternative p values to estimate π0 However, since the null p values are uniformly distributed, we can get an estimate Aside: Note that if all the genes were null (not differentially expressed, the density histogram would look like this
Estimate π0 in terms of a tuning parameter λ The rationale behind this estimate: p values of truly alternative features will be close to 0, while p values of null features will be uniformly distributed among [0,1]. ‘Most’ of the p values near 1 will be null. An unbiased estimate of π0 would be (assuming that we could count only null values). However, presence of a few alternative p values only makes the estimate conservative There must be mostly null p values in this region of the graph (p> λ), where λ = 0.5 Conservative estimate of overall proportion of p values
Notice the high variance in the estimate here. It makes it necessary to estimate π0 using a cubic spline Bias–Variance tradeoff: As λ is closer to 1, the bias is lower (because lesser and lesser number of alternative p values will be found there), but the variance of the estimate increases (because lesser number of points are being used to estimate π0) …and vice-versa
Using the above estimate for π0 to estimate FDR(t) as • Estimate the q value of feature i as • Thus, the mathematical definition of the q value is the minimum FDR that can be attained when calling that feature significant • The above method ensures that the estimated q values are increasing in the same order as the p-values • Suppose that each feature’s statistic probabilistically follows a random mixture of a null distribution and an alternative distribution. Then the pFDR can be written as Pr (feature i is truly null | feature i is significant) ~ q value • False positive rate is Pr(feature i is significant | feature i is truly null) ~ p value The p value can be thought of as the minimum possible false positive rate when calling the feature significant
Results • The initial estimate (based on a tuning parameter of λ=0.5) is that 33% of the examined genes were differentially expressed in the study of example 1 (Hedenfalk et.al.) • Thresholding genes with q values ≤ 0.05 yields 160 genes significant for differential expression (~8 genes are expected to be false positives). 117 of these 160 were found to be overexpressed in BRCA1-mutation-positive tumors. • Since all q values can be considered simultaneously, we can use several plots to help us calibrate the q-value cutoff that should be applied in a study based on curves of the form shown in (b), (c) and (d) on the right
further interpretation of results • Assume a gene, say MSH2, whose p value is 5.05 x 10-5 and a q value of 0.013. The former implies that the probability that a null (nondifferentially expressed) gene would be as or more extreme than MSH2 is 5.05 x 10-5. The latter on the other hand suggests that ~0.013 of the genes that are as or more extreme than MSH2 are false positives. • Note: q value is not the probability that the feature (say MSH2) is a false positive. • Intuitively, the probability that MSH2 is a false positive is higher than that of the other genes which are more significant than MSH2, thus it is like a “local FDR”. • The q value takes into account multiple features simultaneously (every feature as or more extreme will also be significant), which is important when assigning multiple measures of statistical significance.
Analysis of Hedenfalk et.al. data • Data consisted of 3226 genes on n1 = 7 BRCA1 arrays and n2 = 8 BRCA2 arrays. Disregarded some genes that had measurements that were several interquartile ranges away from the interquartile range of all of the data. • The expression value from array j and gene i is denoted by xij. Then, the sample mean and variance for gene i are given by • The two-sample t statistic for gene i allowing for different variances of the gene in the two tumors is given by • The null versions of t1…t3170are calculated by a permutation method: the labels on the arrays are randomly scrambled and the t statistics are recomputed for B = 100 permutations. The p value for gene i was calculated as :
Critical discussion • Used t-statistic for calculation of p values instead of the f-statistic used in the Hedenfalk paper. Could have additionally used the same values as in that paper to make the comparison easier in terms of number of genes detected as significant etc. • Do not show whether the increased number of genes found are significant for differentiating between BRCA1 positive or negative samples, or BRCA2 positive or negative samples • The assumption of null p values being uniform is critical to the algorithm (as mentioned by the authors too). However, it would be interesting to see how they would handle it if the null p value distribution was different • The assumption is that as m→∞, their procedure controls the FDR asymptotically (i.e. if all features with q ≤ α are taken, then FDR ≤ α for large m). Another recent paper Benjamini, Krieger, Yekutieli (BKY) 2004 suggested that many of the cases of practical interest may not have such a high m value, so this may not be as relevant for those cases • The effect of dependency of the various features has not been investigated enough, it could potentially be very important. BKY 2004 find that FDR can be almost double the bound.
Parting thought…. "... surely, God loves the .06 nearly as much as the .05." (Rosnell and Rosenthal 1989)