Linear models and Limma
Københavns Universitet, 19 August 2009
Mark D. Robinson
Bioinformatics, Walter+Eliza Hall Institute; Epigenetics Laboratory, Garvan Institute
(with many slides taken from Gordon Smyth)
Morning Theory
• Introduction
• Background correction
• Moderated t-tests
• Simple linear models

Morning Practical
• Demonstration of smoothing
• Limma objects (beta7)
• Background correction and normalization (beta7)
• Simple experimental designs
• 2-colour example (beta7)
• Affymetrix example (cancer)

Afternoon Theory
• More advanced designs / linear modelling
• Moderated F-tests
• Gene set tests
• Other analyses limma can do

Afternoon Practical
• Factorial design (estrogen)
• Gene set testing (cancer)
• Time course experiment (SAHA/depsipeptide)

Limma = linear models for microarray data
Expression measures
yga = expression measure for gene (or probe) g on array a:
• Two-colour: yga = log2(R/G)
• Affymetrix: yga = log-intensity (summarized over probes)
• Illumina: yga = log-intensity (summarized over beads)
Questions of Interest
• What genes have changed in expression? (e.g. between disease/normal, affected by treatment) → gene discovery, differential expression
• Is a specified group of genes all up-regulated in a particular condition? → gene set differential expression
• Can the expression profile predict outcome? → class prediction, classification
• Are there tumour sub-types not previously identified? Do my genes group into previously undiscovered pathways? → class discovery, clustering
Today will cover the first two questions: differential expression.
Microarray Differential Expression Studies
• 10^3 – 10^6 genes/probes/exons on a chip or glass slide
• Inputs to limma: log-intensities (1-colour data) or log(R/G) log-ratios (2-colour data)
• Several steps to go from raw data to a table of "expression": background correction, normalization
• Idea: fit a linear model to the expression data for each gene
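To make the per-gene linear model idea concrete, here is a minimal NumPy sketch (not limma itself; the 3-vs-3 design and the simulated data are assumptions for illustration): the same design matrix is used for every gene, and all genes are fitted in a single least-squares call.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_arrays = 1000, 6

# assumed design: intercept + treatment indicator (3 control, then 3 treated arrays)
X = np.column_stack([np.ones(n_arrays), np.repeat([0.0, 1.0], 3)])

# simulated log-expression matrix, genes x arrays
Y = rng.normal(size=(n_genes, n_arrays))

# the same model is fitted to every gene in one call (genes are columns of Y.T)
coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)   # coef has shape (2, n_genes)
log_fc = coef[1]                                 # per-gene treated-vs-control log fold change
```

With this intercept-plus-indicator design, each gene's fitted treatment coefficient is exactly its treated-minus-control mean difference.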
Two colour microarrays http://en.wikipedia.org/wiki/DNA_microarray
Two-colour data: Log-Intensities
For each probe: R = Rf − Rb and G = Gf − Gb (foreground minus background, in each channel).
There are various ways to calculate the background. Will often modify to ensure Rf − Rb > 0 and Gf − Gb > 0.
Two-colour data: Means and Differences
For each probe:
M = log2(R) − log2(G)   ("minus")
A = (log2(R) + log2(G)) / 2   ("add")
Data Summaries
For each gene, the channel intensities (G1, R1), (G2, R2), (G3, R3), (G4, R4) from the four arrays are converted to the summaries (M1, A1), (M2, A2), (M3, A3), (M4, A4).
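In code, the per-probe M and A summaries are a one-liner each; a small sketch (ma_values is a hypothetical helper name):

```python
import numpy as np

def ma_values(R, G):
    """Per-probe M ("minus") and A ("add") values from background-corrected
    red/green intensities, which must be positive."""
    R, G = np.asarray(R, float), np.asarray(G, float)
    M = np.log2(R) - np.log2(G)          # log-ratio: log2(R/G)
    A = (np.log2(R) + np.log2(G)) / 2    # average log-intensity
    return M, A

M, A = ma_values(8.0, 2.0)   # a 4-fold red/green difference: M = 2, A = 2
```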
Log-Ratios or Single Channel Intensities?
• Traditional analysis treats log-ratios M = log(R/G) as the primary data, i.e., gene expression measurements are relative
• An alternative approach treats the individual channel intensities R and G as primary data, i.e., gene expression measures are absolute (Wolfinger, Churchill, Kerr)
• The single-channel approach makes new analyses possible, but:
  • makes stronger assumptions
  • requires more complex models (mixed models in place of ordinary linear models) to accommodate correlation between R and G on the same spot
BG correction affects DE results
• The importance of careful pre-processing and quality control cannot be over-emphasized for microarray data
• Background correction can have a dramatic effect on differential expression results
• Consider here the normexp method of adaptive background correction
  • the background correction step of the RMA algorithm
  • can also be applied to two-colour data
Additive + multiplicative error model
Observe the intensity for one probe on one array:
Intensity = background + signal
I = B + S
with additive errors on the background B and multiplicative errors on the signal S.
This idea underlies the variance stabilizing transformations vsn (for two-colour data) and vst (for Illumina data).
normexp convolution model
Intensity = Background + Signal
I = B + S, with B ~ N(μ, σ²) and S ~ Exponential(α)
normexp background correction
• Estimate the three parameters (μ, σ, α)
• Replace I with E(S | I)
• For Affymetrix data, I is the "Perfect Match" intensity
• For two-colour data, I = Rf − Rb or I = Gf − Gb
• In the RMA algorithm, parameter estimation uses an ad hoc kernel density method
• In limma (two-colour), parameter estimation maximises the saddlepoint approximation to the likelihood
Background-corrected intensity = E(Signal | Observed Intensity)
Adaptive background correction produces a strictly positive signal.
[Figure: E(Signal | Intensity) plotted against Observed Intensity]
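Under the normal + exponential model, E(S | I) has a closed form: given I, the signal is a normal truncated to (0, ∞). A simple sketch, using the naive formula (a production implementation, like limma's, would work on the log scale for numerical stability):

```python
import numpy as np
from scipy.stats import norm

def normexp_signal(I, mu, sigma, alpha):
    """E(Signal | Intensity = I) when Background ~ N(mu, sigma^2) and
    Signal ~ Exponential with mean alpha.  Conditionally on I, the signal
    is normal with location mu_sf and scale sigma, truncated to (0, inf)."""
    mu_sf = I - mu - sigma**2 / alpha
    z = mu_sf / sigma
    # mean of the truncated normal: always strictly positive
    return mu_sf + sigma * norm.pdf(z) / norm.cdf(z)

# positive output even when the observed intensity is below the mean background
print(normexp_signal(30.0, mu=50.0, sigma=10.0, alpha=200.0))
```

For intensities far above the background, the correction approaches the simple subtraction I − μ − σ²/α.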
Offsets to stabilise the variance
[Figure: log-ratios after background correction, with and without an offset]
The offset reduces variability at low intensities.
Why do offsets stabilize the variance?
• log2(800/100) = log2(8/1) = 3 (8-fold)
• Additive noise affects small numbers more
• Offsets introduce bias: log2[(80+10)/(10+10)] = log2(4.5) = 2.17, not 3
• But the tradeoff (a drop in variance for an increase in bias) is usually worth it
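The variance effect is easy to simulate; a sketch with assumed numbers (true intensity 100 in both channels, additive noise s.d. 10, offset 50):

```python
import numpy as np

rng = np.random.default_rng(0)
true_intensity = 100.0                         # low-intensity probe, equal in both channels
noise_sd = 10.0                                # additive noise dominates at this level
R = true_intensity + rng.normal(0, noise_sd, 5000)
G = true_intensity + rng.normal(0, noise_sd, 5000)

offset = 50.0
M_raw = np.log2(R / G)                         # true log-ratio is 0 for every probe
M_off = np.log2((R + offset) / (G + offset))

# the offset markedly shrinks the spread of the null log-ratios
print(M_raw.std(), M_off.std())
```

The spread drops roughly by the factor (100 + 50)/100, at the cost of compressing true fold changes.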
A self-self experiment: two background methods
[Figure: the same self-self data under two background correction methods; 553 spots not plotted]
Comparison of 2-colour BG correction methods
[Figure: false discoveries (limma) versus genes selected, from Ritchie et al. 2007]
References
• Silver et al. (2009). Microarray background correction: maximum likelihood estimation for the normal-exponential convolution. Biostatistics. [complete mathematical development of the saddlepoint approximation]
• Ritchie et al. (2007). A comparison of background correction methods for two-colour microarrays. Bioinformatics. [shows "normexp" performs best for 2-colour data]
• Irizarry et al. (2003). Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics. [describes RMA BG correction, but gives little detail of the normexp convolution model]
Normalization: two-colour
BG correction and normalization are closely connected. Even after BG correction, some systematic effects remain, and normalization adjusts for them.
Normalization: one-colour
Similarly, for single-channel data, adjustments need to be made so that all samples are comparable.
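As a toy stand-in for real normalization (limma offers loess, quantile, and scale methods, among others), median-centring each array illustrates the goal; an assumed sketch, not limma code:

```python
import numpy as np

def median_center(Y):
    """Shift each array (column) of the log-expression matrix so that all
    arrays end up with the same median -- the crudest between-array
    adjustment; real pipelines use loess or quantile normalization."""
    med = np.median(Y, axis=0)
    return Y - (med - med.mean())

rng = np.random.default_rng(2)
# three arrays with artificially different overall levels
Y = rng.normal(loc=[0.0, 1.5, -0.7], size=(500, 3))
Yn = median_center(Y)   # column medians are now all equal
```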
Borrowing information across genes
• Small data sets: few arrays, so inference for individual genes is uncertain
• Curse of dimensionality: many tests, need to adjust for multiple testing, loss of power
• Benefit of parallelism: the same model is fitted for every gene, so information can be borrowed from one gene to another
Hard and soft shrinkage
• Hard: the simplest way to borrow information is to assume that one or more parameters are constant across genes
• Soft: smooth genewise parameters towards a common value in a graduated way, e.g., Bayes, empirical Bayes, Stein shrinkage, …
A very common experiment
Wild-type mouse × 2 versus mutant mouse × 2: n1 = n2 = 2 Affymetrix arrays, 25,000 probe-sets.
Which genes are differentially expressed?
Ordinary t-tests give very high false discovery rates Residual df = 2
Another very common experiment
Wild-type mouse 1 versus mutant mouse 1, and wild-type mouse 2 versus mutant mouse 2: n = 2 two-colour arrays, 30,000 probes.
Which genes are differentially expressed?
Ordinary t-tests give very high false discovery rates Residual df = 1
Small sample size, many tests
The problem:
• These experiments would be under-powered even with just one gene
• Yet we want to test differential expression for each of 50k genes, hence lots of multiple testing and further loss of power
The solution: the same statistical model is being fitted for every gene in parallel, so we can borrow strength from other genes.
t-tests with common variance
tg = β̂g / (s0 √vg)
with the residual standard deviation s0 pooled across genes.
More stable, but ignores gene-specific variability.
A better compromise
Shrink the standard deviations towards a common value:
s̃g² = (d0 s0² + dg sg²) / (d0 + dg)
where d0 = prior degrees of freedom and dg = residual degrees of freedom.
Moderated t-statistics:
t̃g = β̂g / (s̃g √vg)
Shrinkage of standard deviations
The data decides whether s̃g should be closer to sg or to s0.
Why does it work?
• We learn the typical variability level by looking at all genes, but allow some flexibility from this for individual genes
• Adaptive
Hierarchical model for variances
Data: sg² | σg² ~ (σg² / dg) χ²(dg)
Prior: 1/σg² ~ (1 / (d0 s0²)) χ²(d0)
Posterior: 1/σg² | sg² ~ (1 / (d0 s0² + dg sg²)) χ²(d0 + dg)
Posterior Statistics
Posterior variance estimators:
s̃g² = (d0 s0² + dg sg²) / (d0 + dg)
Moderated t-statistics:
t̃g = β̂g / (s̃g √vg)
Baldi & Long 2001, Wright & Simon 2003, Smyth 2004
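Putting the posterior variance and the moderated t together, assuming d0 and s0² are already known (in practice they are estimated from the data, as on the later slides):

```python
import numpy as np
from scipy import stats

def moderated_t(beta_hat, s2, v, d, d0, s0_2):
    """Moderated t (Smyth 2004): shrink each genewise variance s2
    (d residual d.f.) toward the prior value s0_2 (d0 prior d.f.).
    v is the unscaled variance of beta_hat, e.g. 1/n1 + 1/n2."""
    s2_post = (d0 * s0_2 + d * s2) / (d0 + d)    # posterior variance estimator
    t_mod = beta_hat / np.sqrt(s2_post * v)
    # under H0 the moderated t is exactly t-distributed on d + d0 d.f.
    p = 2 * stats.t.sf(np.abs(t_mod), d + d0)
    return t_mod, p
```

With d0 = 0 this reduces to the ordinary t-test; as d0 → ∞ it approaches the common-variance t-test.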
Exact distribution for moderated t
An unexpected piece of mathematics shows that, under the null hypothesis,
t̃g ~ t on dg + d0 degrees of freedom.
The degrees of freedom add! The Bayes prior in effect adds d0 extra arrays for estimating the variance.
Wright and Simon 2003, Smyth 2004
Hierarchical model for means
Data: β̂g | βg, σg² ~ N(βg, vg σg²)
Prior: βg ≠ 0 with probability p, and given βg ≠ 0, βg ~ N(0, c0 vg σg²)
Lönnstedt and Speed 2002, Smyth 2004
Posterior Odds
Posterior odds of differential expression:
Og = P(βg ≠ 0 | data) / P(βg = 0 | data)
Og is a monotonic function of |t̃g|, hence t̃g gives the best possible ranking of genes.
Lönnstedt and Speed 2002, Smyth 2004
Estimating Hyper-Parameters
Closed-form estimators with good properties are available:
• for s0 and d0, in terms of the first two moments of log sg²
• for c0, in terms of quantiles of the |t̃g|
Marginal Distributions
Under the usual likelihood model, sg is independent of the estimated coefficients. Under the hierarchical model, sg is independent of the moderated t-statistics instead.
Moment estimators for s0 and d0
Marginal moments of log sg² lead to estimators of s0 and d0. Let
eg = log sg² − ψ(dg/2) + log(dg/2)
where ψ is the digamma function. Estimate d0 by solving
ψ′(d0/2) = var(eg) − ψ′(dg/2)
where ψ′ is the trigamma function. Finally
s0² = exp{ mean(eg) + ψ(d0/2) − log(d0/2) }
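The moment estimators for s0 and d0 can be coded directly with the digamma/trigamma functions; a sketch (fit_s0_d0 is a hypothetical name, and a crude bracketed root-solve stands in for limma's more careful iteration):

```python
import numpy as np
from scipy.special import polygamma
from scipy.optimize import brentq

def fit_s0_d0(s2, d):
    """Estimate s0^2 and d0 from genewise sample variances s2, each on d
    residual degrees of freedom, via the marginal moments of log s^2."""
    e = np.log(s2) - polygamma(0, d / 2) + np.log(d / 2)
    ebar = e.mean()
    target = e.var(ddof=1) - polygamma(1, d / 2)   # should equal trigamma(d0/2)
    if target < 1e-6:
        # variances are as concentrated as a single sigma^2: d0 -> infinity
        return np.exp(ebar), np.inf
    d0 = 2 * brentq(lambda x: polygamma(1, x) - target, 1e-6, 1e6)
    s0_2 = np.exp(ebar + polygamma(0, d0 / 2) - np.log(d0 / 2))
    return s0_2, d0

# sanity check on simulated variances: marginally s2 ~ s0^2 * F(d, d0)
rng = np.random.default_rng(3)
s2 = 0.05 * rng.f(4, 4, size=20000)
print(fit_s0_d0(s2, 4))   # estimates should be near (0.05, 4)
```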
Quantile Estimation of c0
Let r be the rank of |t̃g| in descending order, and let F(·; k) be the distribution function of the t-distribution on k degrees of freedom. Can estimate c0 by equating the empirical quantiles of the |t̃g| to their theoretical quantiles under the hierarchical model. Get an overall estimator of c0 by averaging the individual estimators from the top p/2 proportion of the |t̃g|.