590 likes | 1.29k Views
Affymetrix and BioConductor. Johannes Freudenberg Cincinnati Children’s Hospital Medical Center freudejm@uc.edu. Overview. Affymetrix' high-density oligonucleotide microarrays GeneChip ® Technology Terminology Spotted vs. Affymetrix Arrays
E N D
Affymetrix and BioConductor Johannes Freudenberg Cincinnati Children’s Hospital Medical Center freudejm@uc.edu
Overview • Affymetrix' high-density oligonucleotide microarrays • GeneChip® Technology • Terminology • Spotted vs. Affymetrix Arrays • Preprocessing Affymetrix MA data using BioConductor • Overview • Background correction • Normalization • PM correction • Summarization • Many preprocessing strategies available – which one to use? • Affymetrix & Limma Johannes Freudenberg, CCHMC
Affymetrix GeneChip® (=0.5 inch) Dudoit et al., 2002 Johannes Freudenberg, CCHMC
Affymetrix GeneChip® technology Lipshutz et al., 1999 Johannes Freudenberg, CCHMC
Affymetrix GeneChip® technology (2) Original Publication (Lockhart et al., 1996) Affymetrix (http://www.affymetrix.com/technology/ge_analysis/index.affx) Johannes Freudenberg, CCHMC
Affymetrix GeneChip® technology (3) Lipshutz et al., 1999 Johannes Freudenberg, CCHMC
Terminology • Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer. • Perfect match (PM): A 25-mer complementary to a reference sequence of interest (gene, ). • Mismatch (MM): same as PM but with base change for the middle (13th) base (transversion purine <-> pyrimidine, G <->C, A <->T) .The purpose of the MM probe design is to measure non-specific binding and background noise. • Probe-pair: a (PM,MM) pair. • Probe set: a collection of probe-pairs (11 to 20) related to a common gene or EST. • AffyID: an identifier for a probe-pair set (eg. “A28102_at”) • CEL file: text (or binary) file containing raw probe intensities for a single chip created by MAS (GCOS) software • MAS: Microarray Suite – Affymetrix software package • GCOS: GeneChip operating software – new Affymetrix software Adapted from Dudoit et al., 2002 Johannes Freudenberg, CCHMC
Affymetrix vs. other technologies Johannes Freudenberg, CCHMC
What’s the evidence? • Expectation/ Hope(?):measured hybridization intensity proportional to # of mRNA transcripts • Spike-in experiment seems to confirm this • However • Only ten genes • Selection of these genes? Lockhart et al., 1996 Johannes Freudenberg, CCHMC
Affymetrix chips – Summary Biological Sample Extracted mRNA Labeled mRNA Hybridization, scanning & image processing Data (pre-)processing CEL files Johannes Freudenberg, CCHMC
Preprocessing for Affymetrix GeneChips® CEL files Two “standard” methods • MAS 5.0 (now GCOS/GDAS) by Affymetrix • RMA by Speed group (UC Berkeley) cross-chip sequence specific background correction within-probe set aggregation of intensity values within-chip Johannes Freudenberg, CCHMC
affy – BioConductor library • “extensible, interactive environment for data analysis and exploration of Affymetrix oligonucleotide array probe level data” (www.bioconductor.org) • Contains functions to • Load • Store • Plot, and • Preprocess Affymetrix microarray data • Many other related packages • affycomp, affydata, affyPLM, annaffy, gcrma, makecdfenv, matchprobes, simpleaffy, vsn, … Johannes Freudenberg, CCHMC
affy – BioConductor library • … if you haven‘t already done that • Download BioConductor install scriptsource("http://www.bioconductor.org/biocLite.R")(or source("http://www.bioconductor.org/getBioC.R")) • Run the scriptbiocLite() (or getBioC()) • Load the “affy” librarylibrary(affy) • Load our example data (3+3 subset from Bhattacharjee et al., 2001) • Run the following script to download the six CEL files (*)source("http://eh3.uc.edu/affy/downloadAffy.R") • Load CEL files into Rharvard.rawData <- ReadAffy() • If you prefer interactive harvard.rawData <- ReadAffy(widget = T) (*) Note: You may have to change your working directory to a path where you have writing permission Johannes Freudenberg, CCHMC
What is an AffyBatch object? • Take a first look at the experiment dataharvard.rawData • Experiment data is stored in an AffyBatch objectslotNames(harvard.rawData) • "cdfName" – GeneChip version • "nrow", "ncol" – size of the chip (usually 640x640) • "exprs" – expression matrix, contains all PM and MM intensities of the experiment • "phenoData" – experiment annotation • "description", "annotation" – more annotation slots • … Johannes Freudenberg, CCHMC
Accessing an AffyBatch object • Extracting • the expression matrixexprs(harvard.rawData)(*)or intensity(harvard.rawData)(*) • PM valuespm(harvard.rawData)[1:10,] • MM valuesmm(harvard.rawData)[1:10,] • probe namesprobeNames(harvard.rawData)[1:10] • gene namesgeneNames(harvard.rawData)[1:10] • a probe setprobeset(harvard.rawData, "100_g_at") • the type of GeneChipcdfName(harvard.rawData) • … (*) Don’t try this at home… Johannes Freudenberg, CCHMC
How does my data look? – Diagnostic plot I • Plot an image of an arrayimage(harvard.rawData[,1]) Johannes Freudenberg, CCHMC
How does my data look? – Diagnostic plot II • Plot the intensity distributionhist(harvard.rawData, main = "Harvard data") Johannes Freudenberg, CCHMC
How does my data look? – Diagnostic plot III • MvA plotsMAplot(harvard.rawData) Johannes Freudenberg, CCHMC
How does my data look? • Plot a probe setplot(probeset(harvard.rawData, geneNames(harvard.rawData)[1])[[1]]) Johannes Freudenberg, CCHMC
How does my data look? • Plot a probe setpar(mfrow = c(2, 3))barplot(probeset(harvard.rawData,geneNames(harvard.rawData)[1])[[1]]) Johannes Freudenberg, CCHMC
Intended to correct for optical effects Divide array into K zones (default K = 16) Lowest 2% of the intensities in zone k are used to compute background bZk and noise nZk of zone k Background b(x, y) of cell (x, y) is the weighted sum of all bZk Noise n(x,y) is computed likewise Background corrected intensity A(x,y) = max(I(x,y) – b(x,y), 0.5*n(x,y)) MAS 5.0 - Background correction (Affymetrix, 2002) Johannes Freudenberg, CCHMC
MAS 5.0 - Background correction harvard.mas5BG <- bg.correct(harvard.rawData,"mas")hist(harvard.mas5BG, main = "Harvard data after MAS 5.0 BG correction") Johannes Freudenberg, CCHMC
MAS 5.0 - Background correction par(mfrow = c(2,3)) MAplot(harvard.mas5BG) Johannes Freudenberg, CCHMC
S observed PM intensity Model: S sum of “true” signal X and background signal Y S = X + Y, where X ~ Exp(), Y~ N(,²) independent random variables Y S X RMA Background correction + = (Speed, 2002) Johannes Freudenberg, CCHMC
Correct for background by replacing S with E(X | S = s) To do that estimate , , from data Let a = s - - ²b = E(X | S = s) S RMA Background correction Johannes Freudenberg, CCHMC
RMA - Background correction harvard.rmaBG <- bg.correct(harvard.rawData,"rma")hist(harvard.rmaBG, main = "Harvard data after RMA BG correction") Johannes Freudenberg, CCHMC
RMA - Background correction par(mfrow = c(2,3)) MAplot(harvard.rmaBG) Johannes Freudenberg, CCHMC
Why should we normalize? • … to remove chip effects Johannes Freudenberg, CCHMC
MAS 5.0 – Global Constant Normalization • Global constant normalization: Statistical parameters are used as global (= per-chip) scaling factor, such as: • Sum • Median, Quantiles/Percentiles • Mean (also trimmed mean, asymmetric trimmed mean) • Normalization transforms data – afterwards parameter is equal for all chips • Intensity independent normalization(!) • MAS 5.0: 2% trimmed mean m (default = 500) on the linear scale (as opposed to the log scale) • Note: In MAS 5.0, this step is done after summarization Johannes Freudenberg, CCHMC
MAS 5.0 – Global Constant Normalization harvard.mas5norm <- normalize(harvard.mas5BG, "constant")hist(harvard.mas5norm, main = "Harvard data after MAS 5.0 BG & normalization") Johannes Freudenberg, CCHMC
MAS 5.0 – Global Constant Normalization par(mfrow = c(2,3)) MAplot(harvard.mas5norm) Johannes Freudenberg, CCHMC
RMA – Quantile normalization • Assumption: True intensity distributions identical in all replicate samples • Then all points lie on the diagonal in a Q-Q plot • Works the same wayin higher dimensions • If observed intensity distribution are different “force” them to be equal Johannes Freudenberg, CCHMC
RMA – Quantile normalization harvard.rmaNorm <- normalize(harvard.rmaBG, "quantiles") hist(harvard.rmaNorm, main = "Harvard data after RMA BG & normalization") Johannes Freudenberg, CCHMC
RMA – Quantile normalization par(mfrow = c(2,3)) MAplot(harvard.rmaNorm) Johannes Freudenberg, CCHMC
PM correction • Background correction intended to adjust signal for non-specific contributions such as • unspecific binding • cross-hybridization • auto- fluorescence of the surface • Traditional approach cannot be used with high-density arrays • Therefore, Affymetrix invented MMs, which are designed to measure non-specific signal contributions • Original idea: “true” signal = PM - MM • Problems: • PM < MM in approx. 30% of all probe pairs • MM signals really are specific (but less sensitive) Johannes Freudenberg, CCHMC
MAS 5.0 – PM correction • In MAS 5.0 “ideal” mismatch (IM) is computed • IM = MM if MM is “well-behaved” (i.e. MM < PM) • IM = down-scaled MM if MM > PM but most other MMs of the corresponding probeset are “well-behaved” • IM PM if most MMs are “defective” • Adjusted probe value is PM – IM • … for a more detailed description refer to Affymetrix’ Statistical Algorithms Description Document Johannes Freudenberg, CCHMC
harvard.mas5PMcorr <- pmcorrect.mas(harvard.mas5norm)plot(log2(pm(harvard.mas5norm[,6])),log2(harvard.mas5PMcorr[,6]), pch = ".", xlab = "PM", ylab = "corrected probe value", main = "Harvard data after MAS 5.0 PM correction") MAS 5.0 – PM correction Johannes Freudenberg, CCHMC
PM correction? • MMs are greater than corresponding PMs 30% of all probe pairs • Therefore, RMA does not do PM correction • RMA uses only PMs and ignores MMs Johannes Freudenberg, CCHMC
PM correction? • MM signals are specific (but less sensitive) • Hybridization behavior/ labeling efficiency depends on middle base Chudin et al., 2001 Johannes Freudenberg, CCHMC
PM correction! • Intensities depend on number of A, C, G, and T, respectively • Intensities depend on position of A, C, G, and T, respectively Wu et al., 2004 • New model based approach – GCRMA (Wu et al., 2004) Johannes Freudenberg, CCHMC
Summarization • Purpose • for every probe set (i.e. gene/ EST), merge the 16 to 20 probe values into a single expression value • MAS 5.0 • Computes Tukey’s Bi-Weight (a robust weighted mean) for every probe set • Does not “borrow” information from other arrays • RMA • Performs Median-Polish (a robust method to fit a linear model similar to a two-way ANOVA model) • Information is “borrowed” from other arrays • Note: Reported RMA expression values are on log2 scale! Johannes Freudenberg, CCHMC
Summarization harvard.mas5expr<-computeExprSet(harvard.mas5norm,"mas","mas") harvard.rmaExpr <-computeExprSet(harvard.rmaNorm,"pmonly", "medianpolish") plot(log2(exprs(harvard.mas5expr)[,1]), log2(exprs(harvard.mas5expr)[,4]), xlab = "Adeno 1", ylab = "Normal 1", main = "MAS 5") plot(exprs(harvard.rmaExpr)[,1], exprs(harvard.rmaExpr)[,6], xlab = "Adeno 1", ylab = "Normal 1", main = "RMA") Johannes Freudenberg, CCHMC
GCRMA expresso – does it all at once • Example:harvard.exprData <- expresso(harvard.rawData, bgcorrect.method = "rma", normalize.method = "constant", pmcorrect.method = "pmonly", summary.method = "avgdiff") • Result is an “expression set” which is a subclass of “AffyBatch” Johannes Freudenberg, CCHMC
Shortcuts • Some of the most widely used strategies have wrapper functions • MAS 5.0harvard.mas5 <- mas5(harvard.rawData) • RMAharvard.rma <- rma(harvard.rawData) • GCRMAlibrary(gcrma)harvard.gcrma <- gcrma(harvard.rawData) Johannes Freudenberg, CCHMC
What preprocessing strategy to use? • … No one knows • But there is evidence that • MAS 5.0 is not a good idea • RMA is a much better alternative • Other, model-based approaches work well (e.g. GCRMA, VSN) • You can evaluate your favorite strategy using benchmark data • Spike-In experiments (Affymetrix, GeneLogic) • Dilution experiment (GeneLogic) Johannes Freudenberg, CCHMC
affycomp • affycomp is an R package to evaluate your favorite preprocessing strategy using publicly available benchmark data • It’s also a website (http://affycomp.biostat.jhsph.edu/) • 69 different strategies submitted so far • Limitations of benchmark data(?) • Unrealistically high data quality • Unrealistic set-up • “Over-fitting”(?) Johannes Freudenberg, CCHMC
Take-home messages • Affymetrix microarrays are in-situ synthesized oligonucleotide arrays • Each gene is represented by a set of perfect match probes and mismatch probes • Raw probe intensities require preprocessing • Background correction (optional) • Normalization (recommended) • PM correction using MM probes (optional) • Summarization (required) • BioConductor’s affy package offers the necessary tools for handling Affymetrix data • It really does matter which strategy you use! Johannes Freudenberg, CCHMC
Limma can handle Affy data • Just to double check…> harvard.rma@phenoData@pDatasampleadeno1.CEL 1adeno2.CEL 2adeno3.CEL 3normal1.CEL 4normal2.CEL 5normal3.CEL 6 • Load the library…> library(limma) • Please read the user’s guide…> limmaUsersGuide() Johannes Freudenberg, CCHMC
Limma can handle Affy data • Need to define regression model…> design <- model.matrix(~1+factor(c(0, 0, 0, 1, 1, 1)))> colnames(design) <- c("adeno", "normVsCa")> designadeno normVsCa1 1 02 1 03 1 04 1 15 1 16 1 1 First coefficient • estimates mean log-expression for adeno carcinoma and • plays the role of an intercept Second coefficient • estimates difference between carcinoma and normal tissue Johannes Freudenberg, CCHMC