1 / 56

Affymetrix and BioConductor

Affymetrix and BioConductor. Johannes Freudenberg Cincinnati Children’s Hospital Medical Center freudejm@uc.edu. Overview. Affymetrix' high-density oligonucleotide microarrays GeneChip ® Technology Terminology Spotted vs. Affymetrix Arrays

cael
Download Presentation

Affymetrix and BioConductor

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Affymetrix and BioConductor Johannes Freudenberg Cincinnati Children’s Hospital Medical Center freudejm@uc.edu

  2. Overview • Affymetrix' high-density oligonucleotide microarrays • GeneChip® Technology • Terminology • Spotted vs. Affymetrix Arrays • Preprocessing Affymetrix MA data using BioConductor • Overview • Background correction • Normalization • PM correction • Summarization • Many preprocessing strategies available – which one to use? • Affymetrix & Limma Johannes Freudenberg, CCHMC

  3. Affymetrix GeneChip® (=0.5 inch) Dudoit et al., 2002 Johannes Freudenberg, CCHMC

  4. Affymetrix GeneChip® technology Lipshutz et al., 1999 Johannes Freudenberg, CCHMC

  5. Affymetrix GeneChip® technology (2) Original Publication (Lockhart et al., 1996) Affymetrix (http://www.affymetrix.com/technology/ge_analysis/index.affx) Johannes Freudenberg, CCHMC

  6. Affymetrix GeneChip® technology (3) Lipshutz et al., 1999 Johannes Freudenberg, CCHMC

  7. Terminology • Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer. • Perfect match (PM): A 25-mer complementary to a reference sequence of interest (gene, ). • Mismatch (MM): same as PM but with base change for the middle (13th) base (transversion purine <-> pyrimidine, G <->C, A <->T) .The purpose of the MM probe design is to measure non-specific binding and background noise. • Probe-pair: a (PM,MM) pair. • Probe set: a collection of probe-pairs (11 to 20) related to a common gene or EST. • AffyID: an identifier for a probe-pair set (eg. “A28102_at”) • CEL file: text (or binary) file containing raw probe intensities for a single chip created by MAS (GCOS) software • MAS: Microarray Suite – Affymetrix software package • GCOS: GeneChip operating software – new Affymetrix software Adapted from Dudoit et al., 2002 Johannes Freudenberg, CCHMC

  8. Affymetrix vs. other technologies Johannes Freudenberg, CCHMC

  9. What’s the evidence? • Expectation/ Hope(?):measured hybridization intensity proportional to # of mRNA transcripts • Spike-in experiment seems to confirm this • However • Only ten genes • Selection of these genes? Lockhart et al., 1996 Johannes Freudenberg, CCHMC

  10. Affymetrix chips – Summary Biological Sample Extracted mRNA Labeled mRNA Hybridization, scanning & image processing Data (pre-)processing CEL files Johannes Freudenberg, CCHMC

  11. Preprocessing for Affymetrix GeneChips® CEL files Two “standard” methods • MAS 5.0 (now GCOS/GDAS) by Affymetrix • RMA by Speed group (UC Berkeley) cross-chip sequence specific background correction within-probe set aggregation of intensity values within-chip Johannes Freudenberg, CCHMC

  12. affy – BioConductor library • “extensible, interactive environment for data analysis and exploration of Affymetrix oligonucleotide array probe level data” (www.bioconductor.org) • Contains functions to • Load • Store • Plot, and • Preprocess Affymetrix microarray data • Many other related packages • affycomp, affydata, affyPLM, annaffy, gcrma, makecdfenv, matchprobes, simpleaffy, vsn, … Johannes Freudenberg, CCHMC

  13. affy – BioConductor library • … if you haven‘t already done that • Download BioConductor install scriptsource("http://www.bioconductor.org/biocLite.R")(or source("http://www.bioconductor.org/getBioC.R")) • Run the scriptbiocLite() (or getBioC()) • Load the “affy” librarylibrary(affy) • Load our example data (3+3 subset from Bhattacharjee et al., 2001) • Run the following script to download the six CEL files (*)source("http://eh3.uc.edu/affy/downloadAffy.R") • Load CEL files into Rharvard.rawData <- ReadAffy() • If you prefer interactive harvard.rawData <- ReadAffy(widget = T) (*) Note: You may have to change your working directory to a path where you have writing permission Johannes Freudenberg, CCHMC

  14. What is an AffyBatch object? • Take a first look at the experiment dataharvard.rawData • Experiment data is stored in an AffyBatch objectslotNames(harvard.rawData) • "cdfName" – GeneChip version • "nrow", "ncol" – size of the chip (usually 640x640) • "exprs" – expression matrix, contains all PM and MM intensities of the experiment • "phenoData" – experiment annotation • "description", "annotation" – more annotation slots • … Johannes Freudenberg, CCHMC

  15. Accessing an AffyBatch object • Extracting • the expression matrixexprs(harvard.rawData)(*)or intensity(harvard.rawData)(*) • PM valuespm(harvard.rawData)[1:10,] • MM valuesmm(harvard.rawData)[1:10,] • probe namesprobeNames(harvard.rawData)[1:10] • gene namesgeneNames(harvard.rawData)[1:10] • a probe setprobeset(harvard.rawData, "100_g_at") • the type of GeneChipcdfName(harvard.rawData) • … (*) Don’t try this at home… Johannes Freudenberg, CCHMC

  16. How does my data look? – Diagnostic plot I • Plot an image of an arrayimage(harvard.rawData[,1]) Johannes Freudenberg, CCHMC

  17. How does my data look? – Diagnostic plot II • Plot the intensity distributionhist(harvard.rawData, main = "Harvard data") Johannes Freudenberg, CCHMC

  18. How does my data look? – Diagnostic plot III • MvA plotsMAplot(harvard.rawData) Johannes Freudenberg, CCHMC

  19. How does my data look? • Plot a probe setplot(probeset(harvard.rawData, geneNames(harvard.rawData)[1])[[1]]) Johannes Freudenberg, CCHMC

  20. How does my data look? • Plot a probe setpar(mfrow = c(2, 3))barplot(probeset(harvard.rawData,geneNames(harvard.rawData)[1])[[1]]) Johannes Freudenberg, CCHMC

  21. Intended to correct for optical effects Divide array into K zones (default K = 16) Lowest 2% of the intensities in zone k are used to compute background bZk and noise nZk of zone k Background b(x, y) of cell (x, y) is the weighted sum of all bZk Noise n(x,y) is computed likewise Background corrected intensity A(x,y) = max(I(x,y) – b(x,y), 0.5*n(x,y)) MAS 5.0 - Background correction (Affymetrix, 2002) Johannes Freudenberg, CCHMC

  22. MAS 5.0 - Background correction harvard.mas5BG <- bg.correct(harvard.rawData,"mas")hist(harvard.mas5BG, main = "Harvard data after MAS 5.0 BG correction") Johannes Freudenberg, CCHMC

  23. MAS 5.0 - Background correction par(mfrow = c(2,3)) MAplot(harvard.mas5BG) Johannes Freudenberg, CCHMC

  24. S observed PM intensity Model: S sum of “true” signal X and background signal Y S = X + Y, where X ~ Exp(), Y~ N(,²) independent random variables Y S X RMA Background correction + = (Speed, 2002) Johannes Freudenberg, CCHMC

  25. Correct for background by replacing S with E(X | S = s) To do that estimate , ,  from data Let a = s -  - ²b =  E(X | S = s) S RMA Background correction Johannes Freudenberg, CCHMC

  26. RMA - Background correction harvard.rmaBG <- bg.correct(harvard.rawData,"rma")hist(harvard.rmaBG, main = "Harvard data after RMA BG correction") Johannes Freudenberg, CCHMC

  27. RMA - Background correction par(mfrow = c(2,3)) MAplot(harvard.rmaBG) Johannes Freudenberg, CCHMC

  28. Why should we normalize? • … to remove chip effects Johannes Freudenberg, CCHMC

  29. MAS 5.0 – Global Constant Normalization • Global constant normalization: Statistical parameters are used as global (= per-chip) scaling factor, such as: • Sum • Median, Quantiles/Percentiles • Mean (also trimmed mean, asymmetric trimmed mean) • Normalization transforms data – afterwards parameter is equal for all chips • Intensity independent normalization(!) • MAS 5.0: 2% trimmed mean m (default = 500) on the linear scale (as opposed to the log scale) • Note: In MAS 5.0, this step is done after summarization Johannes Freudenberg, CCHMC

  30. MAS 5.0 – Global Constant Normalization harvard.mas5norm <- normalize(harvard.mas5BG, "constant")hist(harvard.mas5norm, main = "Harvard data after MAS 5.0 BG & normalization") Johannes Freudenberg, CCHMC

  31. MAS 5.0 – Global Constant Normalization par(mfrow = c(2,3)) MAplot(harvard.mas5norm) Johannes Freudenberg, CCHMC

  32. RMA – Quantile normalization • Assumption: True intensity distributions identical in all replicate samples • Then all points lie on the diagonal in a Q-Q plot • Works the same wayin higher dimensions • If observed intensity distribution are different “force” them to be equal Johannes Freudenberg, CCHMC

  33. RMA – Quantile normalization harvard.rmaNorm <- normalize(harvard.rmaBG, "quantiles") hist(harvard.rmaNorm, main = "Harvard data after RMA BG & normalization") Johannes Freudenberg, CCHMC

  34. RMA – Quantile normalization par(mfrow = c(2,3)) MAplot(harvard.rmaNorm) Johannes Freudenberg, CCHMC

  35. PM correction • Background correction intended to adjust signal for non-specific contributions such as • unspecific binding • cross-hybridization • auto- fluorescence of the surface • Traditional approach cannot be used with high-density arrays • Therefore, Affymetrix invented MMs, which are designed to measure non-specific signal contributions • Original idea: “true” signal = PM - MM • Problems: • PM < MM in approx. 30% of all probe pairs • MM signals really are specific (but less sensitive) Johannes Freudenberg, CCHMC

  36. MAS 5.0 – PM correction • In MAS 5.0 “ideal” mismatch (IM) is computed • IM = MM if MM is “well-behaved” (i.e. MM < PM) • IM = down-scaled MM if MM > PM but most other MMs of the corresponding probeset are “well-behaved” • IM  PM if most MMs are “defective” • Adjusted probe value is PM – IM • … for a more detailed description refer to Affymetrix’ Statistical Algorithms Description Document Johannes Freudenberg, CCHMC

  37. harvard.mas5PMcorr <- pmcorrect.mas(harvard.mas5norm)plot(log2(pm(harvard.mas5norm[,6])),log2(harvard.mas5PMcorr[,6]), pch = ".", xlab = "PM", ylab = "corrected probe value", main = "Harvard data after MAS 5.0 PM correction") MAS 5.0 – PM correction Johannes Freudenberg, CCHMC

  38. PM correction? • MMs are greater than corresponding PMs 30% of all probe pairs • Therefore, RMA does not do PM correction • RMA uses only PMs and ignores MMs Johannes Freudenberg, CCHMC

  39. PM correction? • MM signals are specific (but less sensitive) • Hybridization behavior/ labeling efficiency depends on middle base Chudin et al., 2001 Johannes Freudenberg, CCHMC

  40. PM correction! • Intensities depend on number of A, C, G, and T, respectively • Intensities depend on position of A, C, G, and T, respectively Wu et al., 2004 • New model based approach – GCRMA (Wu et al., 2004) Johannes Freudenberg, CCHMC

  41. Summarization • Purpose • for every probe set (i.e. gene/ EST), merge the 16 to 20 probe values into a single expression value • MAS 5.0 • Computes Tukey’s Bi-Weight (a robust weighted mean) for every probe set • Does not “borrow” information from other arrays • RMA • Performs Median-Polish (a robust method to fit a linear model similar to a two-way ANOVA model) • Information is “borrowed” from other arrays • Note: Reported RMA expression values are on log2 scale! Johannes Freudenberg, CCHMC

  42. Summarization harvard.mas5expr<-computeExprSet(harvard.mas5norm,"mas","mas") harvard.rmaExpr <-computeExprSet(harvard.rmaNorm,"pmonly", "medianpolish") plot(log2(exprs(harvard.mas5expr)[,1]), log2(exprs(harvard.mas5expr)[,4]), xlab = "Adeno 1", ylab = "Normal 1", main = "MAS 5") plot(exprs(harvard.rmaExpr)[,1], exprs(harvard.rmaExpr)[,6], xlab = "Adeno 1", ylab = "Normal 1", main = "RMA") Johannes Freudenberg, CCHMC

  43. GCRMA expresso – does it all at once • Example:harvard.exprData <- expresso(harvard.rawData, bgcorrect.method = "rma", normalize.method = "constant", pmcorrect.method = "pmonly", summary.method = "avgdiff") • Result is an “expression set” which is a subclass of “AffyBatch” Johannes Freudenberg, CCHMC

  44. Shortcuts • Some of the most widely used strategies have wrapper functions • MAS 5.0harvard.mas5 <- mas5(harvard.rawData) • RMAharvard.rma <- rma(harvard.rawData) • GCRMAlibrary(gcrma)harvard.gcrma <- gcrma(harvard.rawData) Johannes Freudenberg, CCHMC

  45. What preprocessing strategy to use? • … No one knows • But there is evidence that • MAS 5.0 is not a good idea • RMA is a much better alternative • Other, model-based approaches work well (e.g. GCRMA, VSN) • You can evaluate your favorite strategy using benchmark data • Spike-In experiments (Affymetrix, GeneLogic) • Dilution experiment (GeneLogic) Johannes Freudenberg, CCHMC

  46. affycomp • affycomp is an R package to evaluate your favorite preprocessing strategy using publicly available benchmark data • It’s also a website (http://affycomp.biostat.jhsph.edu/) • 69 different strategies submitted so far • Limitations of benchmark data(?) • Unrealistically high data quality • Unrealistic set-up • “Over-fitting”(?) Johannes Freudenberg, CCHMC

  47. Take-home messages • Affymetrix microarrays are in-situ synthesized oligonucleotide arrays • Each gene is represented by a set of perfect match probes and mismatch probes • Raw probe intensities require preprocessing • Background correction (optional) • Normalization (recommended) • PM correction using MM probes (optional) • Summarization (required) • BioConductor’s affy package offers the necessary tools for handling Affymetrix data • It really does matter which strategy you use! Johannes Freudenberg, CCHMC

  48. Affymetrix and Limma

  49. Limma can handle Affy data • Just to double check…> harvard.rma@phenoData@pDatasampleadeno1.CEL 1adeno2.CEL 2adeno3.CEL 3normal1.CEL 4normal2.CEL 5normal3.CEL 6 • Load the library…> library(limma) • Please read the user’s guide…> limmaUsersGuide() Johannes Freudenberg, CCHMC

  50. Limma can handle Affy data • Need to define regression model…> design <- model.matrix(~1+factor(c(0, 0, 0, 1, 1, 1)))> colnames(design) <- c("adeno", "normVsCa")> designadeno normVsCa1 1 02 1 03 1 04 1 15 1 16 1 1 First coefficient • estimates mean log-expression for adeno carcinoma and • plays the role of an intercept Second coefficient • estimates difference between carcinoma and normal tissue Johannes Freudenberg, CCHMC

More Related