
Microarray Data Analysis

Microarray Data Analysis. Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University. Outline. Quality assessment Normalizing expression arrays Normalizing other array types Selecting genes Identifying functional groups. Array Quality Assessment . ISMB 2008 Microarray Tutorial

Thomas

Presentation Transcript


  1. Microarray Data Analysis Tutorial at ISMB 2008 Mark Reimers Virginia Commonwealth University

  2. Outline Quality assessment Normalizing expression arrays Normalizing other array types Selecting genes Identifying functional groups

  3. Array Quality Assessment ISMB 2008 Microarray Tutorial Mark Reimers Virginia Commonwealth University

  4. Approaches to Quality Control • Careful Technique • Practice! • Testing the RNA quality • Examining the chips and images • Spot metrics • Statistical approaches • Examining deviations as functions of technical variables

  5. Simple QA for Spotted Arrays • Spot Measures • Signal/Noise • Foreground / background or • foreground / SD • Uniformity • Spot Area • Global Measures • Qualitative assessments • Averages of spot measures • Inspect images for artifacts • Streaks of dye, scratches etc. • Are there biases in regions?

  6. Using Statistics for QA • Significant artifacts may not be obvious from visual inspection or bulk statistics • General approach: plot deviations from average or residuals from fit against any technical variable: • Intensity • CG content or Tm • Probe position relative to 3’ end (for poly-T primed) • Location on chip • Color-code deviations on chip image
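
The general approach on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's own code: the function names, the equal-width binning, and the choice of the across-chip mean as the "average" are all assumptions made for the example.

```python
from statistics import mean

def deviations_from_average(chips):
    """chips: list of equal-length lists of log2 intensities, one list per chip.
    Returns, for each chip, each probe's deviation from its across-chip average."""
    n_probes = len(chips[0])
    probe_avg = [mean(chip[i] for chip in chips) for i in range(n_probes)]
    return [[chip[i] - probe_avg[i] for i in range(n_probes)] for chip in chips]

def mean_deviation_by_bin(deviations, covariate, n_bins=4):
    """Average one chip's deviations within equal-width bins of a technical
    covariate (e.g. intensity, Tm, or probe position). Empty bins give None."""
    lo, hi = min(covariate), max(covariate)
    width = (hi - lo) / n_bins or 1.0      # guard against a constant covariate
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for d, x in zip(deviations, covariate):
        b = min(int((x - lo) / width), n_bins - 1)
        sums[b] += d
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

A bin average that drifts away from zero as the covariate increases suggests a technical bias tied to that variable; plotting the raw deviations against the covariate shows the same thing in more detail.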

  7. Simple Portrait - Boxplots • Boxplot of 16 chips from Cheung et al Nature 2005

  8. Another Portrait - Densities

  9. Ratio vs Intensity Plots: Saturation & Quenching • Saturation: decreasing rate of binding of RNA at higher occupancies on probe • Quenching: light emitted by one dye molecule may be re-absorbed by a nearby dye molecule, then lost as heat • Effect proportional to square of density • Plot of log ratio against average log intensity across chips: GSM25377 from the CEPH expression data GSE2552

  10. How Much Variability on R-I? • Ratio-Intensity plots for six arrays at random from Cheung et al Nature (2005)

  11. Covariation with Probe Tm • MAQC project • Agilent 44K • Array 1C3 • Performed by Agilent • Plot of log ratios to average against Tm • Bimodal distribution because two samples are very different

  12. Covariation with Probe Position • RNA degrades from 5’ end • Intensity should decrease from 3’ end uniformly across chips • AffyRNAdeg plots in affy package • Plot of average intensity for each probe position across all genes against probe position
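
The quantity plotted on this slide, average intensity at each probe position across all genes, can be illustrated with a small Python sketch. It is not the affy package's implementation; the function names and the assumption that each probe set is already ordered from the 5' end to the 3' end are hypothetical.

```python
from statistics import mean

def position_profile(probesets):
    """probesets: list of probe sets, each a list of log2 intensities
    ordered 5' -> 3' (assumed). Returns the mean intensity at each probe
    position across all probe sets."""
    n_pos = min(len(ps) for ps in probesets)
    return [mean(ps[k] for ps in probesets) for k in range(n_pos)]

def slope(values):
    """Least-squares slope of values against their index 0, 1, 2, ..."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den
```

Comparing the slope of the profile from chip to chip gives one number per chip; a chip whose slope differs markedly from the others may have a different degree of RNA degradation.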

  13. Spatial Variation Across Chips Red/Green ratios show variation -probably concentrated Ratios of ratios on slide to ratios on standard show consistent biases

  14. In House Spotted Arrays • Ratio of ratios shows much clearer concentration of red spots on some slides • Note non-random but highly irregular concentration of red

  15. Background Subtraction (1) • We think that local background contributes to bias • Does subtracting background remove bias? • Local off-spot background may not be the best estimate of spot background (non-specific hybridization) • Panels: spots; BG subtracted

  16. Background Subtraction (2) • Panels: raw ratios; background; BG-subtracted • Raw spot ratios show a mild bias relative to average • After subtracting a high green background in the center, a red bias results

  17. Other Bias Patterns • Panels: processed; raw spot; background • This spotted oligo array shows strong biases at the beginning and end of each print-tip group • The background shows a milder version of this effect • Subtracting background compensates for about half this effect

  18. Local Bias on Affymetrix Chips • Image of raw data on a log2 scale shows striations but no obvious artifacts • Image of ratios of probes to standard shows a smudge • Non-coding probes • Images show high values as red, low values as yellow

  19. Variation in Affy Chips

  20. QC in Bioconductor • Robust Multi-chip Analysis (RMA) • fits a linear model to each probe set • High residuals show regional patterns • Available in affyPLM package at www.bioconductor.org • Figure labels: high residuals in green; portion of dChip QA image (www.dchip.org); high residuals in pink

  21. Affy QC Metrics in Bioconductor • affyPLM package fits probe level model to Affymetrix raw data • NUSE - Normalized Unscaled Standard Errors • normalized relative to each gene • How many big errors?
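
The "normalized relative to each gene" step can be illustrated as follows. This sketch assumes the per-gene, per-array standard errors from a probe-level model fit are already available as a matrix, and that NUSE scales each gene so that its median across arrays is 1. The function names are hypothetical, and affyPLM's own computation is not reproduced here.

```python
from statistics import median

def nuse(std_errors):
    """std_errors[g][a]: standard error of the expression estimate for gene g
    on array a. Each gene's row is divided by its median across arrays."""
    out = []
    for row in std_errors:
        m = median(row)
        out.append([se / m for se in row])
    return out

def array_medians(nuse_values):
    """Median NUSE per array; an array sitting well above 1 has larger
    standard errors than its peers for most genes."""
    n_arrays = len(nuse_values[0])
    return [median(row[a] for row in nuse_values) for a in range(n_arrays)]
```

For example, `array_medians(nuse(se_matrix))` returns one value per array; where to draw the line for "too high" is a judgment call that the slide does not specify.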

  22. Conclusions • Technical bias is a significant source of error in microarray studies • Technical bias can be visualized and quantified • Normalization can only partially compensate for these problems • It is best to drop chips with extreme technical deviations

  23. Are Microarrays Reliable? • Microarray studies have an uneven track record of replication • Huang et al (2003) replicated few of the markers for breast cancer survival identified by van 't Veer et al (2002) • Two Science papers in 2003 on stem cell gene expression describing parallel experiments identify only 2 genes in common out of hundreds • The MAQC project papers in fall 2006 claimed to validate microarrays. Measures of consistency were highly variable across any two platforms • Maybe both are true: • careful microarray studies are accurate

  24. Further Questions

  25. Quality assessment Normalizing expression arrays Normalizing other array types Selecting genes Identifying functional groups

  26. Normalizing Expression Data ISMB 2008 Tutorial

  27. QA and Normalization • Technical differences cause changes in measures • QA flags big technical differences in arrays • Often sporadic – e.g. scratches on chip • Normalization attempts to compensate for modest but systematic differences • e.g. intensity-dependent bias (quenching) • Both must be done together

  28. Variations in Technique with Broad Consequences for Measures • Temperature of hybridization • Amount of RNA • Degradation of RNA • Yield of conversion to cDNA or cRNA • Yield of labeling reaction • Strength of ionic buffers • Stringency of wash

  29. Many Normalization Approaches • One Parameter • Total or median brightness • Two parameter • Variance stabilizing • Lowess for two-color arrays • Non-parametric • Distribution (quantile) matching • Regression on technical covariates

  30. One Parameter – Overall Mean • Can only measure relative levels of expression: per mg RNA • Assume: only difference between chips is due to different weights of RNA hybridized • Set: • For each chip, normalized values are: • For 2-color cDNA use separate constant for each channel on each chip • More consistent results if use a robust estimator, such as median or 1/3 – trimmed mean (Quackenbush): take mean of middle 2/3 of probes
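
One common form of one-parameter scaling with a robust estimator is sketched below; it is not necessarily the formula shown on the slide. The target level of 1000 and the function names are assumptions for the example. The 1/3-trimmed mean drops one sixth of the probes from each tail and averages the middle two thirds.

```python
from fractions import Fraction

def trimmed_mean(values, trim=Fraction(1, 3)):
    """Mean of the central (1 - trim) fraction of values.
    trim=1/3 keeps the middle 2/3 of the sorted values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim / 2)       # number dropped from each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def scale_normalize(chip, target=1000.0):
    """Multiply every raw intensity on one chip by a single constant so that
    the chip's trimmed mean equals `target` (an arbitrary reference level)."""
    factor = target / trimmed_mean(chip)
    return [v * factor for v in chip]
```

For two-color arrays the same function would be applied to each channel of each chip separately, as the slide suggests. The trimmed mean is used so that a few very bright probes do not dominate the scaling constant.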

  31. One-Parameter Limitations A centering transform can shift the density of log2 data but can’t get around the differences in shape of distributions

  32. Intensity Dependent Bias Ratio – Intensity (M-A) plot of raw data: M = log2(R/G) ; A = (log2(R) + log2(G)) / 2
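
The two coordinates of the plot follow directly from the definitions on this slide. A minimal Python sketch, assuming paired red and green intensities that are all positive (the function name is illustrative):

```python
from math import log2

def ma_values(red, green):
    """Return (M, A) lists for paired red and green intensities (all > 0).
    M = log2(R/G) is the log ratio; A is the average log2 intensity."""
    m = [log2(r / g) for r, g in zip(red, green)]
    a = [(log2(r) + log2(g)) / 2 for r, g in zip(red, green)]
    return m, a
```

With `red = [8, 4]` and `green = [2, 4]`, the first spot has M = 2 and A = 2 (a four-fold red excess at moderate intensity) and the second has M = 0 and A = 2.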

  33. Global (lowess) Normalization • Same data set normalized by: M_norm = M - c(A), where c(A) is an intensity-dependent function estimated by local regression.
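
To make the role of c(A) concrete, here is a simplified stand-in for lowess: a single pass of tricube-weighted local linear regression, without the robustness iterations that full lowess adds. The function names and the `frac` default are assumptions, and the quadratic running time makes it suitable only for small examples.

```python
def local_fit(a_values, m_values, frac=0.5):
    """Estimate c(A) at each point by tricube-weighted local linear regression
    over roughly the nearest `frac` of the points (one pass, not robust)."""
    n = len(a_values)
    k = min(n, max(2, int(frac * n)))
    fitted = []
    for a0 in a_values:
        dists = sorted(abs(a - a0) for a in a_values)
        h = dists[k - 1] or 1e-12          # bandwidth: distance to k-th nearest point
        w = [max(0.0, 1 - (abs(a - a0) / h) ** 3) ** 3 for a in a_values]
        sw = sum(w)
        a_bar = sum(wi * a for wi, a in zip(w, a_values)) / sw
        m_bar = sum(wi * m for wi, m in zip(w, m_values)) / sw
        sxx = sum(wi * (a - a_bar) ** 2 for wi, a in zip(w, a_values))
        sxy = sum(wi * (a - a_bar) * (m - m_bar)
                  for wi, a, m in zip(w, a_values, m_values))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(m_bar + slope * (a0 - a_bar))
    return fitted

def lowess_normalize(a_values, m_values, frac=0.5):
    """Return M_norm = M - c(A) for every spot."""
    c = local_fit(a_values, m_values, frac)
    return [m - ci for m, ci in zip(m_values, c)]
```

If the log ratios depend on intensity only through a smooth trend, the normalized values scatter around zero at every intensity. Real differential expression survives as long as it affects a minority of spots at any given intensity.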

  34. Print-tip Normalization Separate lowess curves for each of 16 print-tips Print-tip layout Box plots for log ratios in each of 16 print-tip groups

  35. Scaled Print-tip Normalization • There still remain apparent technical artifacts • Try to fix them by scaling after print-tip lowess: M_p,norm = s_p · (M_p - c_p(A)) • Panels: box plot after print-tip normalization; box plot after scaled print-tip normalization
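
The slide does not say how the scale factor s_p is chosen. One plausible choice, assumed here purely for illustration, equalizes the median absolute deviation (MAD) of every print-tip group by setting s_p to the geometric mean of all group MADs divided by the MAD of group p. The input is taken to be log ratios already centred by each tip's own lowess curve.

```python
from math import exp, log
from statistics import median

def mad(values):
    """Median absolute deviation from the median."""
    med = median(values)
    return median(abs(v - med) for v in values)

def scale_print_tips(groups):
    """groups: dict mapping print-tip id -> list of centred log ratios,
    i.e. M_p - c_p(A). Returns the same structure multiplied by s_p so that
    every group ends up with the same MAD (assumes every MAD is > 0)."""
    mads = {p: mad(vals) for p, vals in groups.items()}
    geo_mean = exp(sum(log(m) for m in mads.values()) / len(mads))
    return {p: [v * geo_mean / mads[p] for v in vals]
            for p, vals in groups.items()}
```

Using the geometric mean keeps the overall spread of the log ratios roughly unchanged while pulling unusually wide or narrow print-tip groups toward the others.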

  36. Spatial Effects • Panels: no normalization; global normalization; scaled print-tip normalization; print-tip normalization

  37. This is too Complex! • We are piling fudge factor upon fudge factor • We don’t know what errors these fudges are introducing… • … and still haven’t removed all artifacts • How about something simpler!

  38. Quantile Normalization • Currently most widely used ‘best’ method • Implemented in BioC affy, oligo, limma • Ignores causes of variation and technical covariates • Shoehorns all data into the same shape distribution – matching quantiles

  39. Motivation: Probe Intensities in 23 Replicates on Affy chips Densities of intensities from GeneLogic spikein study; black is composite

  40. Quantile Normalization (Irizarry et al 2002) • Map values in each curve separately to their quantile within the distribution • Map quantiles of each distribution to quantiles of the reference curve The mapping by quantile normalization
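
The two mapping steps can be written compactly. In this sketch the reference curve is taken to be the rank-by-rank mean of the sorted chips, which is a common choice but an assumption here; tied values are not averaged, which a production implementation would normally handle.

```python
def quantile_normalize(chips):
    """chips: list of equal-length lists, one per chip.
    Each value is replaced by the reference value at the same rank; the
    reference distribution is the mean of the sorted chips, rank by rank."""
    n = len(chips[0])
    sorted_chips = [sorted(c) for c in chips]
    reference = [sum(sc[r] for sc in sorted_chips) / len(chips) for r in range(n)]
    out = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])   # probe indices by rank
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = reference[rank]
        out.append(normalized)
    return out
```

After this step every chip has exactly the same set of values, so all chip densities coincide; only the assignment of values to probes differs between chips.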

  41. Ratio Intensity Correction? M-A plots of raw data from Affy chip pairs

  42. Ratio-Intensity: After Quantile Norm

  43. Critiques of Quantile Normalization • Artificially compresses variation of highly expressed genes • Confounds systematic changes due to cross-hybridization with changes in abundance for genes of low expression • Induces artifactual correlations in low-intensity genes

  44. Technical Variable Regression • Hypothesis: • Most technical variation between chips is caused by a few systemic factors • Probes with similar technical characteristics (Tm, position in gene, location on chip, intensity) will be distorted by similar amounts • Normalization: estimate bias due to technical factors by local averaging of deviations from reference profile
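
A minimal one-covariate sketch of the normalization step: the bias for each probe is estimated as the mean deviation from the reference profile among the probes nearest to it in rank order of a technical covariate, and then subtracted. The window size, the use of a plain running mean, and the function name are assumptions; the approach described on the slide uses several covariates at once.

```python
def covariate_normalize(chip, reference, covariate, window=3):
    """Subtract from each probe the mean deviation (chip - reference) of the
    `window` probes closest to it in rank order of a technical covariate."""
    n = len(chip)
    dev = [c - r for c, r in zip(chip, reference)]
    order = sorted(range(n), key=lambda i: covariate[i])
    half = window // 2
    out = list(chip)
    for pos, i in enumerate(order):
        lo, hi = max(0, pos - half), min(n, pos + half + 1)
        neighbours = [dev[order[j]] for j in range(lo, hi)]
        out[i] = chip[i] - sum(neighbours) / len(neighbours)
    return out
```

In practice the window must be wide enough that the local average reflects technical bias shared by many probes rather than the biological signal of a few genes.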

  45. Further Questions

  46. Models for Multiple-Probe Affymetrix and NimbleGen Data

  47. Many Probes for One Gene • How to combine signals from multiple probes into a single gene abundance estimate? • Figure labels: 3´; gene sequence; multiple oligo probes; perfect match; mismatch

  48. Probe Variation • Individual probes don’t agree on fold changes • Probes vary by two orders of magnitude on each chip • CG content is most important factor in signal strength Signal from 16 probes along one gene on one chip

  49. Probe Measure Variation • Typical probes are two orders of magnitude different! • CG content is most important factor • RNA target folding also affects hybridization

  50. Many Approaches • Affymetrix MicroArray Suite • dChip - Li and Wong, HSPH • Bioconductor: • RMA - Bolstad, Irizarry, Speed, et al • affyPLM – Bolstad • gcRMA – Wu • Physical chemistry models – Zhang et al
