
Variability & Statistical Analysis of Microarray Data GCAT – Georgetown July 2004

Jo Hardin, Pomona College, jo.hardin@pomona.edu


Presentation Transcript


  1. Variability & Statistical Analysis of Microarray Data, GCAT – Georgetown, July 2004. Jo Hardin, Pomona College, jo.hardin@pomona.edu

  2. Variability • key to statistics • within slide vs. between slide • replication • red (Cy5) > green dye (Cy3): dye swap • log (base 2) transformation
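A minimal sketch of the last two bullets on slide 2, using made-up intensities. The function names are hypothetical, and the dye-swap formula assumes the dye bias is a constant added to every log ratio, which is a simplification.

```python
import math

def log_ratio(red, green):
    """Log (base 2) ratio of Cy5 (red) to Cy3 (green) intensity."""
    return math.log2(red / green)

# On the log scale, 2-fold up and 2-fold down are symmetric around 0.
up = log_ratio(2000, 1000)     # 2-fold up
down = log_ratio(1000, 2000)   # 2-fold down

def dye_swap_average(forward, swapped):
    """Combine a slide with its dye swap.

    forward = log2(sample in red / reference in green)
    swapped = log2(reference in red / sample in green)
    Assumption: the same dye bias is added to both log ratios,
    so it cancels when the swapped value is subtracted.
    """
    return (forward - swapped) / 2

# True log ratio 1.0 with a hypothetical dye bias of 0.3 on each slide.
corrected = dye_swap_average(1.0 + 0.3, -1.0 + 0.3)
```

Without the log transformation the same two spots would read 2.0 and 0.5, which are not equally far from "no change" at 1.0.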

  3. Example: • Variation in Gene Expression Patterns in Follicular Lymphoma and the Response to Rituximab, by Bohen, Troyanskaya, Alter, Warnke, Botstein, Brown, and Levy • 2 groups: those who responded to treatment, and those who did not respond to treatment. • Cy5 dye used on malignant lymphoid tissues, Cy3 dye used on mRNA derived from cell lines • Biopsies obtained before treatment with Rituximab • Are there differences in gene expression between those who responded to treatment and those who didn’t?

  4. Data Cleaning • Individual points were median centered for each cDNA clone and filtered for data quality. • Data values are either:

  5. The Data:

  6. Differential Expression Across Two Groups • Fold Change • t-test • Wilcoxon Rank-Sum Test • SAM

  7. Fold Change • Of mean? Of median? • Across treatment groups? vs. reference group? • Small vs. large values • What about how variable the groups are?
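The questions on slide 7 can be made concrete with a small sketch. The data are invented log2 values for one gene, and `log2_fold_change` is a hypothetical helper that takes the difference of group centers on the log2 scale, which is one common convention rather than the only definition.

```python
from statistics import mean, median

def log2_fold_change(group1, group2, center=mean):
    """Difference of group centers on the log2 scale."""
    return center(group1) - center(group2)

# Hypothetical log2 expression values; the first group has one outlier.
responders = [0.1, 0.2, 0.0, 0.3, 4.0]
nonresponders = [0.0, 0.1, -0.1, 0.2, 0.1]

by_mean = log2_fold_change(responders, nonresponders, mean)
by_median = log2_fold_change(responders, nonresponders, median)

# Back on the raw scale the fold change is 2 ** (log2 fold change).
fold_by_mean = 2 ** by_mean
fold_by_median = 2 ** by_median
```

The single outlier drives the mean-based fold change well above the median-based one, and neither number says anything about how variable the groups are, which is the motivation for the tests on the following slides.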

  8. An Example using one gene:

  9. t-test • Test statistic: • p-value = probability of seeing your data or more extreme if there is no difference in the groups

  10. t-test in Excel • Syntax: TTEST(array1,array2,tails,type) • Example: • first group is in cells c3 – k3 • second group is in cells l3 – v3 • we want a two sided t-test (no preconceived idea about which group is more highly expressed), so tails = 2 • we assume the variances are unequal, so type = 3 • in cell w3 type: “=ttest(c3:k3,l3:v3,2,3)”
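For readers working outside Excel, here is a sketch of the unequal-variance (Welch) two-sample t statistic, which is the case selected by type 3 in TTEST. The function name and the data are hypothetical. The sketch returns the statistic and its approximate degrees of freedom only; turning those into a p-value needs the t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group1, group2):
    """Two-sample t statistic without assuming equal variances.

    Returns (t, df), where df is the Welch-Satterthwaite
    approximation to the degrees of freedom.
    """
    n1, n2 = len(group1), len(group2)
    v1, v2 = variance(group1), variance(group2)   # sample variances
    se_squared = v1 / n1 + v2 / n2
    t = (mean(group1) - mean(group2)) / sqrt(se_squared)
    df = se_squared ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return t, df

# Hypothetical expression values for one gene in the two groups.
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Unlike fold change, the statistic divides the difference in means by a measure of how variable the two groups are.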

  11. Wilcoxon Rank Sum Test • Instead of comparing averages, this test compares rankings (or medians) • In order to discount influential points, we replace the data values with their appropriate rankings. • We compute a z-test (sister of the t-test) on the ranked data.

  12. Up regulated genes Down regulated genes

  13. Technical Details • Replace values with ranks • Sum the ranks in the first group • Calculate hypothesized mean1 = n1*(n1+n2+1)/2 • Calculate hypothesized standard deviation1 = sqrt(n1*n2*(n1+n2+1)/12) • Calculate test statistic = (sum ranks – hyp mean1) / hyp stdev1 • Find the p-value using the normal distribution (probability of being greater than the test statistic if there are no differences in the two groups)
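The steps on slide 13 can be sketched as follows. The helper names are hypothetical, tied values receive their average rank, and the standard deviation carries no tie correction, matching the formula on the slide. Missing values are assumed to have been dropped before the call.

```python
from math import sqrt
from statistics import NormalDist

def average_ranks(values):
    """Rank from smallest (1) to largest; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def rank_sum_test(group1, group2):
    """Normal-approximation Wilcoxon rank-sum test; returns (z, two-sided p)."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    rank_sum = sum(ranks[:n1])                     # sum of ranks in group 1
    hyp_mean = n1 * (n1 + n2 + 1) / 2
    hyp_sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum - hyp_mean) / hyp_sd
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical data: every value in group 1 is below every value in group 2.
z, p = rank_sum_test([1, 2, 3], [4, 5, 6, 7])
```

Ranking from smallest to largest or from largest to smallest only flips the sign of z, so the two-sided p-value is the same either way.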

  14. Wilcoxon Rank Sum in Excel • Using the rank function, translate your data into ranks • Y3: “=RANK(C3,C3:V3)” this finds the rank of C3 in the range C3-V3 (you’ll probably get a “#value” here, that’s OK because C3 is empty for gene = IMAGE:253507) • Repeat this command for Z3 to AR3 keeping the second half of the function always C3:V3 (writing the range as $C3:$V3 keeps it fixed when the formula is filled across) • Copy the row from Y3 to AR3 and paste from Y4 to AR2366 • AS3: “=SUMIF(Y3:AG3,">0",Y3:AG3)” (sum rank grp1) • AT3: “=COUNT(Y3:AG3)*(COUNT(Y3:AR3)+1)/2” (mean1) • AU3: “=SQRT(COUNT(Y3:AG3)*COUNT(AH3:AR3)*(COUNT(Y3:AR3)+1)/12)” (stdev1) • AV3: “=(AS3-AT3)/AU3” (zscore1 = test stat) • AW3: “=2*(1-NORMDIST(ABS(AV3),0,1,TRUE))” (p-value)

  15. SAM (Significance Analysis of Microarrays) • is a statistical technique for finding significant genes in a set of microarray experiments • can be used in a comparison experiment • can also be used with a quantitative response (like tumor size) or with one class data

  16. Technical Details • For the ith gene, comparing two groups, the test statistic is: • Rank the di and keep as test statistics • Permute the data labels 100 times, and calculate expected values for the di given no structure. • Plot observed di vs. expected di
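The formula did not survive in this transcript. The sketch below assumes the two-class statistic from the published SAM method, d_i = (mean1_i - mean2_i) / (s_i + s0), where s_i is the pooled standard error for gene i and s0 is a small positive constant. SAM chooses s0 from the data; here it is simply passed in, and the default 0.1, the function names, and the data layout are all hypothetical.

```python
import random
from math import sqrt
from statistics import mean

def d_stat(g1, g2, s0):
    """SAM-style statistic for one gene: mean difference over (s + s0)."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = mean(g1), mean(g2)
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    s = sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))   # pooled standard error
    return (m1 - m2) / (s + s0)

def observed_and_expected(data, n1, s0=0.1, n_perm=100, seed=0):
    """data: one row per gene; the first n1 columns belong to group 1.

    Returns the sorted observed d values and, position by position,
    the average of the sorted d values over label permutations.
    """
    rng = random.Random(seed)
    observed = sorted(d_stat(row[:n1], row[n1:], s0) for row in data)
    totals = [0.0] * len(data)
    columns = list(range(len(data[0])))
    for _ in range(n_perm):
        rng.shuffle(columns)                     # permute the group labels
        first, second = columns[:n1], columns[n1:]
        permuted = sorted(
            d_stat([row[c] for c in first], [row[c] for c in second], s0)
            for row in data
        )
        totals = [t + d for t, d in zip(totals, permuted)]
    expected = [t / n_perm for t in totals]
    return observed, expected

# Tiny hypothetical data set: 3 genes, 3 samples per group.
data = [[1, 2, 3, 4, 5, 6], [2, 1, 2, 1, 2, 1], [0, 0, 1, 5, 6, 7]]
observed, expected = observed_and_expected(data, n1=3, n_perm=20)
```

Plotting `observed` against `expected` gives the kind of plot described in the last bullet: genes with no group structure fall near the diagonal, and genes far from it are candidates for differential expression.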

  17. False Discovery Rate • We know that the expected di were computed with no group structure. • Any “large” expected di values will be false positives. • If we see 30 observed di above some cutoff and 10 expected di above the same cutoff, we know that we probably have 10 false positives (though we can never know *which* genes are the false positives)
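The arithmetic in the last bullet can be written out as a short sketch. `estimated_fdr` is a hypothetical helper, and it is a simplification: it only divides the expected count by the observed count, whereas SAM's own estimate involves further refinements.

```python
def estimated_fdr(n_called, n_expected_false):
    """Estimated false discovery rate: expected false positives / genes called."""
    if n_called == 0:
        return 0.0          # nothing called, so nothing falsely called
    return n_expected_false / n_called

# Slide example: 30 observed d values above the cutoff, 10 expected.
fdr = estimated_fdr(30, 10)   # one third of the called genes
```

Raising the cutoff usually lowers both counts, and the false discovery rate changes with their ratio.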

  18. Features of SAM • Slider – we can change the false discovery rate • Fold change – in addition to the false discovery rate, we can require the genes to be at some fold change threshold (on average) • Gene lists – gene lists are given along with corresponding significance levels • Web Link option for more information about particular genes

  19. Imputation • Most microarray data has missing values • If background is bigger than foreground, the observed signal will be negative! • Poor quality spots are removed prior to analysis. • SAM needs a full data set which can be computed by: • Substitution of the row average • Substitution using k-nearest neighbors
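Both substitution methods can be sketched in a few lines. This is an illustration under stated assumptions, not SAM's implementation: missing values are represented as `None`, rows are genes, distance is the root mean squared difference over the columns two genes share, and the function names and the default `k` are hypothetical.

```python
from math import sqrt

def impute_row_average(row):
    """Replace each missing value with the average of the gene's observed values."""
    present = [x for x in row if x is not None]
    if not present:
        return list(row)                      # nothing to average
    average = sum(present) / len(present)
    return [average if x is None else x for x in row]

def impute_knn(data, k=2):
    """Replace each missing value with the average of that column over
    the k most similar genes that have a value there."""
    def distance(a, b):
        pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not pairs:
            return float("inf")               # no shared columns to compare
        return sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

    filled = [list(row) for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other genes observed in column j.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(data)
                if m != i and other[j] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled

# Hypothetical data: gene 0 is missing its third value.
genes = [[1.0, 2.0, None], [1.0, 2.0, 3.0], [1.1, 2.1, 5.0], [9.0, 9.0, 100.0]]
by_row = impute_row_average(genes[0])
by_knn = impute_knn(genes, k=2)
```

In this example the row average fills in 1.5, while the nearest-neighbor version borrows from the two genes whose profiles resemble gene 0 and fills in 4.0, ignoring the dissimilar fourth gene.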
