
Statistical Analysis

Statistical Analysis. Dr. Lars Eijssen. Contents. Statistics of differential gene expression Multiple testing Unsupervised methods The arrayanalysis statistics module. Part 1:. Statistics of differential gene expression. Data analysis overview. Microarray scans.


Presentation Transcript


  1. Statistical Analysis Dr. Lars Eijssen

  2. Contents • Statistics of differential gene expression • Multiple testing • Unsupervised methods • The arrayanalysis statistics module

  3. Part 1: Statistics of differential gene expression

  4. Data analysis overview Microarray scans Image analysis Raw data • Background correction • Normalisation Quality control Further pre-processing Normalised data Statistical analysis List of regulated genes Pattern analysis Pathway analysis Literature data Results Untreated (control) Exposed to compound Slide based on a slide from J. Pennings, RIVM, NL

  5. What is changed? • “Every gene that has changed two-fold is relevant” • Doesn’t take variation into account

  6. Statistical testing • So add a statistical test with the null hypothesis that the gene is not changed between the groups • This gives you, apart from the change, also a significance level (P value)

  7. Input for statistics • Normalised data table • Grouping information

  8. Output of statistics • List of differentially expressed genes between experimental groups • How much difference? • How significant? • Replicates

  9. Recall

  10. The Fold Change • One is interested in computing the fold change between experimental groups • For example: Gene_A is 2-fold upregulated in patients versus controls: Gene_A_patient / Gene_A_control = 2 • This is a division between groups

  11. Asymmetry of the Fold Change • ‘Raw’ ratio (FC) axis: 0, ½, 1, 2, ∞ • Downregulated: packed in (0,1) • Upregulated: spread over (1,∞)

  12. Log transformation • After logging (and normalisation) one can compute the difference in means (‘logFC’) between several experimental groups • log2(Gene_A_patient / Gene_A_control) = log2(2) • log2(Gene_A_patient) - log2(Gene_A_control) = 1 • A difference is easier to handle statistically than a division

  13. Symmetry of the logged Fold Change • The logFC ‘spreads out’ the data and offers symmetry • ‘Raw’ ratio (FC): ½, 1, 2 • Log ratio (logFC): log2 of ½, 1, 2
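The log transformation on slides 12 and 13 can be checked with a few lines of Python. This is only a sketch: the expression values are hypothetical, chosen so that the patient value is twice the control value.

```python
import math

# Hypothetical expression values for one gene on the original (unlogged) scale.
gene_a_patient = 200.0
gene_a_control = 100.0

# Fold change: a ratio between the groups.
fold_change = gene_a_patient / gene_a_control

# logFC: a difference between the logged values.
log_fc = math.log2(gene_a_patient) - math.log2(gene_a_control)

# The log of a ratio equals the difference of the logs.
assert math.isclose(log_fc, math.log2(fold_change))

# Symmetry: halving and doubling are equally far from "no change" (0).
log_ratios = [math.log2(r) for r in (0.5, 1.0, 2.0)]
print(fold_change, log_fc, log_ratios)
```

On the log scale a two-fold decrease and a two-fold increase have the same size and opposite sign, which is the symmetry the slide refers to.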

  14. Linear regression modelling • Often used approach • For those familiar: corresponds to ANOVA analysis

  15. A basic example: two groups • Suppose we have an experiment with patients and controls • How can we compute the difference between those for a certain gene?

  16. Experimental design

  17. The model Gene_expression_A ~ group • This means that gene expression is modelled based on group • The average in patients is allowed to be different from the average in controls

  18. Contrasts • Linear regression modelling computes coefficients for each of the variables in the model • Such as group • From these we can compute the differences between the groups, called contrasts

  19. Contrasts and fold changes • Contrasts directly correspond to the logFC between the groups • To get the FC (ratio) for the data on the original scale we can easily compute: FC = 2^logFC
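As a rough illustration of slides 17 to 19, the two-group model can be fitted by hand. The sketch below assumes made-up log2 expression values and a 0/1 coding of group (control = 0, patient = 1). It is plain least squares, not the limma implementation used later in the course.

```python
from statistics import mean

# Hypothetical log2 expression values for Gene_A (made-up numbers).
controls = [7.9, 8.1, 8.0, 8.0]
patients = [9.0, 9.1, 8.9, 9.0]

# Model Gene_expression_A ~ group, with group coded 0 (control) or 1 (patient).
x = [0] * len(controls) + [1] * len(patients)
y = controls + patients

# Ordinary least squares for one predictor: slope = cov(x, y) / var(x).
x_bar, y_bar = mean(x), mean(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar  # estimated mean of the control group

log_fc = slope             # contrast: patients minus controls, on the log2 scale
fold_change = 2 ** log_fc  # back to a ratio on the original scale
print(intercept, log_fc, fold_change)
```

With this coding the group coefficient is the difference between the two group means, so the contrast is the logFC directly.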

  20. More extensive models • Gene_expression_A ~ group + day • Gene_expression_A ~ group + day + group*day • Or when group has more than two levels, for example three: Gene_expression_A ~ group enables estimation of three contrasts (group 1 versus 2, group 1 versus 3, and group 2 versus 3)
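For the three-level case, the three contrasts can be sketched as pairwise differences of group means. The group names and values below are hypothetical, and only the model with `group` alone is covered (no `day` term).

```python
from itertools import combinations
from statistics import mean

# Hypothetical log2 expression of one gene in three groups.
groups = {
    "group1": [8.0, 8.2, 7.8],
    "group2": [9.0, 9.1, 8.9],
    "group3": [10.0, 9.8, 10.2],
}

# With only `group` in the model, the fitted value per level is the group mean.
group_means = {name: mean(values) for name, values in groups.items()}

# Each pairwise contrast is a difference of two group means (a logFC).
contrasts = {
    f"{a} versus {b}": group_means[a] - group_means[b]
    for a, b in combinations(groups, 2)
}
for name, log_fc in contrasts.items():
    print(name, round(log_fc, 3))
```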

  21. Example output

  22. Recall:

  23. Significant, but … relevant??? • Is a FC of 1.005, with a p-value of 0.0001, biologically relevant? • One can also put a cut-off on the FC
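Combining a significance cut-off with a fold-change cut-off might look like the sketch below. The results table, the gene names and both cut-off values are assumptions for illustration, not values recommended by the slides.

```python
# Hypothetical results table: (gene, logFC, P value); all numbers are made up.
results = [
    ("Gene_A", 1.000, 0.0002),
    ("Gene_B", 0.007, 0.0001),   # FC of about 1.005: significant but tiny
    ("Gene_C", -1.500, 0.0300),
    ("Gene_D", 2.000, 0.2000),   # large change but not significant
]

P_CUTOFF = 0.05      # assumed significance cut-off
FC_CUTOFF = 1.5      # assumed fold-change cut-off, in either direction

def passes(log_fc, p_value):
    """Keep a gene only if it is both significant and changed enough."""
    fold_change = 2 ** abs(log_fc)   # abs() treats up and down symmetrically
    return p_value < P_CUTOFF and fold_change >= FC_CUTOFF

selected = [gene for gene, log_fc, p in results if passes(log_fc, p)]
print(selected)
```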

  24. Volcano Plot • Shows both the significance and the logFC • P values on a -log10 scale Image: J. Pennings, RIVM, NL
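The coordinates of a volcano plot can be computed as in this sketch. The gene names and numbers are hypothetical, and the plotting itself is left out.

```python
import math

# Hypothetical (logFC, P value) pairs.
results = {"Gene_A": (1.0, 0.001), "Gene_B": (-2.0, 0.05), "Gene_C": (0.1, 0.5)}

# x axis: logFC; y axis: -log10(P), so smaller P values end up higher.
volcano_points = {
    gene: (log_fc, -math.log10(p)) for gene, (log_fc, p) in results.items()
}
for gene, (x, y) in volcano_points.items():
    print(f"{gene}: x={x:+.2f}, y={y:.2f}")
```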

  25. Part 2: Multiple testing

  26. Is a P of 0.05 significant? 5000 – 50000 tests

  27. Suppose 7000 genes • At a P value cut-off of 0.05, expected: 7000 * 0.05 = 350 significant by chance

  28. Corrections for multiple testing • FWER (family-wise error rate): correct the P value for the number of tests • The simplest example is the Bonferroni correction • Corrected P value threshold = 0.05 / number of tests done • For example: 0.05 / 7000 = 7.14e-06 • Too strict: any results left?
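The arithmetic on slides 27 and 28 can be reproduced as follows. This is a minimal sketch: the raw P values are hypothetical, and `bonferroni_adjust` is a helper name introduced here for illustration.

```python
ALPHA = 0.05
N_TESTS = 7000

# Expected number of genes below ALPHA by chance if no gene truly changes.
expected_by_chance = N_TESTS * ALPHA

# Bonferroni: compare each raw P value against ALPHA / N_TESTS ...
bonferroni_threshold = ALPHA / N_TESTS

def bonferroni_adjust(p_value, n_tests=N_TESTS):
    """... or, equivalently, multiply the P value by n_tests (capped at 1)."""
    return min(1.0, p_value * n_tests)

raw_p_values = [1e-7, 5e-6, 1e-4, 0.03]   # hypothetical raw P values
significant = [p for p in raw_p_values if bonferroni_adjust(p) < ALPHA]
print(expected_by_chance, bonferroni_threshold, significant)
```

Note how a raw P value of 1e-4, which looks convincing on its own, no longer passes once 7000 tests are taken into account.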

  29. FDR • Other corrections are more realistic • For example correcting the False Discovery Rate • These corrections make sure that the fraction of false positives among the hits is controlled • Number of wrong hits / total number of hits • This means one does not have to consider the total number of tests, but only the number of positive (significant) tests
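The slides do not name a specific FDR procedure. As an assumption, the sketch below uses the widely used Benjamini-Hochberg adjustment, applied to hypothetical raw P values.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted P values in the same order as the input."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest P first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.60]   # hypothetical raw P values
adjusted = benjamini_hochberg(raw)
hits = [p for p, q in zip(raw, adjusted) if q < 0.05]
print([round(q, 4) for q in adjusted], hits)
```

Each P value is scaled by the number of tests divided by its rank, so the correction is milder than Bonferroni for all but the smallest P value.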

  30. Part 3: Unsupervised methods

  31. Supervised versus unsupervised • Methods such as statistical testing are supervised • One can also apply unsupervised methods • Two of those we have already seen at the QC

  32. Clustering • One can cluster samples, genes or both Image from J. Pennings, RIVM, NL

  33. Similarity of two expression profiles • Euclidean distance • Correlation distance
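The two distance measures can behave very differently on the same pair of profiles. The sketch below assumes the common definition of correlation distance as 1 minus the Pearson correlation, and uses two hypothetical profiles with the same pattern at different expression levels.

```python
import math
from statistics import mean

def euclidean_distance(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 - Pearson correlation: 0 for the same shape, 2 for the opposite shape."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical profiles: same pattern across four samples, different level.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [11.0, 12.0, 13.0, 14.0]
print(euclidean_distance(gene_x, gene_y), correlation_distance(gene_x, gene_y))
```

Euclidean distance sees these two genes as far apart, while correlation distance sees them as identical in shape, so the choice of measure determines what "similar" means in the clustering.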

  34. Building a tree (figure: items a, b, c, d, e are merged over steps 0 to 4; clusters shown: a,b; d,e; c,d,e; a,b,c,d,e) Tree is constructed! Adapted from Kaufman and Rousseeuw (1990)
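Building such a tree bottom-up can be sketched as repeated merging of the two closest clusters. The slide does not say which linkage rule is used, so single linkage is assumed here, and the one-dimensional positions of a to e are hypothetical.

```python
def single_linkage(points):
    """Agglomerative clustering; returns the cluster formed at each merge step."""
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(points[p] - points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] | clusters[j]
        # Replace the two merged clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        merges.append("".join(sorted(merged)))
    return merges

# Hypothetical one-dimensional positions for five items a to e.
points = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 8.0, "e": 9.5}
print(single_linkage(points))
```

The order of the merges, read from first to last, gives the tree from the leaves up to the root.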

  35. PCA analysis • Also here, scaling is important

  36. Part 4: The arrayanalysis.org statistics module

  37. Limma • A Bioconductor package for R that allows for linear modelling • It uses an ‘adapted’ t-test • Improved estimate of variation • It does a Bayesian smoothing of the gene-wise variances, which affects the P values • This package is called by arrayanalysis.org

  38. arrayanalysis.org • Besides the QC and normalization module, it contains a module for statistical analysis • This has not yet been added to the open site • So we will work on the developers’ site in the afternoon

  39. P value histogram

  40. Number of significant genes table

  41. Results table

  42. Filtered results table (Significant genes list)

  43. Arrayanalysis.org • We use the beta version of the AffyAnalysisStat module • Some inconveniences / small bugs are still there • Don’t worry! • In the practical you will get instructions on how to operate it

  44. Project members Lars Eijssen, Magali Jaillard, Anwesha Dutta
