
Bioinformatics

This article discusses differential expression analysis in bioinformatics, including preprocessing, filtering, normalization, and statistical testing methods such as the t-test and SAM.

smithcindy

Presentation Transcript


  1. Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006

  2. Preprocessing • Array-by-array approach: background correction, log transformation, filtering, normalization, ratio, test statistic (t-test) • ANOVA-based approach: background correction, log transformation, filtering, linearisation, bootstrapping

  3. Overview further analysis • Raw data, preprocessing, preprocessed data • Test statistic: differentially expressed genes • Clustering: clusters of coexpressed genes

  4. Preprocessing: test statistic Comparison of 2 experiments: • Fold test • T-test • SAM • … A plethora of different methods is available. Which one performs best? • Different underlying statistical assumptions • Implications for the final result • Difficult to define the best method

  5. Diff Expr Genes: test statistic Type1: Comparison of 2 samples Control sample Induced sample Statistical testing Retrieve statistically over or under expressed genes

  6. Diff Expr Genes : test statistic • black/white experiment description (array V mice genes) • Condition 1 : pygmy mouse 10 days old (test) • Condition 2 : normal mouse 10 days old (ref) • Goal: detect differentially expressed genes • Experiment design (Latin Square): Array 1, Array 2 • Per gene, per condition 4 measurements available

  7. Diff Expr Genes : test statistic Fold change (ratio test) • 4 measurements per gene, condition • Calculate average • Sort averages • Sample/control ratio > threshold (usually 2-fold, i.e. |log2(sample/control)| > 1) Drawbacks: • Arbitrary threshold • Discards all information obtained from replicates • Implicitly assumes constant variance, but variance depends on expression value
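The fold test on this slide can be sketched in a few lines. This is only an illustration: the gene names and intensities are made up, and `fold_test` is a hypothetical helper, not part of any package mentioned in the slides.

```python
import math
from statistics import mean

def log2_fold_change(test, ref):
    """log2 of the ratio of the average test and reference intensities."""
    return math.log2(mean(test) / mean(ref))

def fold_test(genes, threshold=2.0):
    """Names of genes whose average fold change reaches `threshold`
    in either direction, i.e. |log2 ratio| >= log2(threshold)."""
    cutoff = math.log2(threshold)
    return [name for name, (test, ref) in genes.items()
            if abs(log2_fold_change(test, ref)) >= cutoff]

# Hypothetical intensities: 4 replicate measurements per gene and condition.
genes = {
    "gene_a": ([820, 790, 805, 785], [400, 410, 395, 395]),          # consistent 2-fold
    "gene_b": ([30, 200, 25, 145], [50, 40, 60, 50]),                # noisy, low level
    "gene_c": ([5200, 5100, 5300, 5200], [4000, 4100, 3900, 4000]),  # consistent 1.3-fold
}
selected = fold_test(genes)
print(selected)
```

Because only the averages are compared, the noisy `gene_b` passes the threshold just like the consistent `gene_a`, while the reproducible but smaller change in `gene_c` is discarded. This is the loss of replicate information the slide refers to.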

  8. Diff Expr Genes : test statistic Why does fold change fail: • Majority of genes expressed at low levels where signal/noise is low => not sufficiently conservative • 2-fold change occurs at random for a large number of genes • High number of false positives • At higher levels of expression, smaller changes in gene expression may be real => too conservative • High number of false negatives Improvement: • T-test • Pairwise fold change: genes are significantly differentially expressed if an R-fold change is observed consistently between paired samples • SAM http://www-stat-class.stanford/SAM/SAMServlet

  9. Diff Expr Genes : test statistic T-test: hypothesis test • Possible if replicates of reference and test are available • Significance of the difference between the reference and test data (level of expression) relative to the observed level of within-class variation (consistency) • Assumptions • Normal distribution of variables • Population mean and variance estimated from data => (Student t distribution for H0 hypothesis) • Not all genes need to have the same variance • Under the null hypothesis sample means should be equal (rescaling obligatory)

  10. Diff Expr Genes : test statistic Paired t-test (microarray data are paired) • Consider paired data as new variable • Calculate average ratio • Calculate standard deviation of the 4 ratio measurements • Determine t-value • From df, Student t distribution and t-value, obtain the p-value • p-value: the probability of obtaining a test statistic at least as extreme as the observed one if the null hypothesis is true (not the probability that the null hypothesis is true)
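The paired t-test steps listed on this slide can be sketched as follows. The four log-ratios are invented for illustration, and the function names are hypothetical. The p-value is obtained by numerically integrating the Student t density so that the sketch needs only the standard library; a statistics package would normally be used for this step.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of the Student t distribution with `df` degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value P(|T| >= |t|), using Simpson's rule on the density."""
    b = abs(t)
    if b == 0.0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = 2 * total * h / 3  # P(-|t| < T < |t|), using symmetry around 0
    return max(0.0, 1.0 - central)

def paired_t_test(log_ratios):
    """t-value, degrees of freedom and p-value for H0: mean log-ratio = 0."""
    n = len(log_ratios)
    t = mean(log_ratios) / (stdev(log_ratios) / math.sqrt(n))
    return t, n - 1, two_sided_p(t, n - 1)

# Hypothetical log2(test/ref) ratios from the 4 paired measurements of one gene.
t_value, df, p_value = paired_t_test([1.1, 0.9, 1.2, 0.8])
print(t_value, df, p_value)
```

With 4 paired measurements there are only 3 degrees of freedom, so the t-value has to be large (about 3.18 in absolute value) before the two-sided p-value drops below 5%.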

  11. Diff Expr Genes : test statistic t-test [Figure: gene x, H0: D = 0 versus H1: D ≠ 0, Type I and Type II errors] • Classical hypothesis tests (t-test, Wilcoxon rank-sum test, ...): • a test statistic is calculated (t-value) • the p-value is calculated: the probability that an equally good or better test statistic is generated if a certain null hypothesis is true • The null hypothesis: gene has no difference in mean expression levels between 2 conditions • Low p-value (below rejection level α): null hypothesis is not likely: reject null hypothesis: there is a difference in (mean) expression between the two classes

  12. Diff Expr Genes : test statistic Comparison of fold test with paired t-test • Gene expression levels measured under two different conditions • Rejection level α • pj < α: null hypothesis rejected (result Positive) • pj > α: null hypothesis not rejected (result Negative) • But: multiple testing: Type I and Type II errors = false positives and false negatives

  13. Diff Expr Genes : test statistic SAM • Each gene is assigned a score on the basis of its change in expression relative to the standard deviation of repeated measurements for that gene • H0 (expected relative difference) is estimated by permutation analysis • Permute the samples • Calculate d(i) values for both the experimental samples and the permuted control samples • Rank genes by magnitude of their d(i) values for both the experimental and the permuted control samples

  14. Diff Expr Genes : test statistic SAM • Observed values • Calculate d(i) value for each gene • Rank genes according to their d(i) value • Simulated values • Permute dataset • Calculate d(i) value for each gene in each permuted dataset • Calculate average d(i) value for each gene • Rank d(i) values • Make scatterplot
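A minimal sketch of the SAM procedure outlined on the last two slides is given below, under several assumptions that the slides do not state: the score is taken as d(i) = (mean test - mean ref) / (s(i) + s0) with a fixed, arbitrarily chosen constant `s0`; the expected values are obtained by sorting the d(i) within each permuted data set and averaging per rank; and genes are called when observed and expected scores differ by more than a hypothetical cutoff `delta`. The data are simulated, and all function names are illustrative.

```python
import random
from statistics import mean, stdev

def d_scores(data, labels, s0=0.1):
    """Relative difference d(i) per gene: mean difference / (standard error + s0)."""
    scores = []
    for row in data:
        test = [x for x, lab in zip(row, labels) if lab == 1]
        ref = [x for x, lab in zip(row, labels) if lab == 0]
        n1, n2 = len(test), len(ref)
        pooled = ((n1 - 1) * stdev(test) ** 2 + (n2 - 1) * stdev(ref) ** 2) / (n1 + n2 - 2)
        s = (pooled * (1 / n1 + 1 / n2)) ** 0.5
        scores.append((mean(test) - mean(ref)) / (s + s0))
    return scores

def expected_d(data, labels, n_perm=200, seed=0):
    """Average of the ranked d(i) values over permuted sample labels."""
    rng = random.Random(seed)
    sums = [0.0] * len(data)
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        for rank, d in enumerate(sorted(d_scores(data, shuffled))):
            sums[rank] += d
    return [s / n_perm for s in sums]

# Simulated data: 50 genes, 4 test + 4 reference samples; genes 0-4 are shifted.
rng = random.Random(1)
labels = [1, 1, 1, 1, 0, 0, 0, 0]
data = [[rng.gauss((4.0 if g < 5 else 0.0) * lab, 0.2) for lab in labels]
        for g in range(50)]

observed = d_scores(data, labels)
expected = expected_d(data, labels)
order = sorted(range(len(data)), key=lambda g: observed[g])  # genes ranked by d(i)

delta = 5.0  # hypothetical cutoff on |observed - expected|
called = [g for rank, g in enumerate(order)
          if abs(observed[g] - expected[rank]) > delta]
print(sorted(called))
```

Plotting `expected[rank]` against `observed[order[rank]]` gives the scatterplot mentioned on the slide: genes far from the diagonal are the candidates for differential expression.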

  15. Diff Expr Genes : test statistic SAM

  16. Diff Expr Genes : test statistic Test statistic / Assumptions / Distribution H0 • T-test, paired t-test: errors normally distributed, errors equal variance (iid); H0 distribution parametrized: Student t-distribution; with a restricted number of repeat measurements it is impossible to evaluate the assumption • SAM: less stringent assumption; no explicit assumption on the H0 distribution (order statistics)

  17. Diff Expr Genes : test statistic

  18. Diff Expr Genes: test statistic Multiple testing: problem • P-value: measure of significance in terms of the false positive rate • The rate at which truly null features are called significant • Significance level of 5%: on average 5% of the truly null features will be called significant (type I error) • Type I error: null hypothesis rejected when it is true ('accidental' low p-value); gene falsely declared differentially expressed = false positive • Multiple testing example: 10000 genes with random expression profiles and α = 5%: one would find about 500 genes with a p-value lower than 5% = false positives • Type II error: null hypothesis not rejected when it is not true (false negative); a gene that is actually differentially expressed is not declared differentially expressed. Adapted from De Smet et al
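The 10000-gene example can be checked with a short simulation. It assumes that p-values of truly null genes are uniformly distributed between 0 and 1; the function name and the seed are arbitrary.

```python
import random

def count_false_positives(n_genes=10000, alpha=0.05, seed=0):
    """Draw one uniform p-value per truly null gene and count those below alpha."""
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_genes)]
    return sum(p < alpha for p in p_values)

expected = 10000 * 0.05             # 500 false positives on average
observed = count_false_positives()  # fluctuates around 500 from seed to seed
print(expected, observed)
```

Every gene counted here is a false positive, because no gene in the simulated data set is differentially expressed.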

  19. Diff Expr Genes: test statistic Multiple testing: solutions • Control of the familywise error rate (FWE): • P(FP ≥ 1): protection against type I errors • Bonferroni correction: reject the null hypothesis at rejection level α/N, which guarantees that FWE = P(FP ≥ 1) ≤ α • Is OK when very few genes are expected to be actually differentially expressed (i.e., affected by the difference in conditions / for which the null hypothesis is false): every false positive is 'costly' • Rejection level becomes very conservative • But in microarray data, usually a considerable number of genes is actually differentially expressed: control of the FWE results in a severe loss of statistical power (FN or type II error is large) • In practice we do not have to protect against every possible FP • Better solution: FDR (false discovery rate) Adapted from De Smet et al
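The Bonferroni correction amounts to dividing the rejection level by the number of tests. The sketch below uses ten invented p-values; `bonferroni_reject` is an illustrative name.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for every p-value at or below alpha / N (N = number of tests)."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]

# Hypothetical p-values for N = 10 genes.
p_values = [0.00001, 0.0004, 0.003, 0.02, 0.04, 0.3, 0.6, 0.8, 0.9, 0.95]

uncorrected = [p <= 0.05 for p in p_values]  # 5 genes called significant
corrected = bonferroni_reject(p_values)      # cutoff 0.005: only 3 genes remain
print(sum(uncorrected), sum(corrected))
```

With N = 10000 genes the cutoff would drop to 0.000005, which illustrates why the slide calls the corrected rejection level very conservative.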

  20. FDR Diff Expr Genes: test statistic • We need a sensible balance between the number of true positives and the number of false positives • Therefore it is better to control the 'False Discovery Rate' (FDR) instead of the FWE: • The false positive rate: the rate at which truly null features are called significant • The FDR = % of false positives among all the genes that are declared positive = % of true null hypotheses erroneously rejected among all the null hypotheses rejected Adapted from De Smet et al

  21. Diff Expr Genes: test statistic Difference p-value and FDR • 5% FDR: 5% false positives among the features called significant • 5% p-value cutoff: 5% false positives among all the null features in the dataset; this says little about the content of the features actually called significant
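The difference between the two rates becomes concrete with counts. The numbers below are hypothetical: 10000 genes of which 9000 are truly null, a 5% p-value cutoff, and an assumed 80% of the truly changed genes detected.

```python
def false_positive_rate(fp, tn):
    """Fraction of truly null features that are called significant."""
    return fp / (fp + tn)

def false_discovery_rate(fp, tp):
    """Fraction of false positives among all features called significant."""
    return fp / (fp + tp) if fp + tp else 0.0

# Hypothetical outcome: 9000 null genes and 1000 differentially expressed genes.
fp, tn = 450, 8550  # 5% of the 9000 null genes pass the cutoff
tp, fn = 800, 200   # assumed 80% of the 1000 changed genes are detected

fpr = false_positive_rate(fp, tn)
fdr = false_discovery_rate(fp, tp)
print(fpr, fdr)
```

In this example the false positive rate is 5%, yet 36% of the 1250 genes called significant are false positives.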

  22. An estimate of E[S(t)] is the observed S(t): i = the number of observed p-values ≤ p_i • E[F(t)] = N0 · p_i • Estimate N0 [Figure: p-value distributions. No real differential expression (randomised data set): uniform distribution. Non-accidental differential expression: superposition of two distributions. Rejection level α; TP, FN, FP, TN] Adapted from De Smet et al
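Combining the two quantities on this slide gives an FDR estimate of N0 · p_i / i at the i-th smallest p-value. The sketch below is one way to do this, under assumptions the slide does not spell out: N0 is estimated from the flat right-hand part of the p-value histogram (p-values above a hypothetical tuning value `lam`), which is the idea behind Storey's estimator, and setting N0 = N instead gives the more conservative Benjamini-Hochberg style estimate. The p-values are invented.

```python
def estimate_n0(p_values, lam=0.5):
    """Estimate the number of truly null features from the p-values above `lam`,
    assuming null p-values are uniform so that a fraction (1 - lam) lies there."""
    return min(len(p_values), sum(p > lam for p in p_values) / (1 - lam))

def estimated_fdr(p_values, n0=None):
    """For the i-th smallest p-value p_i, estimate FDR = N0 * p_i / i (capped at 1)."""
    n0 = len(p_values) if n0 is None else n0
    ranked = sorted(p_values)
    return [(p, min(1.0, n0 * p / i)) for i, p in enumerate(ranked, start=1)]

# Hypothetical p-values for 10 genes.
p_values = [0.001, 0.002, 0.01, 0.03, 0.2, 0.4, 0.6, 0.7, 0.8, 0.9]

n0 = estimate_n0(p_values)  # 4 p-values above 0.5, divided by 0.5
fdr = estimated_fdr(p_values, n0)
for p, q in fdr:
    print(p, round(q, 4))
```

The sketch does not enforce that the estimates increase monotonically with the p-value, which a full q-value calculation would add.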

  23. Overview MICROARRAY PREPROCESSING • Gene expression • Omics era • Transcript profiling • Experiment design • Preprocessing • Slide by slide normalisation • ANOVA • Exercises
