Classifying Gene Expression Profiles from Pairwise mRNA Comparisons

Yan Qi, Biomedical Engineering Department, 10/3/2006.

Presentation Transcript


  1. Classifying Gene Expression Profiles from Pairwise mRNA Comparisons • Yan Qi • Biomedical Engineering Department • 10/3/2006

  2. Outline • The problem of molecular classification • The TSP classifier • Results on three cancer datasets • Extensions of the TSP classifier

  3. Molecular Classification • Objective: predict class labels (e.g. cancer subtypes, disease states) from gene expression profiles • Data: G x n matrix, where G is the number of genes and n is the number of samples; each column is a gene expression profile

  4. Mathematical formulation • Gene expression profile: X = (X1, X2, …, XG) • Binary class label: Y = 1 or Y = 2 • Classifier: a mapping f from X to Y • Training dataset: • A = G x n matrix, n = n1 + n2 • Y: n x 1 vector in which n1 entries are 1 and n2 entries are 2 • Learning: find a mapping from the training data (A, Y) to f that minimizes generalization error

  5. Challenges to standard learning methods • Statistical dilemma: n << G • Examples • Consequence: over-fitting, hence poor generalizability • Practical issue: complex f and decision boundary (DB) • Examples: ANN, SVM, random forests • Consequence: results are hard to interpret biologically and inefficient in diagnostic settings

  6. The TSP classifier: motivation and strategy • Rank-based scoring exploits the expression levels of genes relative to each other and is therefore invariant to normalization. • Reduce model complexity by making the classifier parameter-free. • Select informative gene pairs by proper LOOCV and construct an intuitive, biologically interpretable classification rule by voting.

  7. TSP classifier: rank-based score • Idea: replace expression values by the genes’ ranks within each profile. • Feature: • Score: where
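The rank-based score can be sketched in a few lines of Python. This is a minimal illustration that assumes the score usually associated with top-scoring pairs: for genes i and j, p_ij(c) is the fraction of class-c samples in which X_i < X_j, and the score is |p_ij(1) - p_ij(2)|. The slide's own formula is not preserved, so the names `pair_scores`, `top_scoring_pairs` and the toy data are hypothetical.

```python
from itertools import combinations

def pair_scores(profiles, labels):
    """Score every gene pair (i, j) with i < j.

    profiles: list of samples, each a list of G expression values
    labels:   list of class labels (1 or 2), one per sample
    Returns {(i, j): (p1, p2, delta)}, where pc is the fraction of
    class-c samples with x[i] < x[j] and delta = |p1 - p2|.
    """
    n_genes = len(profiles[0])
    class1 = [x for x, y in zip(profiles, labels) if y == 1]
    class2 = [x for x, y in zip(profiles, labels) if y == 2]
    scores = {}
    for i, j in combinations(range(n_genes), 2):
        p1 = sum(x[i] < x[j] for x in class1) / len(class1)
        p2 = sum(x[i] < x[j] for x in class2) / len(class2)
        scores[(i, j)] = (p1, p2, abs(p1 - p2))
    return scores

def top_scoring_pairs(scores):
    """Return all pairs that achieve the maximum score."""
    best = max(delta for _, _, delta in scores.values())
    return [pair for pair, (_, _, delta) in scores.items() if delta == best]

# Toy data (assumed): 4 samples, 3 genes; gene 0 < gene 1 only in class 1.
profiles = [[1.0, 5.0, 0.5], [2.0, 6.0, 7.0], [7.0, 3.0, 9.0], [8.0, 2.0, 1.0]]
labels = [1, 1, 2, 2]
scores = pair_scores(profiles, labels)
print(top_scoring_pairs(scores))
```

Only the within-sample comparison `x[i] < x[j]` is used, so the actual expression magnitudes never enter the score. The loop visits all G(G-1)/2 pairs, which is fine for a sketch but would need vectorizing for thousands of genes.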

  8. Gene pair selection • TSP: • Number of TSPs and sample size • A few when the sample size is not too small (> 10^2). • Many when the sample size is small (< 10^2). • Example: myocardial tissue gene expression profiles • G = 22283; n1 = 12, n2 = 10 • 2460 statistically significant TSPs

  9. Classification with one gene pair • Let be a unique TSP • Suppose • TSP classifier: • Error on training set:
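A single-pair rule of this kind might look like the sketch below. It assumes the usual convention that the rule predicts the class in which the ordering X_i < X_j is more frequent; the helper names and the toy data are hypothetical and not taken from the slide.

```python
def ordering_freq(profiles, labels, i, j, c):
    """Fraction of class-c samples in which gene i is below gene j."""
    rows = [x for x, y in zip(profiles, labels) if y == c]
    return sum(x[i] < x[j] for x in rows) / len(rows)

def make_tsp_classifier(i, j, p1, p2):
    """Build a rule that predicts class 1 or 2 from the ordering of genes i, j."""
    if p1 >= p2:
        # x[i] < x[j] is more frequent in class 1
        return lambda x: 1 if x[i] < x[j] else 2
    # x[i] < x[j] is more frequent in class 2
    return lambda x: 2 if x[i] < x[j] else 1

def training_error(classify, profiles, labels):
    """Fraction of training samples the rule misclassifies."""
    wrong = sum(classify(x) != y for x, y in zip(profiles, labels))
    return wrong / len(labels)

# Toy data (assumed): 5 samples, 2 genes.
profiles = [[1.0, 5.0], [2.0, 6.0], [4.0, 3.0], [7.0, 3.0], [8.0, 2.0]]
labels = [1, 1, 1, 2, 2]
p1 = ordering_freq(profiles, labels, 0, 1, 1)
p2 = ordering_freq(profiles, labels, 0, 1, 2)
classify = make_tsp_classifier(0, 1, p1, p2)
predictions = [classify(x) for x in profiles]
err = training_error(classify, profiles, labels)
print(predictions, err)
```

In this toy case the training error equals P(Y=1)(1 - p1) + P(Y=2) p2 with the empirical class frequencies, i.e. 0.6 * (1/3) + 0.4 * 0.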

  10. An Example

  11. Classification with multiple gene pairs: majority vote • Seek a mapping from the outputs of multiple single-TSP classifiers to a final prediction. • Let the output of the i-th single-TSP classifier represent a vote from TSP i. Final prediction = the class that receives the majority of the votes. • If the features are conditionally independent given the class and contribute equally, a Naïve Bayes classifier is equivalent to majority vote.
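The voting step can be sketched as follows. The tuple layout `(i, j, class_if_less)` and the tie-breaking choice are assumptions made for illustration; the slide does not specify either.

```python
def vote(x, i, j, class_if_less):
    """Vote of one pair: class_if_less when x[i] < x[j], else the other class."""
    other = 2 if class_if_less == 1 else 1
    return class_if_less if x[i] < x[j] else other

def majority_vote(x, pairs):
    """Combine the votes of several pairs.

    pairs: list of (i, j, class_if_less). An odd number of pairs avoids
    ties; with an even number, a tie falls to class 2 in this sketch.
    """
    votes = [vote(x, i, j, c) for i, j, c in pairs]
    return 1 if votes.count(1) > votes.count(2) else 2

# Three hypothetical pairs over a 4-gene profile.
pairs = [(0, 1, 1), (2, 3, 2), (0, 3, 1)]
sample_a = [1.0, 4.0, 2.0, 3.0]   # votes: 1, 2, 1
sample_b = [5.0, 4.0, 2.0, 3.0]   # votes: 2, 2, 2
print(majority_vote(sample_a, pairs), majority_vote(sample_b, pairs))
```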

  12. Classification with multiple TSPs: majority vote & Naïve Bayes classifier • Assume Where Let Naïve Bayes Classifier: • Each TSP contributes equally if
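One way to see the equivalence is the short derivation below. It is a sketch under stated assumptions, not the slide's own formula: v_k in {1, 2} denotes the vote of pair k, the votes are conditionally independent given the class, the class priors are equal, and every pair votes for the true class with the same probability q > 1/2 in both classes. The symbols v_k, s_k, q and K are notation introduced here.

```latex
\begin{align*}
\log\frac{P(Y=1 \mid v_1,\dots,v_K)}{P(Y=2 \mid v_1,\dots,v_K)}
  &= \log\frac{P(Y=1)}{P(Y=2)}
     + \sum_{k=1}^{K} \log\frac{P(v_k \mid Y=1)}{P(v_k \mid Y=2)} \\
  &= \sum_{k=1}^{K} s_k \log\frac{q}{1-q},
     \qquad s_k = \begin{cases} +1 & \text{if } v_k = 1 \\ -1 & \text{if } v_k = 2 \end{cases} \\
  &= \log\frac{q}{1-q}\,\bigl(\#\{k : v_k = 1\} - \#\{k : v_k = 2\}\bigr)
\end{align*}
```

Because q > 1/2 makes the logarithm positive, the sign of the log-odds is the sign of the vote difference, so under these assumptions the Naïve Bayes decision coincides with the majority vote.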

  13. Loop of cross-validation • Leave-one-out CV • Estimated accuracy: 1 - e/n, where e is the total number of errors in cross-validation • Only the TSPs have to be re-determined inside the CV loop, which gives an unbiased error estimate. • More complicated models, e.g. ANNs and decision trees, need to include both model topology and parameters in the CV loop and are more likely to yield biased estimates.
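The loop can be sketched as below, with the pair selection repeated inside every fold so that the held-out sample never influences which pair is chosen. This is a simplified illustration using a single best pair per fold; `best_pair`, `loocv_accuracy` and the toy data are hypothetical names and values.

```python
from itertools import combinations

def best_pair(profiles, labels):
    """Return (i, j, class_if_less) for the pair with the largest score."""
    n_genes = len(profiles[0])
    c1 = [x for x, y in zip(profiles, labels) if y == 1]
    c2 = [x for x, y in zip(profiles, labels) if y == 2]
    best, best_delta = None, -1.0
    for i, j in combinations(range(n_genes), 2):
        p1 = sum(x[i] < x[j] for x in c1) / len(c1)
        p2 = sum(x[i] < x[j] for x in c2) / len(c2)
        if abs(p1 - p2) > best_delta:  # ties keep the first pair found
            best_delta = abs(p1 - p2)
            best = (i, j, 1 if p1 > p2 else 2)
    return best

def loocv_accuracy(profiles, labels):
    """Leave-one-out estimate 1 - e/n with selection inside the loop."""
    n = len(labels)
    errors = 0
    for k in range(n):
        train_x = profiles[:k] + profiles[k + 1:]
        train_y = labels[:k] + labels[k + 1:]
        i, j, class_if_less = best_pair(train_x, train_y)
        other = 2 if class_if_less == 1 else 1
        pred = class_if_less if profiles[k][i] < profiles[k][j] else other
        errors += pred != labels[k]
    return 1 - errors / n

# Toy data (assumed): 6 samples, 3 genes; gene 0 < gene 1 only in class 1.
profiles = [[1, 5, 3], [2, 6, 9], [1, 4, 0], [7, 3, 4], [8, 2, 1], [9, 1, 6]]
labels = [1, 1, 1, 2, 2, 2]
accuracy = loocv_accuracy(profiles, labels)
print(accuracy)
```

Selecting the pair once on all n samples and then cross-validating only the prediction step would leak information from the held-out sample and make the estimate optimistic.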

  14. Three benchmark cancer datasets • Determine lymph node status in breast tumor samples (West et al. 2001, G = 7129, n = 49) • Classify leukemia subtypes (Golub et al. 1999, G = 7129, n = 72) • Distinguish prostate tumors from normal samples (Singh et al. 2002, G = 12600, n = 102)

  15. Statistical significance of the score is evaluated by a permutation test • Repeat e.g. 1000 times • Keep the feature matrix A • Randomize the class labels, keeping n1 and n2 unchanged • Get the top score Zmax • Get the histogram of Zmax
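The permutation procedure can be sketched as follows. Shuffling the label vector keeps n1 and n2 fixed automatically. The score is assumed to be the same |p_ij(1) - p_ij(2)| quantity used for pair selection, and the "+1" in the p-value is a common convention that the slide does not state; the function names and toy data are hypothetical.

```python
import random
from itertools import combinations

def max_score(profiles, labels):
    """Top pair score (Zmax on the slide) for a given labelling."""
    n_genes = len(profiles[0])
    c1 = [x for x, y in zip(profiles, labels) if y == 1]
    c2 = [x for x, y in zip(profiles, labels) if y == 2]
    return max(
        abs(sum(x[i] < x[j] for x in c1) / len(c1)
            - sum(x[i] < x[j] for x in c2) / len(c2))
        for i, j in combinations(range(n_genes), 2)
    )

def permutation_test(profiles, labels, n_perm=1000, seed=0):
    """Null distribution of the top score under shuffled class labels."""
    rng = random.Random(seed)
    observed = max_score(profiles, labels)
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]   # feature matrix stays fixed
        rng.shuffle(shuffled)  # class sizes n1 and n2 are preserved
        null.append(max_score(profiles, shuffled))
    # Assumed convention: count the observed labelling as one permutation.
    p_value = (1 + sum(z >= observed for z in null)) / (1 + n_perm)
    return observed, null, p_value

# Toy data (assumed): 6 samples, 3 genes.
profiles = [[1, 5, 3], [2, 6, 9], [1, 4, 0], [7, 3, 4], [8, 2, 1], [9, 1, 6]]
labels = [1, 1, 1, 2, 2, 2]
observed, null, p_value = permutation_test(profiles, labels, n_perm=200)
print(observed, p_value)
```

The list `null` is what the histogram of Zmax would be drawn from. With only six samples there are few distinct labellings, so the toy p-value is not meaningful beyond showing the mechanics.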

  16. The top scoring gene pairs for the three cancer studies

  17. What does a TSP represent? • Change weak predictors into strong predictors? • Change of reference, e.g. gene 3 as reference for gene 1 and gene 2 • Combine two markers, e.g. X4 and X5 might be individual markers, one for each class, X4|-|X5 ?

  18. Performance of the TSP classifier compared with previous studies • Breast cancer: k-nearest neighbors (8-26); DLDA (8-19); DQDA (11-26); logitboost (9-21); random forests (6-20); SVM (7-29) • Leukemia: correlation analysis + weighted vote (85% on test set and 95% on CV set) • Prostate cancer: k-nearest neighbors applied to genes chosen by t-statistic

  19. Discussion and extensions • Multi-class classification • Unique TSP: when there are >> 1 TSPs, which is the most informative? • kTSP: there might be many pairs of genes with informative orderings; can this information be combined for more accurate classification? • Normalization invariance: a suitable method for integrating heterogeneous microarray datasets whose experimental conditions and normalization schemes differ.
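The normalization-invariance point can be checked with a small sketch: any strictly increasing transformation applied within a sample leaves every pairwise ordering, and hence every TSP feature, unchanged. The values and the two transformations below are arbitrary assumptions chosen for illustration.

```python
import math

def orderings(profile):
    """All within-sample comparisons x[i] < x[j] for i < j."""
    g = len(profile)
    return [profile[i] < profile[j] for i in range(g) for j in range(i + 1, g)]

raw = [120.0, 15.0, 980.0, 15.5]            # hypothetical raw intensities
scaled = [2.5 * v + 10.0 for v in raw]      # per-array scale and offset
logged = [math.log2(v) for v in raw]        # log transform

# Both transforms are strictly increasing, so the orderings agree.
assert orderings(raw) == orderings(scaled) == orderings(logged)
print(orderings(raw))
```

This only covers monotone, per-sample transformations; normalization schemes that can reorder genes within a sample are not covered by this argument.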