
Instance-based Classification


Presentation Transcript


  1. Instance-based Classification • Examine the training samples each time a new query instance is given. • The relationship between the new query instance and the training examples is used to assign a class label to the query instance.

  2. KNN: k-Nearest Neighbor • A test sample x is best predicted by determining the most common class label among the k training samples to which x is most similar. • xj: the jth training sample; yj: the class label for xj; Nx: the set of k nearest neighbors of x in the training set. • Estimate the probability that x belongs to the ith class as the fraction of its neighbors carrying that label: P(i | x) = |{xj in Nx : yj = i}| / k

  3. KNN: k-Nearest Neighbor, con't • This estimate is simply the proportion of the k nearest neighbors that belong to the ith class. • The class i that maximizes this proportion is assigned as the label of x. • Variants of KNN: filtering out irrelevant genes before applying KNN.
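
A minimal sketch of the k-nearest-neighbor rule described on the slides above, assuming expression profiles are plain numeric vectors and using Euclidean distance (the distance metric is our assumption; the slides do not specify one):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=5):
        """Label query x with the most common class among its k nearest
        training samples, and return the per-class proportions as the
        estimated class probabilities."""
        dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training sample
        nearest = np.argsort(dists)[:k]               # indices of the neighborhood N_x
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        probs = dict(zip(labels, counts / k))         # proportion of neighbors per class
        return labels[np.argmax(counts)], probs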

  4. Molecular Classification of Cancer Class Discovery and Class Prediction by Gene Expression Monitoring

  5. Publication Info • "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring" • Golub, Slonim, Tamayo, Huard, Gaasenbeek, Mesirov, Coller, Loh, Downing, Caligiuri, Bloomfield, Lander • Appears in Science, Volume 286, October 15, 1999 • Whitehead Institute/MIT Center for Genome Research • http://www-genome.wi.mit.edu/cancer • ...and Dana-Farber (Boston), St. Jude (Memphis), Ohio State • ...additional publications by the same group show a similar technique applied to different disease modalities.

  6. Cancer Classification • Class Discovery: defining previously unrecognized tumor subtypes • Class Prediction: assignment of particular tumor samples to already-defined classes • Given bone marrow samples: • Which cancer classes are present among the samples? • How many cancer classes are there: 2? 4? • Given that the samples are from leukemia patients, what type of leukemia is each sample (AML vs. ALL)?

  7. Leukemia: Definitions & Symptoms • Cancer of the bone marrow • Myelogenous or lymphocytic, acute or chronic • Acute Myelogenous Leukemia (AML) vs. Acute Lymphocytic Leukemia (ALL) • Marrow cannot produce the appropriate amount of red and white blood cells • Anemia -> weakness, minor infections; Platelet deficiency -> easy bruising • AML: 10,000 new adult cases per year • ALL: 3,500/2,400 new adult/child cases per year • AML vs. ALL in adults & children

  8. Leukemia: Treatment & Expected Outcome • Diagnosis requires a highly specialized laboratory • ALL: 58% survival rate • AML: 14% survival rate • Treatment: chemotherapy, bone marrow transplant • ALL: corticosteroids, vincristine, methotrexate, L-asparaginase • AML: daunorubicin, cytarabine • Correct diagnosis is very important for treatment options and expected outcome! • Microarrays could provide a systematic diagnosis option, BUT ONLY AS ONE TYPE OF DIAGNOSTIC TOOL!

  9. Leukemia: Data set • 38 bone marrow samples (27 ALL, 11 AML) • 6817 human gene probes

  10. Cancer Class Prediction • Learning Task • Given: Expression profiles of leukemia patients • Compute: A model distinguishing disease classes (e.g., AML vs. ALL patients) from expression data. • Classification Task • Given: Expression profile of a new patient + A learned model (e.g., one computed in a learning task) • Determine: The disease class of the patient (e.g., whether the patient has AML or ALL)

  11. Cancer Class Prediction • n genes measured in m patients; each patient is a vector of expression values paired with a class label:
  g1,1 … g1,n -> class1
  g2,1 … g2,n -> class2
  …
  gm,1 … gm,n -> classm
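
As a concrete toy illustration of this layout (the numbers are made up), each patient is one row of n expression values paired with a class label:

    import numpy as np

    # m patients (rows) x n genes (columns); values are invented for illustration
    X = np.array([[2.1, 0.3, 5.4],    # patient 1
                  [1.9, 0.4, 5.1],    # patient 2
                  [0.2, 4.8, 1.0]])   # patient m
    y = np.array(["ALL", "ALL", "AML"])   # class label per patient (row)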

  12. Cancer Class Prediction Approach • Rank genes by their correlation with class variable (AML/ALL) • Select subset of “informative” genes • Have these genes do a weighted vote to classify a previously unclassified patient. • Test validity of predictors.

  13. Ranking Genes • Rank genes by how predictive they are (individually) of the class:
  g1,1 … g1,n -> class1
  g2,1 … g2,n -> class2
  …
  gm,1 … gm,n -> classm

  14. Ranking Genes • Split the expression values for a given gene g into two pools, one for each class (AML vs. ALL) • Determine the mean m and standard deviation s of each pool • Rank genes by the correlation metric (separation) P(g, class) = (mALL - mAML)/(sALL + sAML), the mean difference between the classes relative to the standard deviation within the classes.
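
A sketch of this ranking step, assuming the patients-by-genes matrix X and label vector y from the toy illustration above:

    import numpy as np

    def gene_correlation(X, y, gene):
        """P(g, class) = (mALL - mAML) / (sALL + sAML) for a single gene."""
        all_vals = X[y == "ALL", gene]    # expression pool in the ALL samples
        aml_vals = X[y == "AML", gene]    # expression pool in the AML samples
        return (all_vals.mean() - aml_vals.mean()) / (all_vals.std() + aml_vals.std())

    def rank_genes(X, y):
        """Score every gene and return the indices ordered from most
        ALL-correlated (largest P) to most AML-correlated (most negative P)."""
        scores = np.array([gene_correlation(X, y, g) for g in range(X.shape[1])])
        return np.argsort(-scores), scores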

  15. Neighborhood Analysis • Each gene g: v(g) = (e1, e2, …, en), where ei is the expression level of gene g in the ith sample. • Idealized pattern: c = (c1, c2, …, cn), where ci is 1 or 0 depending on whether sample i belongs to class 1 or class 2; c* is an idealized random (permuted) pattern. • Count the number of genes having various levels of correlation with c, and compare with the corresponding distribution obtained for the random patterns c*.
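
Neighborhood analysis is essentially a permutation test. A rough sketch, reusing the rank_genes helper from the previous sketch (the number of permutations here is our assumption):

    import numpy as np

    def neighborhood_analysis(X, y, level=0.30, n_permutations=400, seed=0):
        """Count genes with P(g, c) above `level` for the real labels, and compare
        against the counts obtained for randomly permuted patterns c*."""
        rng = np.random.default_rng(seed)
        _, scores = rank_genes(X, y)
        observed = int(np.sum(scores > level))
        random_counts = []
        for _ in range(n_permutations):
            y_perm = rng.permutation(y)                # an idealized random pattern c*
            _, perm_scores = rank_genes(X, y_perm)
            random_counts.append(int(np.sum(perm_scores > level)))
        # fraction of random patterns whose neighborhood is at least as large
        p_value = float(np.mean(np.array(random_counts) >= observed))
        return observed, p_value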

  16. Selecting Informative Genes • Select the kALL top-ranked genes (highly expressed in ALL) and the kAML bottom-ranked genes (highly expressed in AML) according to P(g, class) = (mALL - mAML)/(sALL + sAML) • In Golub's paper, the 25 most positively correlated and the 25 most negatively correlated genes are selected.
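
Continuing the sketch, gene selection is just taking the top and bottom of the ranked list (25 of each in the paper), given the scores from rank_genes above:

    import numpy as np

    def select_informative_genes(scores, k_all=25, k_aml=25):
        """Pick the k_all genes most correlated with ALL (largest P(g, class))
        and the k_aml genes most correlated with AML (most negative P(g, class))."""
        order = np.argsort(-scores)          # descending P(g, class)
        all_genes = order[:k_all]            # highly expressed in ALL
        aml_genes = order[-k_aml:]           # highly expressed in AML
        return np.concatenate([all_genes, aml_genes])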

  17. Determine significant genes • A 1% significance level means that only 1% of random neighborhoods contain as many genes as the observed neighborhood. • 709 genes have P(g, c) > 0.30, the point where the observed count crosses the 1% significance curve. • If the labels were totally random, the median neighborhood at that level would contain only about 150 genes.

  18. Weighted Voting • Given a new patient to classify, each of the selected genes casts a weighted vote for only one class. • The class that gets the most votes is the prediction.

  19. Weighted Voting • Suppose that x is the expression level measured for gene g in the new patient. • The gene's vote is V = P(g, class) × |x - (mALL + mAML)/2| • The factor |x - (mALL + mAML)/2| is the distance from the measurement to the class boundary, reflecting how far the expression level in the sample deviates from the average of the AML and ALL means. • The factor P(g, class) is the weight for gene g, reflecting how well the gene is correlated with the class distinction.

  20. Prediction • Weighted vote for each class: VAML = Σ vi wi, summing over the genes i that vote for AML, where vi = |xi - (mAML + mALL)/2| and wi = P(gi, class); VALL is computed analogously over the genes that vote for ALL.
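
A sketch of the vote computation for one new patient. The helper name and argument layout are ours; m_all and m_aml are the per-gene class means estimated on the training set, and weights holds P(g, class):

    def weighted_votes(x_new, genes, m_all, m_aml, weights):
        """Sum the weighted votes for ALL and AML over the selected genes.
        The sign of each gene's vote picks the class it votes for; its
        absolute value is the size of the vote."""
        v_all, v_aml = 0.0, 0.0
        for g in genes:
            boundary = (m_all[g] + m_aml[g]) / 2.0     # midpoint of the class means
            vote = weights[g] * (x_new[g] - boundary)  # signed weighted vote
            if vote > 0:
                v_all += vote        # positive weight x positive deviation -> ALL
            else:
                v_aml += -vote       # otherwise the gene votes for AML
        return v_all, v_aml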

  21. Prediction Strength • Can assess the “strength” of a prediction as follows: PS = (Vwinner – Vloser)/(Vwinner+ Vloser) where Vwinner is the summed vote (absolute value) from the winning class, and Vloser is the summed vote (absolute value) for the losing class

  22. Prediction Strength • When classifying new cases, the algorithm ignores those cases where the strength of the prediction is below a threshold θ: • Prediction = ALL, if VALL > VAML and PS > θ • Prediction = AML, if VAML > VALL and PS > θ • Prediction = No-call, otherwise.
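
The decision rule with the no-call threshold, given the two vote sums from the sketch above (the paper uses a prediction-strength cutoff of 0.3):

    def predict_class(v_all, v_aml, threshold=0.3):
        """Return ('ALL' | 'AML' | 'No-call', PS), where
        PS = (V_winner - V_loser) / (V_winner + V_loser)."""
        v_winner, v_loser = max(v_all, v_aml), min(v_all, v_aml)
        if v_winner + v_loser == 0:
            return "No-call", 0.0
        ps = (v_winner - v_loser) / (v_winner + v_loser)
        if ps <= threshold:
            return "No-call", ps
        return ("ALL" if v_all > v_aml else "AML"), ps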

  23. Experiments • Cross-validation with the original set of patients: • For i = 1 to 38 • Hold the ith sample aside • Use the other 37 samples to determine the weights • With this set of weights, make a prediction for the ith sample • Then test with another set of 34 patients…
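
A sketch of this leave-one-out loop, tying together the rank_genes, select_informative_genes, weighted_votes, and predict_class helpers sketched earlier (the fit helper below is ours, not the paper's):

    import numpy as np

    def fit(X_train, y_train, k_all=25, k_aml=25):
        """Estimate the informative genes, per-gene class means, and weights
        from the training samples only."""
        _, scores = rank_genes(X_train, y_train)
        genes = select_informative_genes(scores, k_all, k_aml)
        m_all = X_train[y_train == "ALL"].mean(axis=0)
        m_aml = X_train[y_train == "AML"].mean(axis=0)
        return genes, m_all, m_aml, scores   # the scores double as the weights P(g, class)

    def leave_one_out(X, y):
        """Hold each sample out in turn, train on the rest, and predict the held-out sample."""
        predictions = []
        for i in range(len(y)):
            keep = np.arange(len(y)) != i
            genes, m_all, m_aml, weights = fit(X[keep], y[keep])
            v_all, v_aml = weighted_votes(X[i], genes, m_all, m_aml, weights)
            predictions.append(predict_class(v_all, v_aml))
        return predictions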

  24. Prediction: Results • Cross-validation on the training set (train on 37, test on 1, repeated over all 38): 36/38 samples were classified, with 100% accuracy; the remaining 2 were left as uncertain • The independent "test set" consisted of 34 samples: 24 bone marrow samples, 10 peripheral blood samples • NOTE: the "training set" was ONLY bone marrow samples, while the test set contained childhood AML samples and samples from different laboratories • Strong predictions (PS = 0.77) for 29/34 samples, with 100% accuracy • The low-prediction-strength samples came from a questionable laboratory • Selection of 8-200 genes gives roughly the same prediction quality.

  25. Cancer Class Discovery • Given • Expression profiles of leukemia patients • Do • Cluster the profiles, leading to discovery of the subclasses of leukemia represented by the set of patients

  26. Cancer Class Discovery Experiment • Cluster the expression profiles of 38 patients in the training set • Using self-organizing maps with a predefined number of clusters (say, k) • Run with k = 2 • Cluster 1 contained 1 AML, 24 ALL • Cluster 2 contained 10 AML, 3 ALL
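
The paper clusters with self-organizing maps; as a rough stand-in that only illustrates "cluster the profiles into a predefined number of groups k and see how the known labels distribute", here is a k-means sketch (k-means is our substitution, not the paper's method):

    import numpy as np
    from sklearn.cluster import KMeans

    def discover_classes(X, y, k=2, seed=0):
        """Cluster expression profiles into k groups and report how the known
        AML/ALL labels fall across the discovered clusters."""
        clusters = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        for c in range(k):
            labels, counts = np.unique(y[clusters == c], return_counts=True)
            summary = ", ".join(f"{n} {lab}" for lab, n in zip(labels, counts))
            print(f"Cluster {c + 1}: {summary}")
        return clusters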

  27. Cancer Class Discovery Experiment • Run with k = 4 • Cluster 1 contained mostly AML • Cluster 2 contained mostly T-cell ALL • Cluster 3 contained mostly B-cell ALL • Cluster 4 contained mostly B-cell ALL • Remarkably, the clustering algorithm was able to discover the distinction between T-cell and B-cell ALL cases without being given that information
