Presentation by Tim Hamilton

Genetic algorithms applied to multi-class prediction for the analysis of gene expression data, by C.H. Ooi & Patrick Tan.


Presentation Transcript


  1. Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. C.H. Ooi & Patrick Tan. Presentation by Tim Hamilton.

  2. “Genechips”
     • DNA microarrays: a collection of microscopic DNA spots, each representing a single gene.
     • Commonly used to monitor the expression levels of thousands of genes at once.

  3. Classification
     • Gene expression data is commonly used to classify biological samples, for example by:
       - Tumor subtype
       - Response to certain types of treatment (e.g. chemotherapy)
     • Most approaches focus on classifying two, or at most three, classes and have high error rates (19%) when run on sets containing multiple classes.
     • The authors propose using a genetic algorithm (GA) to analyze multiple-class expression data.

  4. Previous rank-based approaches show reduced performance because:
     1) They miss correlations between genes.
     2) The predictor set size must be specified in advance.
     • Data sets used for the GA:
       - NCI60: expression profiles of 64 cancer cell lines containing 9703 cDNA sequences.
       - GCM: expression profiles for 198 tumor samples, 90 normal samples, and 20 unknowns containing 16063 genes.
     • Both data sets were pre-processed to generate a truncated 1000-gene dataset. Each spot was standardized as (color ratio of a single spot - mean color ratio of all spots) / standard deviation, and the 1000 genes with the highest standard deviation were kept.
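
The pre-processing on this slide can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' actual pipeline: the function names, the dictionary layout, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev

def standardize_array(ratios):
    """Standardize one array's spot ratios: (spot - mean of all spots) / std.

    Assumes the spots are not all identical (otherwise the std is zero).
    """
    mu = mean(ratios)
    sigma = pstdev(ratios)  # population std; the slides do not say which variant was used
    return [(r - mu) / sigma for r in ratios]

def top_variance_genes(expr, n_keep):
    """Return the n_keep gene names with the highest standard deviation.

    expr is a hypothetical layout: gene name -> list of values across samples.
    """
    ranked = sorted(expr, key=lambda g: pstdev(expr[g]), reverse=True)
    return ranked[:n_keep]

# Toy example with three genes; the slides keep 1000 genes instead of 2.
expr = {"g1": [1.0, 1.0, 1.0], "g2": [0.0, 5.0, 10.0], "g3": [1.0, 2.0, 3.0]}
kept = top_variance_genes(expr, 2)
```

On the toy data, `g1` is constant across samples and is the gene that gets dropped.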

  5. Choosing a GA chromosome
     • Determine a minimum and maximum size range for the predictor set: [Rmin, Rmax].
     • Chromosome string: [R g1 g2 … gRmax]
       - R is the size of the predictive set.
       - Any genes past position R are ignored.
       - Genes are chosen from the list of 1000.
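
A small sketch of this encoding, assuming genes are identified by their index (0 to 999) in the truncated list and that the genes within one chromosome are distinct. Neither detail is stated on the slide, and the function names are hypothetical.

```python
import random

def random_chromosome(r_min, r_max, n_genes=1000, rng=random):
    """Build [R, g1, ..., gRmax]: R in [r_min, r_max], then r_max gene indices."""
    r = rng.randint(r_min, r_max)              # inclusive on both ends
    genes = rng.sample(range(n_genes), r_max)  # distinct indices (an assumption)
    return [r] + genes

def decode(chromosome):
    """Return the active predictor set: the first R genes after the R field."""
    r = chromosome[0]
    return chromosome[1:1 + r]

# A chromosome for the size range [5, 10], with a seeded generator for repeatability.
chrom = random_chromosome(5, 10, rng=random.Random(0))
predictor_set = decode(chrom)
```

The chromosome always has Rmax + 1 entries, so every individual in the population has the same length even though the predictor sets differ in size.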

  6. Parameters
     • Population size: 100
     • Generations: 100
     Other parameters were varied:
     • Crossover method: one-point or uniform
     • Selection method: stochastic universal sampling (SUS) or roulette wheel selection (RWS)
     • Probability of crossover: 0.7 – 1.0
     • Probability of mutation: 0.0005 – 0.01
     • Predictor set size range [Rmin, Rmax]: [5, 10], [11, 15], [16, 20], [21, 25], [26, 30]
     • For each predictor set size range, this produced 96 different runs.
     • The GA was run on both the truncated set and the full data set for comparison.
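
For readers unfamiliar with the two selection methods named above, here is a textbook-style sketch of each. It assumes non-negative fitness values with a positive total and is not taken from the authors' code; the function names are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def roulette_wheel(fitness, n, rng=random):
    """RWS: n independent spins; selection chance is proportional to fitness."""
    cum = list(accumulate(fitness))
    total = cum[-1]
    # min() guards against a floating-point pointer landing exactly on total.
    return [min(bisect_right(cum, rng.random() * total), len(cum) - 1)
            for _ in range(n)]

def stochastic_universal_sampling(fitness, n, rng=random):
    """SUS: one spin, then n equally spaced pointers over the same wheel."""
    cum = list(accumulate(fitness))
    step = cum[-1] / n
    start = rng.random() * step
    picks, i = [], 0
    for k in range(n):
        pointer = start + k * step
        # Advance to the individual whose wheel segment contains the pointer.
        while cum[i] <= pointer and i < len(cum) - 1:
            i += 1
        picks.append(i)
    return picks

# With fitness [3, 1] and four pointers, SUS always picks index 0 three times.
sus_picks = stochastic_universal_sampling([3.0, 1.0], 4, rng=random.Random(1))
rws_picks = roulette_wheel([0.0, 5.0, 0.0], 10, rng=random.Random(1))
```

The practical difference is variance: SUS gives each individual a number of copies close to its expected share, while RWS can over- or under-sample by chance.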

  7. Each generation of chromosomes is used to classify the data sets using a maximum likelihood (MLHD) classifier.
     • Fitness = 200 - (E1 + E2)
       - E1 = cross-validation error rate
       - E2 = independent test error rate
     • The MLHD classifier involves a lot of math, but is based on Bayes' rule.
     • Two previous rank-based methods were run on the same truncated data set for comparison.
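
The fitness formula is simple to compute once the two error rates are known. The sketch below assumes both rates are expressed as percentages (0 to 100), which would make 200 the best possible score. The slide does not state the units, and the helper names are hypothetical.

```python
def error_rate_percent(predicted, actual):
    """Percentage of samples whose predicted class differs from the true class."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * wrong / len(actual)

def fitness(e1, e2):
    """Fitness = 200 - (E1 + E2), with E1 and E2 as percentage error rates."""
    return 200.0 - (e1 + e2)

# Toy example: one of four cross-validation samples misclassified,
# and a made-up 10% error on an independent test set.
e1 = error_rate_percent(["a", "b", "c", "d"], ["a", "b", "c", "x"])
score = fitness(e1, 10.0)
```

Under this reading, a predictor set that classifies everything correctly in both evaluations scores 200, and each percentage point of error in either evaluation costs one point.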

  8. Results
     • Uniform crossover produced the best predictors in size ranges [11,15] and [16,20].
     • One-point crossover was best in ranges [5,10], [21,25] and [26,30].
     • Predictive accuracy was higher when run against the truncated data set.
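
For reference, the two crossover operators compared in these results can be sketched in their generic textbook form. The slides do not say how the R field is treated or whether duplicate genes are repaired after crossover, so this sketch treats each chromosome as a plain list and does neither.

```python
import random

def one_point_crossover(a, b, rng=random):
    """Swap the tails of two equal-length parents at one random cut point."""
    point = rng.randint(1, len(a) - 1)  # cut strictly inside the string
    return a[:point] + b[point:], b[:point] + a[point:]

def uniform_crossover(a, b, rng=random):
    """Swap each position independently with probability 0.5."""
    child1, child2 = [], []
    for x, y in zip(a, b):
        if rng.random() < 0.5:
            x, y = y, x
        child1.append(x)
        child2.append(y)
    return child1, child2

parent_a = [1, 2, 3, 4]
parent_b = [5, 6, 7, 8]
op_c1, op_c2 = one_point_crossover(parent_a, parent_b, rng=random.Random(0))
un_c1, un_c2 = uniform_crossover(parent_a, parent_b, rng=random.Random(0))
```

One-point crossover keeps neighbouring positions together, while uniform crossover mixes positions freely. That difference may be related to why each method did better in different size ranges, though the slides do not give an explanation.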

  9. Results vs. Other Methods

  10. Finally, the GA was compared to another method using SVM classification.
     • The SVM performed best when all 16063 genes of the data set were used: 22% error.
     • The GA used only 32 elements: 18% error.
