Team WALA – Week 3 • Team: • Will Darby, Alfredo Lopez, Len Russo, Alan Levin (Team Leader) • Application: Gene Expression in Human Breast Cancer Data • Experiments: • Two-way Hierarchical Clustering and SVM using TMeV • Gene Modules using GRAM • # of Features needed for SimpleKMeans to be Effective • Result of 2nd Human Breast Cancer dataset using SimpleKMeans and TMeV • Genetic Algorithms • Gene Modules using ReMoDiscovery
Results – TMeV • Results of TMeV 2-way Hierarchical Clustering trials. We were able to show (see screenshots on the Wiki): • The correlation between Luminal A & B and HER+/ER- • The correlation between Basal-like and Claudin-Low • Most of the Normal Breast-like samples grouped together. • Results of TMeV SVM trials go here • Results of TMeV K-Means Clustering • Ran a number of trials with differing numbers of classes • Clustering results were substantially different from those obtained with Weka
Gene Modules using GRAM • Gene module – a set of coexpressed genes that bind the same set of transcription factors • GRAM algorithm (http://psrg.lcs.mit.edu/GRAM/Index.html): • performs an exhaustive search among all possible combinations of transcription factors based on DNA-binding data • the search is made as restrictive as possible, since it values having no false positives over producing some false negatives • the search produces the initial gene module and its activators • the restriction on the binding data is then relaxed to include additional genes whose binding data match and whose expression data are similar to those of the genes in the module • GRAM correctly predicts known gene modules • GRAM predicts previously unknown linkages – a basis for future experiments
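The two-stage search described on this slide can be sketched roughly as below. This is a simplified illustration, not the GRAM implementation: the function names, the p-value thresholds (`strict`, `relaxed`), the correlation cutoff `min_corr` and the toy data are all assumptions made for the example, and the coexpression check on the core genes is omitted.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def bound_genes(binding, tfs, threshold):
    """Genes whose binding p-value is at or below threshold for every TF."""
    return {g for g, pvals in binding.items()
            if all(pvals.get(tf, 1.0) <= threshold for tf in tfs)}

def find_modules(binding, expression, tfs, strict=0.001, relaxed=0.01,
                 min_corr=0.8, min_size=2, max_tfs=2):
    """Exhaustively try TF combinations; thresholds are illustrative only."""
    modules = []
    for size in range(1, max_tfs + 1):
        for combo in combinations(sorted(tfs), size):
            # Stage 1: strict binding threshold gives the initial module.
            core = bound_genes(binding, combo, strict)
            if len(core) < min_size:
                continue
            profiles = [expression[g] for g in core]
            centre = [sum(col) / len(core) for col in zip(*profiles)]
            # Stage 2: relax binding, but require similar expression.
            candidates = bound_genes(binding, combo, relaxed) - core
            extra = {g for g in candidates
                     if pearson(expression[g], centre) >= min_corr}
            modules.append((combo, core | extra))
    return modules

# Hypothetical toy data: binding p-values and expression profiles.
binding = {
    "g1": {"TF_A": 0.0005, "TF_B": 0.0002},
    "g2": {"TF_A": 0.0008, "TF_B": 0.0009},
    "g3": {"TF_A": 0.005, "TF_B": 0.004},   # passes only the relaxed cutoff
    "g4": {"TF_A": 0.006, "TF_B": 0.003},   # relaxed cutoff, opposite profile
    "g5": {"TF_A": 0.5, "TF_B": 0.6},       # not bound
}
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.1, 2.9, 4.2],
    "g3": [0.9, 2.2, 3.1, 3.9],
    "g4": [4.0, 3.0, 2.0, 1.0],
    "g5": [1.0, 2.0, 3.0, 4.0],
}
modules = dict(find_modules(binding, expression, {"TF_A", "TF_B"}))
```

In this toy run g3 joins the module in the relaxed stage because its profile tracks the core genes, while g4 is bound at the relaxed cutoff but rejected on expression.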
Gene Modules • Described in “Computational discovery of gene modules and regulatory networks” • PhD Thesis: Georg Gerber, MIT • GRAM Algorithm (Genetic RegulAtory Modules) implements approach • http://psrg.lcs.mit.edu/GRAM/Index.html • Inputs: • Expression data file • Simple transform from Gene Expression Data • Gene regulator binding data • http://staffa.wi.mit.edu/cgi-bin/young_public/navframe.cgi?s=17&f=search • Web site example: • Format for input files • Description of ‘publicly available’ Yeast Expression Data • Pointer to Young Lab for binding of regulators to Genes.
GRAM Results • Gene Modules (small circles) are groups of Genes • Used as seeds for cluster analysis • Transcription Factors (red rectangles) regulate genes within the module • Gene relationships (dashed rectangles) were manually assessed after processing was complete to validate approach
Lessons Learned • Advanced Bioinformatics relies on thorough subject matter knowledge. • e.g. Georg Gerber first received an M.P.H. in Infectious Diseases from Berkeley. • No longer about processing ‘data’ – instead uses ‘expert’ information to aid algorithms. • Finding appropriate data sources is difficult • Expression data referred to ‘well-known’ datasets • Public regulator data is highly specialized, e.g. Saccharomyces cerevisiae • Next Steps: • Additional search for regulators related to Breast Cancer gene set • Look at different algorithms for Gene Modules, e.g. Inferelator
SimpleKMeans - # of Features needed for an Effective Classifier • Week 2 results appeared to show that, to be an effective classifier, SimpleKMeans needed many more features than Random Forest did • This is not the case • What was really being seen is the large difference in results caused by selecting different starting centroids – as many as 50 more misclassifications • Results posted on the Wiki show that SimpleKMeans and Random Forest perform similarly when the best starting centroids are used • Best Result: Used 138 CFS features & 9 Clusters – 88%
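The sensitivity to starting centroids can be reproduced on toy data. The sketch below is a plain Lloyd's k-means in pure Python, not Weka's SimpleKMeans; the synthetic 2-D blobs, the 20 seeds and the helper names are assumptions chosen only to show how the within-cluster sum of squares can vary with the seed.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def kmeans(points, k, seed, iters=50):
    """Lloyd's k-means; returns (centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # starting centroids depend on the seed
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    wcss = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, wcss

# Synthetic data: four Gaussian blobs (illustrative, not the cancer data).
data_rng = random.Random(42)
blob_centres = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [(cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
          for cx, cy in blob_centres for _ in range(30)]

# Same data, same k, different seeds for the starting centroids.
scores = {seed: kmeans(points, 4, seed)[1] for seed in range(20)}
best, worst = min(scores.values()), max(scores.values())
print(f"best WCSS {best:.1f}, worst WCSS {worst:.1f}")
```

Running several seeds and keeping the best result, as done here, is one common way to reduce the effect of an unlucky start.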
Gene Modules with ReMoDiscovery • ReMoDiscovery uses sequence motif data as a primary data source. Maximizes gene expression, and minimizes the number of regulators & motifs in each gene module. Ranks all the genes in the modules as to probability of inclusion • Readme PDF: http://homes.esat.kuleuven.be/~kmarchal/Supplementary_Information_Lemmens_2006/readme.pdf • Zip file: http://homes.esat.kuleuven.be/~kmarchal/Supplementary_Information_Lemmens_2006/
Mathematical Description of the Human Breast Cancer Classification Problem • We are trying to solve $A b = c$, where A is 232x1544, b is 1544x1 and c is 232x1. • We know A (the data matrix) and c (the class assignment, a value from 0 to 5 for each exemplar). • If we knew b, the weight vector, we would be done. But the problem is very ill-conditioned. • We can formulate a least squares problem: $A'A\,b = A'c \;\Rightarrow\; b = (A'A)^{-1}A'c$ • A'A determines the difficulty of the problem; it is 1544x1544 but has rank at most 232 (the number of exemplars), so the inverse exists only after reducing the features or regularizing. • It seems reasonable that fewer than 250 features should be required for a classifier.
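A small numerical check of the rank argument, assuming NumPy is available. The matrix sizes (20 exemplars, 60 features) are scaled-down stand-ins for 232 and 1544, the data are random, and the variable names are illustrative. `numpy.linalg.lstsq` is used in place of the explicit inverse because the normal-equations matrix is singular when there are more features than exemplars.

```python
import numpy as np

rng = np.random.default_rng(0)
n_exemplars, n_features = 20, 60          # stand-ins for 232 and 1544
A = rng.normal(size=(n_exemplars, n_features))
c = rng.integers(0, 6, size=n_exemplars).astype(float)  # classes 0..5

# The normal-equations matrix is n_features x n_features ...
AtA = A.T @ A
# ... but its rank cannot exceed the number of exemplars.
rank = int(np.linalg.matrix_rank(AtA))

# lstsq returns the minimum-norm least-squares solution without
# forming the (non-existent) inverse of A'A.
b, _, _, _ = np.linalg.lstsq(A, c, rcond=None)
fit_error = float(np.linalg.norm(A @ b - c))

print(AtA.shape, rank, fit_error)
```

With more features than exemplars the training data can be fit essentially exactly, which is why the fit error alone says little about how useful the weights are.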
Genetic Algorithms for Microarray Data • A genetic algorithm is a stochastic way of searching the data space. Having many more features than examples may cause us to vary the implementation. • To use a genetic algorithm we need several things: • A way of generating populations at each epoch • A way of evolving populations and determining when to stop • A definition of the “chromosome” and ways of evolving the chromosome • An objective and a measure of success • For the Breast cancer data, the chromosome will be a string of 1s and 0s equal in length to the number of features. • Each chromosome in the population will select attributes and score the training set (a subsample of the data).
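The chromosome described on this slide can be represented as a list of bits, one per feature. The sketch below is one possible encoding; the population size, the probability `p_on` that a bit starts switched on, and the function names are assumptions made for illustration.

```python
import random

def random_chromosome(n_features, p_on, rng):
    """Bit string with one position per feature; 1 means 'feature selected'."""
    return [1 if rng.random() < p_on else 0 for _ in range(n_features)]

def selected_features(chromosome):
    """Indices of the features that the chromosome switches on."""
    return [i for i, bit in enumerate(chromosome) if bit]

def initial_population(size, n_features, p_on, rng):
    """A list of independent random chromosomes."""
    return [random_chromosome(n_features, p_on, rng) for _ in range(size)]

rng = random.Random(0)
# 1544 matches the feature count quoted for the data matrix;
# size=20 and p_on=0.05 are arbitrary illustrative choices.
population = initial_population(size=20, n_features=1544, p_on=0.05, rng=rng)
first = selected_features(population[0])
print(len(first), "features selected by the first chromosome")
```

Keeping `p_on` small starts the search from sparse feature subsets, which fits the goal of finding a compact classifier.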
Running the Genetic Algorithm • An epoch is the time over which a population lives, i.e., the time from creation until the population stalls and does not increase in fitness. • A subset of the attributes will be used for each epoch. • This subset will consist of N random attributes, where N may change on an epoch-by-epoch basis. • Each population will evolve by crossing over (most likely at two points) and mutating the chromosomes. • The fitness of the resultant individuals will determine stochastically whether they stay in the population. • At the end of an epoch: • The poorest attributes will be deleted from the chromosomes. • New attributes will be selected from the pool of unused attributes and the chromosomes will randomly flip these attributes on. • We anticipate that N will grow with each epoch. • Old chromosomes of good breeding will be randomly retained and new chromosomes using the new set of attributes will be added. • The next epoch will begin.
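The evolution operators named on this slide can be sketched as follows. Two-point crossover and mutation restricted to the epoch's active attributes follow the description; binary tournament selection is a swapped-in, commonly used way to make survival stochastic, since the slide does not specify the rule. All names, rates and the placeholder fitness are assumptions.

```python
import random

def two_point_crossover(a, b, rng):
    """Swap the segment between two random cut points of the parents."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def mutate(chromosome, active, rate, rng):
    """Flip bits with probability `rate`, only at the active positions."""
    out = list(chromosome)
    for i in active:
        if rng.random() < rate:
            out[i] ^= 1
    return out

def tournament_survivors(population, scores, n_keep, rng):
    """Binary tournaments; the lower score wins (lower is better)."""
    survivors = []
    for _ in range(n_keep):
        i, j = rng.sample(range(len(population)), 2)
        winner = i if scores[i] <= scores[j] else j
        survivors.append(population[winner])
    return survivors

rng = random.Random(1)
parent_a = [1, 1, 1, 1, 0, 0, 0, 0]
parent_b = [0, 0, 0, 0, 1, 1, 1, 1]
child_a, child_b = two_point_crossover(parent_a, parent_b, rng)

# rate=1.0 forces a flip at every active position, for demonstration.
mutant = mutate(child_a, active=[0, 2], rate=1.0, rng=rng)

population = [parent_a, parent_b, child_a, child_b]
toy_scores = [sum(ch) for ch in population]  # placeholder fitness only
survivors = tournament_survivors(population, toy_scores, n_keep=2, rng=rng)
```

In a full run the placeholder fitness would be replaced by the scoring function, and the `active` list would be the N attributes chosen for the current epoch.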
Assumptions and Scoring • The assumptions about population affect how the genetic algorithm will work. We imagine computation will be intense; however, the goal is to find subsets of the chromosomes that are relevant in some manner, such as score on the testing data. If we want, we can start with a new population and train in a cross-validated manner. • Scoring is a bit unusual. For each epoch we propose solving a least-squares problem to get a weight vector. If D is the data matrix and C the class assignment, then Dw = C, where w is the weight vector. Obviously the weights only act where there is a “1” in the chromosome. • We must design the objective. Something like $N_{1s} \times \lVert Dw - C \rVert$ will be used, where $N_{1s}$ is the number of 1s in the chromosome. The lower the score the better, so results using a lot of attributes will be penalized.
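A minimal sketch of this scoring rule, assuming NumPy and reading the objective as the number of 1s in the chromosome multiplied by the least-squares residual norm. The function name `score`, the random data and the example chromosomes are illustrative assumptions.

```python
import numpy as np

def score(chromosome, D, C):
    """Number of selected attributes times the least-squares residual norm."""
    mask = np.asarray(chromosome, dtype=bool)
    n_ones = int(mask.sum())
    if n_ones == 0:
        return float("inf")  # an empty selection cannot classify anything
    D_sel = D[:, mask]       # weights act only where the chromosome has a 1
    w, _, _, _ = np.linalg.lstsq(D_sel, C, rcond=None)
    return n_ones * float(np.linalg.norm(D_sel @ w - C))

# Toy data: the target depends on attribute 0 only, plus a little noise.
rng = np.random.default_rng(0)
D = rng.normal(size=(30, 10))
C = 3.0 * D[:, 0] + 0.1 * rng.normal(size=30)

good = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]    # just the relevant attribute
padded = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # relevant attribute plus four others
bad = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]     # an irrelevant attribute
empty = [0] * 10

print(score(good, D, C), score(padded, D, C), score(bad, D, C))
```

One caveat with a multiplicative penalty: once the number of selected attributes reaches the number of exemplars, the residual can fall to zero and the score with it, so evaluating the residual on exemplars not used for the fit may be a safer variant.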
Prognosis for using Genetic Algorithms • Genetic algorithms are highly stochastic and rely heavily on the artistry of the engineer. In this day of SVMs and ensemble methods, do they have a role? Especially in the case of SVMs, the objective is very well defined. • Genetic algorithms are computationally intense but can search disparate parts of the feature space. They usually yield many good, but not optimal, solutions. • How do they compare to Random Forests, say? Random Forests look at different parts of the feature space as well, but once features are selected, the resultant tree is well defined by the Gini or information criterion. In the case of Genetic Algorithms, there is no hard-and-fast rule for design, and unexpected things may happen while evolving.