
Comparison of Data Mining Algorithms on Bioinformatics Dataset


Presentation Transcript


  1. Comparison of Data Mining Algorithms on Bioinformatics Dataset
  Melissa K. Carroll
  Advisor: Sung-Hyuk Cha
  March 4, 2003

  2. Overview
  • Began as independent study project completed with Dr. Cha in Spring 2002
  • Initial goal: compare data mining algorithms on a public bioinformatics dataset
  • Later: evaluate stacked generalization approach
  • Organization of presentation:
    • Introduction to task
    • Base models and performance
    • “Stacked” models and performance
    • Conclusion and future work

  3. Introduction: Data Mining
  • Application of machine learning algorithms to large databases
  • Often used to generate models to classify future data based on a “training” dataset of known classifications
  • If data is organized well, domain knowledge is not necessary for the data mining practitioner

  4. Introduction: Bioinformatics and Protein Localization
  • Bioinformatics: use of computational methods, e.g. data mining, to provide insights into molecular biology
  • Have large databases of information about genes; want to figure out the function of their encoded proteins
  • Proteins are expressed in a specific tissue, cell type, or subcellular component (localization)
  • Knowledge of a protein’s localization can shed light on its function

  5. Introduction

  6. Introduction: KDD Cup Dataset
  • KDD Cup: annual data mining competition sponsored by ACM SIGKDD
  • A training set with the target variable supplied and a test set with the target variable withheld are provided
  • Participants submit predictions for the test set’s target variable
  • Submissions with the highest accuracy rate (correct predictions / total instances in test set) win
  • The test set’s target variable is made publicly available once the competition is over
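
  The scoring metric is simple enough to state as code; a minimal sketch in Python:

    # Minimal sketch of the KDD Cup scoring metric: accuracy rate is
    # correct predictions divided by total instances in the test set.
    def accuracy_rate(predicted, actual):
        correct = sum(p == a for p, a in zip(predicted, actual))
        return correct / len(actual)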

  7. Introduction: KDD Cup Dataset Continued
  • The 2001 competition focused on bioinformatics, including a protein localization task
  • Dataset consisted of various information about anonymized genes of a particular organism, including class, phenotype, chromosome, whether essential, and other genes with which it interacts
  • Purpose of this project: compare data mining algorithms on the KDD Cup 2001 protein localization dataset

  8. Methods
  • Simplify dataset: reduce number of variables to facilitate working with a commercial data mining package (SAS Enterprise Miner)
  • Decided to eliminate variables pertaining to interactions between genes:
    • there were more of these variables than any other type
    • a sophisticated relational algorithm would be necessary to take full advantage of them
  • Correspondingly, decreased number of target values

  9. Frequency of Classes in KDD Cup Training Set

  10. Frequency of Classes in KDD Cup Test Set

  11. Methods Continued
  • Created subsets by selecting only instances whose target was among nucleus, cytoplasm, and mitochondria, and only non-relational variables
  • Divided the training subset into two random subsets of 314 and 313 instances (training and validation)
  • Two actual training datasets were built from this training set:
    • non-sampled raw data (314 instances)
    • sampled dataset in which each target value appeared in equal amounts, with a frequency variable added
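
  The preparation above was done in SAS Enterprise Miner; purely as an illustration, a rough Python/pandas equivalent of the subsetting, the 314/313 split, and the equal-frequency sampling (file and column names are assumptions):

    import pandas as pd

    # Hypothetical illustration; "genes.csv" and "localization" are assumed names.
    df = pd.read_csv("genes.csv")
    subset = df[df["localization"].isin(["nucleus", "cytoplasm", "mitochondria"])]

    # Random 314/313 split into training and validation subsets
    train = subset.sample(n=314, random_state=0)
    valid = subset.drop(train.index)

    # Equally distributed variant: same number of instances per class
    # (downsampling here; the original also carried a frequency variable
    # recording the resampling)
    n = train["localization"].value_counts().min()
    balanced = (train.groupby("localization", group_keys=False)
                     .apply(lambda g: g.sample(n=n, random_state=0)))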

  12. Methods Continued
  • A variable was excluded as an input if:
    • more than 50% of data missing (none excluded)
    • effectively unary (274 variables excluded)
    • in a hierarchy and not the most detailed level (none excluded)
  • Resulting training sets: 171 variables (170 binary, 1 non-binary categorical)
  • No missing values in any variables
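
  A sketch of these screening rules in the same hypothetical pandas setting (“effectively unary” is interpreted here as one value dominating, with an assumed 99% threshold; the hierarchy rule is domain-specific and omitted):

    # Sketch of the input-screening rules above (column names assumed).
    def screen_inputs(df, target="localization"):
        keep = []
        for col in df.columns:
            if col == target:
                continue
            if df[col].isna().mean() > 0.5:   # more than 50% missing
                continue
            shares = df[col].value_counts(normalize=True, dropna=True)
            if shares.empty or shares.iloc[0] > 0.99:  # effectively unary
                continue
            keep.append(col)
        return keep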

  13. Base Models

  14. Models
  • Artificial Neural Network
    • Fully connected feedforward network
    • One input node for each dummy variable from the 171 inputs
    • 1 hidden node and 2 output nodes: dummy values for nucleus and mitochondria
    • 191 randomly initialized weights
    • Trained using dual quasi-Newton optimization to minimize misclassification rate on the training set
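
  The network itself was built in SAS Enterprise Miner; as a rough scikit-learn analogue (the L-BFGS quasi-Newton solver stands in for SAS’s dual quasi-Newton optimizer, and X_train/y_train are the prepared data assumed earlier):

    from sklearn.neural_network import MLPClassifier

    # Comparably tiny network: many dummy inputs, one hidden node,
    # trained with a quasi-Newton method (L-BFGS here).
    ann = MLPClassifier(hidden_layer_sizes=(1,), solver="lbfgs",
                        max_iter=1000, random_state=0)
    ann.fit(X_train, y_train)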

  15. Models Continued
  • Decision Tree
    • Used a CHAID-like algorithm with a chi-squared p-value splitting criterion of 0.2 and model selection based on proportion of instances correctly classified
  • Hybrid ANN/Tree
    • Difficult for the ANN to learn with so many variables
    • Used the decision tree as a feature selector to determine which variables to use in training the ANN
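
  A sketch of the hybrid idea under the same assumptions (scikit-learn has no CHAID implementation, so a CART tree stands in as the feature selector):

    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Tree as feature selector: keep only the variables the tree used,
    # then train the small ANN on that reduced input set.
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    selected = tree.feature_importances_ > 0

    hybrid_ann = MLPClassifier(hidden_layer_sizes=(1,), solver="lbfgs",
                               max_iter=1000, random_state=0)
    hybrid_ann.fit(X_train[:, selected], y_train)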

  16. Models Continued
  • Nearest Neighbor
    • Simple nearest neighbor algorithm: assigned each instance to be predicted to the class of the training instance that matched it on the greatest number of variables
    • A match was defined as having the exact same value
    • In case of ties, the tied class that occurred most frequently in the raw training set was used, including when applying to the equally distributed training set
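
  A minimal sketch of this matching-count nearest neighbor (names hypothetical; raw_class_counts would hold the class frequencies of the raw training set used for tie-breaking):

    import numpy as np

    # Similarity = number of variables whose values are exactly equal;
    # ties among the best matches go to the class most frequent in the
    # raw training set.
    def predict_nearest(x, X_train, y_train, raw_class_counts):
        matches = (X_train == x).sum(axis=1)
        tied = y_train[matches == matches.max()]
        return max(set(tied), key=lambda c: raw_class_counts[c])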

  17. Preliminary Results
  • Accuracy rates
  • Statistical comparisons
    • Hybrid Tree-ANN significantly better for non-sampled than equally distributed on the test dataset (p < 0.01)
    • Non-sampled dataset: Hybrid Tree-ANN not significantly better than non-sampled Tree (p = 0.06) but significantly better than non-sampled ANN (p < 0.05)

  18. Reference Point for Results
  • Highest accuracy rate in the actual competition: 71.1%
  • Next 5 entries between 68.5% and 70.6%
  • My accuracy rates are just slightly off due to a gene with two localizations
  • The actual competition required prediction over many more possible values for the target variable
  • However, actual competitors had more variables to work with (the relational ones)

  19. “Stacked” Models

  20. Stacking
  • Method for combining models
  • Not as common as other combining methods, and no standard way of doing it
  • Part of the training set is used to train the level-0, or base, models as usual
  • A dataset is built from the base models’ predictions on the remainder of the set (the validation set in this project)
  • The level-1 model is derived from this prediction dataset rather than from predictions on the training set, to prevent it from favoring overfit base models
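
  A sketch of the procedure under the same assumed names as the earlier snippets, with a level-1 decision tree as the combiner:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Base models were fit on the training split; their class predictions
    # on the validation split become the inputs of the level-1 model.
    def stack_inputs(base_models, X, class_index):
        cols = [[class_index[c] for c in m.predict(X)] for m in base_models]
        return np.array(cols).T

    class_index = {"nucleus": 0, "cytoplasm": 1, "mitochondria": 2}
    Z_valid = stack_inputs(base_models, X_valid, class_index)
    level1 = DecisionTreeClassifier(random_state=0).fit(Z_valid, y_valid)
    predictions = level1.predict(stack_inputs(base_models, X_test, class_index))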

  21. Methods Continued
  • Level-1 ANN
    • Same as level-0 ANN (used Levenberg-Marquardt optimization, since there were fewer weights)
  • Level-1 Decision Tree
    • Same as level-0 Tree
  • Level-1 Naïve Bayes
    • Calculated the likelihood of each target value by applying Bayes’ rule to the level-0 predictions
    • Predicted the value with the highest likelihood
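
  For the Naïve Bayes combiner, a hand-rolled sketch of the Bayes-rule calculation (Laplace smoothing is my addition to avoid zero probabilities):

    from collections import Counter, defaultdict

    # P(class | preds) is proportional to P(class) * prod_i P(pred_i | class),
    # with probabilities estimated by counting over the level-1 dataset.
    def naive_bayes_fit(Z, y):
        prior = Counter(y)                  # class counts
        cond = defaultdict(Counter)         # cond[(model_i, class)][pred]
        for row, label in zip(Z, y):
            for i, pred in enumerate(row):
                cond[(i, label)][pred] += 1
        n_values = len(prior)

        def predict(row):
            def score(c):
                s = prior[c] / len(y)
                for i, pred in enumerate(row):
                    s *= (cond[(i, c)][pred] + 1) / (prior[c] + n_values)
                return s
            return max(prior, key=score)

        return predict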

  22. Results of Stacking Approach Continued
  • Accuracy rates
  • Statistical comparisons
    • For non-sampled, all level-1 models significantly better than the level-0 ANN
    • For equally distributed, no level-1 model significantly better than the level-0 ANN
    • For non-sampled, no level-1 model significantly better than the level-0 NN on the same dataset

  23. Conclusion and Future Work

  24. Conclusion
  • Stacked generalization produced more accurate predictors of the test data than the base models overall, though not necessarily significantly so
    • Consistent with intuition and other findings
  • Nearest Neighbor and Hybrid Tree-ANN more accurate than ANN and Tree alone, though not necessarily significantly so
    • May need better-trained ANN and tree

  25. Conclusion Continued
  • Three types of level-1 models performed comparably
  • Other research suggests linear models may work best for stacking, so the Bayesian model might be expected to perform best
  • An Apriori-style search on the prediction dataset before Bayesian training, to reject conclusions without enough support, may improve the Bayesian model’s performance

  26. Conclusion Continued
  • The non-sampled training dataset (with the target distribution found in the raw data) produced more accurate models than the equally distributed training dataset
    • Sample size may have been too small
    • Could try without the weight variable, since it’s likely that prior probabilities aren’t known (unless the localizations of all genes for this organism are known)

  27. Future Work
  • Use cross-validation to obtain better estimates of error, both overall and for creating the level-1 training dataset
    • Dividing the training set in two may have left too few instances and inputs
  • Change the stacking approach:
    • Use posterior probabilities instead of predictions
    • Use different or modified algorithms (more linear; add Apriori to the Bayesian model)
    • Use a level-2 model on these level-1 models
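
  Two of these changes are easy to sketch together: building the level-1 dataset from cross-validated posterior probabilities and using a more linear level-1 model (scikit-learn specifics are my substitution):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    # Each base model contributes its cross-validated class-membership
    # probabilities; a linear model combines them at level 1.
    Z = np.hstack([
        cross_val_predict(m, X_all, y_all, cv=5, method="predict_proba")
        for m in base_models
    ])
    level1 = LogisticRegression(max_iter=1000).fit(Z, y_all)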

  28. Future Work Continued
  • Stratify the training and validation datasets to keep the distribution the same as in the original training set
  • Run chi-square tests on all combinations of models and adjust for multiple comparisons (cross-validation is usually the preferred method)
  • Try on the complete KDD Cup dataset
