
Comparison of Data Mining Algorithms on Bioinformatics Dataset


Presentation Transcript


  1. Comparison of Data Mining Algorithms on Bioinformatics Dataset
  Melissa K. Carroll
  Advisor: Sung-Hyuk Cha
  March 4, 2003

  2. Overview
  • Began as independent study project completed with Dr. Cha in Spring 2002
  • Initial goal: compare data mining algorithms on a public bioinformatics dataset
  • Later: evaluate stacked generalization approach
  • Organization of presentation:
    • Introduction to task
    • Base models and performance
    • “Stacked” models and performance
    • Conclusion and future work

  3. Introduction: Data Mining
  • Application of machine learning algorithms to large databases
  • Often used to generate models to classify future data based on a “training” dataset of known classifications
  • If data is organized well, domain knowledge is not necessary for the data mining practitioner

  4. Introduction: Bioinformatics and Protein Localization
  • Bioinformatics: use of computational methods, e.g. data mining, to provide insights into molecular biology
  • Have large databases of information about genes; want to figure out the function of their encoded proteins
  • Proteins are expressed in a specific tissue, cell type, or subcellular component (localization)
  • Knowledge of a protein’s localization can shed light on its function

  5. Introduction

  6. Introduction: KDD Cup Dataset
  • KDD Cup: annual data mining competition sponsored by ACM SIGKDD
  • A training set with the target variable supplied and a test set with the target variable withheld are provided
  • Participants submit predictions for the test set’s target variable
  • Submissions with the highest accuracy rate (correct predictions / total instances in test set) win
  • The test set’s target variable is made publicly available once the competition is over
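
  The scoring metric is simple enough to state as code; a minimal sketch in Python:

    # Minimal sketch of the KDD Cup scoring metric: accuracy rate is
    # correct predictions divided by total instances in the test set.
    def accuracy_rate(predicted, actual):
        correct = sum(p == a for p, a in zip(predicted, actual))
        return correct / len(actual)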

  7. Introduction: KDD Cup Dataset Continued
  • The 2001 competition focused on bioinformatics, including a protein localization task
  • Dataset consisted of various information about anonymized genes of a particular organism, including class, phenotype, chromosome, whether essential, and other genes with which it interacts
  • Purpose of this project: compare data mining algorithms on the KDD Cup 2001 protein localization dataset

  8. Methods
  • Simplify dataset: reduce number of variables to facilitate working with a commercial data mining package (SAS Enterprise Miner)
  • Decided to eliminate variables pertaining to interactions between genes:
    • there were more of these variables than any other type
    • a sophisticated relational algorithm would be necessary to take full advantage of them
  • Correspondingly, decreased number of target values

  9. Frequency of Classes in KDD Cup Training Set

  10. Frequency of Classes in KDD Cup Test Set

  11. Methods Continued
  • Created subsets by selecting only instances whose target was among nucleus, cytoplasm, and mitochondria, and only non-relational variables
  • Divided the training subset into two random subsets of 314 and 313 instances (training and validation)
  • Two actual training datasets were built from this training set:
    • non-sampled raw data (314 instances)
    • sampled dataset in which each target value appeared in equal amounts, with a frequency variable added
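
  The preparation above was done in SAS Enterprise Miner; purely as an illustration, a rough Python/pandas equivalent of the subsetting, the 314/313 split, and the equal-frequency sampling (file and column names are assumptions):

    import pandas as pd

    # Hypothetical illustration; "genes.csv" and "localization" are assumed names.
    df = pd.read_csv("genes.csv")
    subset = df[df["localization"].isin(["nucleus", "cytoplasm", "mitochondria"])]

    # Random 314/313 split into training and validation subsets
    train = subset.sample(n=314, random_state=0)
    valid = subset.drop(train.index)

    # Equally distributed variant: same number of instances per class
    # (downsampling here; the original also carried a frequency variable
    # recording the resampling)
    n = train["localization"].value_counts().min()
    balanced = (train.groupby("localization", group_keys=False)
                     .apply(lambda g: g.sample(n=n, random_state=0)))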

  12. Methods Continued
  • A variable was excluded as an input if:
    • more than 50% of data missing (none excluded)
    • effectively unary (274 variables excluded)
    • in a hierarchy and not the most detailed level (none excluded)
  • Resulting training sets: 171 variables (170 binary, 1 non-binary categorical)
  • No missing values in any variables
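
  A sketch of these screening rules in the same hypothetical pandas setting (“effectively unary” is interpreted here as one value dominating, with an assumed 99% threshold; the hierarchy rule is domain-specific and omitted):

    # Sketch of the input-screening rules above (column names assumed).
    def screen_inputs(df, target="localization"):
        keep = []
        for col in df.columns:
            if col == target:
                continue
            if df[col].isna().mean() > 0.5:   # more than 50% missing
                continue
            shares = df[col].value_counts(normalize=True, dropna=True)
            if shares.empty or shares.iloc[0] > 0.99:  # effectively unary
                continue
            keep.append(col)
        return keep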

  13. Base Models

  14. Models
  • Artificial Neural Network
    • Fully connected feedforward network
    • One input node for each dummy variable from the 171 inputs
    • 1 hidden node and 2 output nodes: dummy values for nucleus and mitochondria
    • 191 randomly initialized weights
    • Trained using dual quasi-Newton optimization to minimize misclassification rate on the training set
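
  The network itself was built in SAS Enterprise Miner; as a rough scikit-learn analogue (the L-BFGS quasi-Newton solver stands in for SAS’s dual quasi-Newton optimizer, and X_train/y_train are the prepared data assumed earlier):

    from sklearn.neural_network import MLPClassifier

    # Comparably tiny network: many dummy inputs, one hidden node,
    # trained with a quasi-Newton method (L-BFGS here).
    ann = MLPClassifier(hidden_layer_sizes=(1,), solver="lbfgs",
                        max_iter=1000, random_state=0)
    ann.fit(X_train, y_train)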

  15. Models Continued
  • Decision Tree
    • Used a CHAID-like algorithm with a chi-squared p-value splitting criterion of 0.2 and model selection based on proportion of instances correctly classified
  • Hybrid ANN/Tree
    • Difficult for the ANN to learn with so many variables
    • Used the decision tree as a feature selector to determine which variables to use in training the ANN
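
  A sketch of the hybrid idea under the same assumptions (scikit-learn has no CHAID implementation, so a CART tree stands in as the feature selector):

    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Tree as feature selector: keep only the variables the tree used,
    # then train the small ANN on that reduced input set.
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    selected = tree.feature_importances_ > 0

    hybrid_ann = MLPClassifier(hidden_layer_sizes=(1,), solver="lbfgs",
                               max_iter=1000, random_state=0)
    hybrid_ann.fit(X_train[:, selected], y_train)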

  16. Models Continued
  • Nearest Neighbor
    • Simple nearest neighbor algorithm: assigned each instance to be predicted to the class of the training instance that matched it on the greatest number of variables
    • A match was defined as having the exact same value
    • In case of ties, the tied class that occurred most frequently in the raw training set was used, including when applying to the equally distributed training set
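
  A minimal sketch of this matching-count nearest neighbor (names hypothetical; raw_class_counts would hold the class frequencies of the raw training set used for tie-breaking):

    import numpy as np

    # Similarity = number of variables whose values are exactly equal;
    # ties among the best matches go to the class most frequent in the
    # raw training set.
    def predict_nearest(x, X_train, y_train, raw_class_counts):
        matches = (X_train == x).sum(axis=1)
        tied = y_train[matches == matches.max()]
        return max(set(tied), key=lambda c: raw_class_counts[c])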

  17. Preliminary Results
  • Accuracy rates
  • Statistical comparisons
    • Hybrid Tree-ANN significantly better for non-sampled than equally distributed on the test dataset (p < 0.01)
    • Non-sampled dataset: Hybrid Tree-ANN not significantly better than non-sampled Tree (p = 0.06) but significantly better than non-sampled ANN (p < 0.05)

  18. Reference Point for Results
  • Highest accuracy rate in the actual competition: 71.1%
  • Next 5 entries between 68.5% and 70.6%
  • My accuracy rates are just slightly off due to a gene with two localizations
  • The actual competition required prediction over many more possible values for the target variable
  • However, actual competitors had more variables to work with (the relational ones)

  19. “Stacked” Models

  20. Stacking
  • Method for combining models
  • Not as common as other combining methods, and no standard way of doing it
  • Part of the training set is used to train the level-0, or base, models as usual
  • A dataset is built from the base models’ predictions on the remainder of the set (the validation set in this project)
  • The level-1 model is derived from this prediction dataset rather than from predictions on the training set, to prevent it from favoring overfit base models
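
  A sketch of the procedure under the same assumed names as the earlier snippets, with a level-1 decision tree as the combiner:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Base models were fit on the training split; their class predictions
    # on the validation split become the inputs of the level-1 model.
    def stack_inputs(base_models, X, class_index):
        cols = [[class_index[c] for c in m.predict(X)] for m in base_models]
        return np.array(cols).T

    class_index = {"nucleus": 0, "cytoplasm": 1, "mitochondria": 2}
    Z_valid = stack_inputs(base_models, X_valid, class_index)
    level1 = DecisionTreeClassifier(random_state=0).fit(Z_valid, y_valid)
    predictions = level1.predict(stack_inputs(base_models, X_test, class_index))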

  21. Methods Continued
  • Level-1 ANN
    • Same as level-0 ANN (used Levenberg-Marquardt optimization, since there were fewer weights)
  • Level-1 Decision Tree
    • Same as level-0 Tree
  • Level-1 Naïve Bayes
    • Calculated the likelihood of each target value by applying Bayes’ rule to the level-0 predictions
    • Predicted the value with the highest likelihood
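
  For the Naïve Bayes combiner, a hand-rolled sketch of the Bayes-rule calculation (Laplace smoothing is my addition to avoid zero probabilities):

    from collections import Counter, defaultdict

    # P(class | preds) is proportional to P(class) * prod_i P(pred_i | class),
    # with probabilities estimated by counting over the level-1 dataset.
    def naive_bayes_fit(Z, y):
        prior = Counter(y)                  # class counts
        cond = defaultdict(Counter)         # cond[(model_i, class)][pred]
        for row, label in zip(Z, y):
            for i, pred in enumerate(row):
                cond[(i, label)][pred] += 1
        n_values = len(prior)

        def predict(row):
            def score(c):
                s = prior[c] / len(y)
                for i, pred in enumerate(row):
                    s *= (cond[(i, c)][pred] + 1) / (prior[c] + n_values)
                return s
            return max(prior, key=score)

        return predict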

  22. Results of Stacking Approach Continued
  • Accuracy rates
  • Statistical comparisons
    • For non-sampled, all level-1 models significantly better than the level-0 ANN
    • For equally distributed, no level-1 model significantly better than the level-0 ANN
    • For non-sampled, no level-1 model significantly better than the level-0 NN on the same dataset

  23. Conclusion and Future Work

  24. Conclusion
  • Stacked generalization produced more accurate predictors of the test data than the base models overall, though not necessarily significantly so
    • Consistent with intuition and other findings
  • Nearest Neighbor and Hybrid Tree-ANN more accurate than ANN and Tree alone, though not necessarily significantly so
    • May need better-trained ANN and tree

  25. Conclusion Continued
  • Three types of level-1 models performed comparably
  • Other research suggests linear models may work best for stacking, so the Bayesian model might be expected to perform best
  • An Apriori-style search on the prediction dataset before Bayesian training, to reject conclusions without enough support, may improve the Bayesian model’s performance

  26. Conclusion Continued
  • The non-sampled training dataset (with the target distribution found in the raw data) produced more accurate models than the equally distributed training dataset
    • Sample size may have been too small
    • Could try without the weight variable, since it’s likely that prior probabilities aren’t known (unless the localizations of all genes for this organism are known)

  27. Future Work
  • Use cross-validation to obtain better estimates of error, both overall and for creating the level-1 training dataset
    • Dividing the training set in two may have left too few instances and inputs
  • Change the stacking approach:
    • Use posterior probabilities instead of predictions
    • Use different or modified algorithms (more linear; add Apriori to the Bayesian model)
    • Use a level-2 model on these level-1 models
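
  Two of these changes are easy to sketch together: building the level-1 dataset from cross-validated posterior probabilities and using a more linear level-1 model (scikit-learn specifics are my substitution):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    # Each base model contributes its cross-validated class-membership
    # probabilities; a linear model combines them at level 1.
    Z = np.hstack([
        cross_val_predict(m, X_all, y_all, cv=5, method="predict_proba")
        for m in base_models
    ])
    level1 = LogisticRegression(max_iter=1000).fit(Z, y_all)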

  28. Future Work Continued
  • Stratify the training and validation datasets to keep the distribution the same as in the original training set
  • Run chi-square tests on all combinations of models and adjust for multiple comparisons (cross-validation is usually the preferred method)
  • Try on the complete KDD Cup dataset
