Weka solution for the 2004 KDD Cup Protein Homology Prediction task

Bernhard Pfahringer, Weka Group, University of Waikato, New Zealand

The problem
• Detect homologous protein sequences
• 153 train sequences * ~1000 sequences ==> 145751 pairs, each classified as match or not
• Very skewed: only 1296 matches (< 1%!)
• BUT: excellent attributes

The attributes

Algorithms doing well
• 2-fold cross-validation, looking only at predictive accuracy:
  • Linear SVM (with a logistic model on the output for better probabilities, Platt 1999)
  • 10 AdaBoosted unpruned decision trees
  • Random rules (~ RandomForest, ECML 2004 Rule Learning Workshop)
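The "logistic model on output" step can be sketched as follows. This is a simplified illustration of Platt-style calibration, not the Weka implementation: it fits p(match | f) = 1 / (1 + exp(a*f + b)) to raw SVM scores f by plain gradient descent on the log loss, and it omits the smoothed targets that Platt's method uses. The function names, learning rate, epoch count, and toy data are all assumptions made for the example.

```python
import math

def platt_prob(f, a, b):
    """Map a raw SVM score f to a probability with a fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(a * f + b))

def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit the sigmoid parameters (a, b) by gradient descent on log loss.

    Simplified sketch: no target smoothing, fixed step size.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for f, y in zip(scores, labels):
            p = platt_prob(f, a, b)
            # With p = 1 / (1 + exp(a*f + b)), the log-loss gradient is
            # (y - p) * f with respect to a and (y - p) with respect to b.
            grad_a += (y - p) * f
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Toy, non-separable data: larger scores tend to be matches (label 1).
toy_scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
toy_labels = [0, 0, 1, 0, 1, 1]
a, b = fit_platt(toy_scores, toy_labels)
calibrated = [platt_prob(f, a, b) for f in toy_scores]
```

The calibrated values are monotone in the raw score, so the ranking is unchanged; only the absolute probabilities move, which matters for criteria that depend on them.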

Performance criteria
• Top1: fraction of blocks with a homologous sequence ranked top1 (max)
• RMSE: root mean squared error (min)
• RKL: average rank of the lowest ranked homologous sequence (min)
• APR: average of the average precision in each block (max)
• Only RMSE depends on absolute values; for all other criteria a good ranking is sufficient
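The four criteria can be sketched in a few lines. This is an illustrative reading of the definitions above, not the official KDD Cup scoring code: it assumes each block is a list of (predicted score, true label) pairs, that every block contains at least one homolog, and that rank 1 is the highest score. All function names and the two toy blocks are made up for the example.

```python
import math

def ranked_labels(block):
    """True labels of one block, sorted by descending predicted score."""
    return [y for _, y in sorted(block, key=lambda sy: -sy[0])]

def top1(blocks):
    """Fraction of blocks whose top-ranked sequence is a homolog."""
    return sum(ranked_labels(b)[0] for b in blocks) / len(blocks)

def rmse(blocks):
    """Root mean squared error over all pairs, ignoring block structure."""
    pairs = [sy for b in blocks for sy in b]
    return math.sqrt(sum((s - y) ** 2 for s, y in pairs) / len(pairs))

def rkl(blocks):
    """Average (over blocks) rank of the lowest-ranked homolog."""
    def last_rank(labels):
        return max(i for i, y in enumerate(labels, start=1) if y == 1)
    return sum(last_rank(ranked_labels(b)) for b in blocks) / len(blocks)

def average_precision(labels):
    """Mean of precision@k taken at each rank k that holds a homolog."""
    hits, total = 0, 0.0
    for k, y in enumerate(labels, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits

def apr(blocks):
    """Average over blocks of the per-block average precision."""
    return sum(average_precision(ranked_labels(b)) for b in blocks) / len(blocks)

# Two toy blocks of (predicted probability, true label) pairs.
blocks = [
    [(0.9, 1), (0.2, 0), (0.6, 1), (0.1, 0)],
    [(0.8, 0), (0.7, 1), (0.3, 0), (0.4, 1)],
]
# A monotone transform of the scores: same ranking, different values.
squashed = [[(s ** 2, y) for s, y in b] for b in blocks]
```

Comparing `blocks` with `squashed` shows the last bullet in action: Top1, RKL and APR are identical for both, while RMSE changes because it depends on the absolute values.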

Unique Solution
• Voted ensemble of three classifiers:
  • Linear SVM + logistic model on output
  • AdaBoosted 10 unpruned J48 trees
  • 10^5 random rules
• Non-standard voting:
  • IF SVM and RandomRules agree ==> average their probabilities
  • ELSE use Booster as tie-breaker
• Lucky (first on Proteins, 18th on Physics)
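The voting rule might look roughly like this. The slide does not define "agree" or say exactly how the tie-breaker is applied, so two assumptions are made here: two classifiers agree when their probabilities fall on the same side of a 0.5 threshold, and on disagreement the booster's probability is returned as the ensemble output. The function name `vote` and its parameters are hypothetical.

```python
def vote(p_svm, p_rules, p_boost, threshold=0.5):
    """Combine three match probabilities with the non-standard voting rule.

    Assumptions (not stated on the slide): agreement means the same
    predicted class at `threshold`; the tie-breaker returns the booster's
    own probability.
    """
    svm_says_match = p_svm >= threshold
    rules_say_match = p_rules >= threshold
    if svm_says_match == rules_say_match:
        # SVM and random rules agree: average their probabilities.
        return (p_svm + p_rules) / 2.0
    # They disagree: let the boosted trees decide.
    return p_boost

agree_match = vote(0.9, 0.7, 0.1)      # both above 0.5, booster ignored
agree_nonmatch = vote(0.2, 0.4, 0.95)  # both below 0.5, booster ignored
disagree = vote(0.9, 0.3, 0.6)         # split vote, booster decides
```

One consequence of this rule is that the booster never influences the output when the other two classifiers agree, however confident it is.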

Ensemble performance

Attribute ranks

What I should have done
• Optimize separately
• Bagging for better probability estimates
• More data engineering (e.g. PCA, …)
• View it as an outlier detection problem
• Utilize block structure
• ?

(Standard) Lessons
• Data engineering (good attributes) essential
• Ensembles are more robust
• Weka is not just an educational tool
  • [at least some parts scale well]
• Java/open source DM tools are competitive
• But: could improve Weka considerably (volunteers and/or sponsors, get in touch :-)

Finally
• A big “THANK YOU” to the organizers of the KDD Cup 2004!
