Weka solution for the 2004 KDD Cup Protein Homology Prediction task

Bernhard Pfahringer, Weka Group, University of Waikato, New Zealand. The problem: detect homologous protein sequences. 153 train sequences * ~1000 sequences ==> 145751 pairs classified as match or not.


Presentation Transcript


  1. Weka solution for the 2004 KDD Cup Protein Homology Prediction task
     Bernhard Pfahringer, Weka Group, University of Waikato, New Zealand

  2. The problem
     • Detect homologous protein sequences
     • 153 train sequences * ~1000 sequences ==> 145751 pairs classified as match or not
     • Very skewed: only 1296 matches (< 1%!)
     • BUT: excellent attributes
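The skew on this slide can be checked with a few lines of arithmetic. The sketch below uses only the three counts given on the slide; the variable names are illustrative. It also shows why plain accuracy says little here: a classifier that always answers "no match" would already be more than 99% accurate.

```python
# Counts taken from the slide; names are illustrative.
n_blocks = 153        # training sequences (one block of candidates each)
n_pairs = 145751      # total (query, candidate) pairs
n_matches = 1296      # pairs labelled as homologous

match_rate = n_matches / n_pairs         # fraction of positive pairs
avg_block_size = n_pairs / n_blocks      # the "~1000 sequences" per block
majority_accuracy = 1.0 - match_rate     # accuracy of always predicting "no match"

print(f"match rate:        {match_rate:.4%}")
print(f"avg block size:    {avg_block_size:.1f}")
print(f"all-negative acc.: {majority_accuracy:.4%}")
```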

  3. The attributes

  4. Algorithms doing well
     • 2-fold cross-validation, looking only at predictive accuracy:
     • Linear SVM (with Logistic model on output for better probabilities, Platt 1999)
     • 10 AdaBoosted unpruned decision trees
     • Random rules (~ RandomForest, ECML 2004 Rule learning WS)

  5. Performance criteria
     • Top1: fraction of blocks with a homologous sequence ranked top1 (max)
     • RMSE: root mean squared error (min)
     • RKL: average rank of the lowest ranked homologous sequence (min)
     • APR: average of the average precision in each block (max)
     • Only RMSE depends on absolute values, for all other criteria a good ranking is sufficient
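The four criteria can be sketched in plain Python as follows. This is a minimal illustration, not the official KDD Cup scoring code. It assumes that each block is a pair of lists (predicted scores, 0/1 labels), that every block contains at least one homolog, that ties are broken by input order, and that RMSE is pooled over all pairs. The function names are hypothetical.

```python
import math

def rank_labels(scores, labels):
    """Labels reordered by descending score (best-ranked first)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def top1(ranked):
    """1.0 if the top-ranked sequence is a homolog, else 0.0."""
    return 1.0 if ranked[0] == 1 else 0.0

def rank_of_last(ranked):
    """1-based rank of the lowest-ranked homolog (assumes at least one)."""
    return max(pos for pos, y in enumerate(ranked, start=1) if y == 1)

def average_precision(ranked):
    """Mean of precision@k over the ranks k that hold a homolog."""
    hits, total = 0, 0.0
    for pos, y in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / pos
    return total / hits

def evaluate(blocks):
    """blocks: list of (scores, labels) pairs, one per query sequence."""
    ranked = [rank_labels(s, y) for s, y in blocks]
    n = len(blocks)
    sq_err = sum((p - y) ** 2 for s, ys in blocks for p, y in zip(s, ys))
    n_pairs = sum(len(s) for s, _ in blocks)
    return {
        "top1": sum(top1(r) for r in ranked) / n,
        "rmse": math.sqrt(sq_err / n_pairs),  # pooled over all pairs (assumption)
        "rkl": sum(rank_of_last(r) for r in ranked) / n,
        "apr": sum(average_precision(r) for r in ranked) / n,
    }

# Two toy blocks; halving every score keeps the ranking but changes RMSE.
blocks = [([0.9, 0.8, 0.1, 0.4], [1, 0, 0, 1]),
          ([0.2, 0.7, 0.6], [1, 0, 0])]
halved = [([p / 2 for p in s], y) for s, y in blocks]
result = evaluate(blocks)
result_halved = evaluate(halved)
```

Comparing `result` with `result_halved` illustrates the last bullet: Top1, RKL and APR are identical for both, while RMSE differs because it depends on the absolute probability values.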

  6. Unique Solution
     • Voted ensemble of three classifiers:
       • Linear SVM + logistic model on output
       • 10 AdaBoosted unpruned J48 trees
       • 10^5 random rules
     • Non-standard voting:
       • If SVM and RandomRules agree ==> average their probabilities
       • ELSE use Booster as tie-breaker
     • Lucky (first on Proteins, 18th on Physics)
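The voting rule can be sketched as a small function. The slide does not say how "agree" is decided or which probability is output when the Booster breaks a tie, so two assumptions are made here: the two classifiers agree when both fall on the same side of a 0.5 threshold, and on disagreement the Booster's probability is returned directly. The function name and the threshold parameter are hypothetical.

```python
def vote(p_svm, p_rules, p_boost, threshold=0.5):
    """Combine three match probabilities for one (query, candidate) pair.

    Assumption: 'agree' means SVM and random rules predict the same class
    at the given threshold. Assumption: on disagreement the Booster's
    probability is used as the ensemble output.
    """
    svm_says_match = p_svm >= threshold
    rules_say_match = p_rules >= threshold
    if svm_says_match == rules_say_match:
        # SVM and random rules agree: average their probabilities.
        return (p_svm + p_rules) / 2.0
    # They disagree: fall back on the Booster as tie-breaker.
    return p_boost

agree_match = vote(0.75, 0.5, 0.0)        # both say "match"
agree_no_match = vote(0.25, 0.125, 0.875) # both say "no match"
disagree = vote(0.75, 0.25, 0.125)        # Booster decides
```

Other readings of "tie-breaker" are possible, for example averaging the Booster's probability with that of whichever classifier it sides with; the slide does not settle this.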

  7. Ensemble performance

  8. Attribute ranks

  9. What I should have done
     • Optimize separately
     • Bagging for better probability estimates
     • More data engineering (e.g. PCA, …)
     • View it as an outlier detection problem
     • Utilize block structure
     • ?

  10. (Standard) Lessons
      • Data engineering (good attributes) essential
      • Ensembles are more robust
      • Weka is not just an educational tool
        • [at least some parts scale well]
      • Java/open source DM tools are competitive
      • But: could improve Weka considerably (volunteers and/or sponsors, get in touch :-)

  11. Finally
      • A big “THANK YOU” to the organizers of the KDD Cup 2004!
