
Adaptive Sampling Methods for Scaling up Knowledge Discovery Algorithms

Adaptive Sampling Methods for Scaling up Knowledge Discovery Algorithms. From Ch. 8 of Instance Selection and Construction for Data Mining (2001), by Carlos Domingo et al., Kluwer Academic Publishers (summarized by Jinsan Yang, SNU Biointelligence Lab).





Presentation Transcript


  1. Adaptive Sampling Methods for Scaling up Knowledge Discovery Algorithms. From Ch. 8 of Instance Selection and Construction for Data Mining (2001), by Carlos Domingo et al., Kluwer Academic Publishers (summarized by Jinsan Yang, SNU Biointelligence Lab)

  2. Abstract • Methods for large amounts of data • Adaptive sampling instead of random sampling • Keywords: Data Mining, Knowledge Discovery, Scalability, Adaptive Sampling, Concentration Bounds

  3. Outline • Introduction • General Rule Selection Problem • Adaptive Sampling Algorithm • An Application of Adaselect • Problem and Algorithm • Experiments • Concluding Remarks

  4. Introduction (1) • Analysis of large data sets • Redesign a known algorithm • Reduce the data size • A typical task in data mining • Finding or selecting some rules or laws (General Rule Selection) • General Rule Selection by random sampling (batch sampling) • Proper sample size: given by concentration bounds, also called deviation bounds (Chernoff, Hoeffding bounds) • Problems • An immense sample size is needed for good accuracy and confidence • For batch sampling, the sample size must be fixed a priori for the worst case, so it is overestimated in most situations
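
To make the batch-sampling problem concrete, here is a minimal sketch of the worst-case sample size that follows from Hoeffding's inequality combined with a union bound over the model set. It assumes utilities bounded in [0, 1]; the function name and parameters (`num_models`, `epsilon`, `delta`) are illustrative and do not come from the chapter.

```python
import math


def batch_sample_size(num_models: int, epsilon: float, delta: float) -> int:
    """Examples needed so that, with probability at least 1 - delta,
    every model's empirical utility is within epsilon of its true utility.

    Assumes utilities lie in [0, 1]. Hoeffding's inequality gives
    P(|estimate - truth| >= epsilon) <= 2 * exp(-2 * n * epsilon**2);
    a union bound over num_models models and solving for n gives the result.
    """
    return math.ceil(math.log(2 * num_models / delta) / (2 * epsilon ** 2))
```

The size grows with 1/epsilon squared whatever the data look like, which is why a size fixed in advance tends to be larger than necessary.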

  5. Introduction (2) • Overcoming these problems • Sampling in an online, sequential fashion (one by one or block by block) • Adaptive sample sizes (adaptive sampling)

  6. General Rule Selection Problem • Given data D (discrete, categorical?) and a model set H, select a model h with the maximum utility U(h) (supervised learning)

  7. Adaptive Sampling Algorithm (1) • Extension of Hoeffding bound • Reliability of Algorithm

  8. Adaptive Sampling Algorithm (2)
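
The body of this slide is not present in the transcript. As a rough illustration of the idea of sequential sampling with a data-dependent stopping point, the sketch below draws examples one by one and stops as soon as a Hoeffding-style confidence radius separates the leading model from the runner-up, or is small enough on its own. The stopping rule is an assumption chosen for illustration and is not claimed to be the chapter's rule; `adaptive_select`, `draw_example`, and `utility` are hypothetical names, and per-example utilities are assumed to lie in [0, 1].

```python
import math


def adaptive_select(models, draw_example, utility, epsilon, delta):
    """Sample sequentially and return (selected_model, examples_used).

    Illustrative stopping rule, not taken from the chapter. With
    radius_t = sqrt(ln(2 * n * t * (t + 1) / delta) / (2 * t)), a union bound
    over the n models and all steps t keeps every running average within
    radius_t of its true utility with probability at least 1 - delta.
    """
    n = len(models)
    totals = [0.0] * n  # running sum of per-example utilities per model
    t = 0
    while True:
        x = draw_example()
        t += 1
        for i, h in enumerate(models):
            totals[i] += utility(h, x)
        radius = math.sqrt(math.log(2 * n * t * (t + 1) / delta) / (2 * t))
        means = sorted((s / t for s in totals), reverse=True)
        gap = means[0] - means[1] if n > 1 else float("inf")
        # Stop early when the leader is clearly ahead, or when every estimate
        # is accurate enough that the leader is within epsilon of the best.
        if gap >= 2 * radius or radius <= epsilon / 2:
            best = max(range(n), key=lambda i: totals[i])
            return models[best], t
```

When one model is far better than the rest, the first condition fires after far fewer examples than the second, which is the behaviour adaptive sampling aims for.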

  9. An Application of Adaselect (1) • Can be applied as a tool for the General Rule Selection problem • Example chosen: a boosting-based classification algorithm that uses a simple decision stump learner as the base learner • Decision stump: a single-split decision tree • AdaBoost: boosting by sub-sampling or by re-weighting • Apply adaptive sampling to the base learner (boosting by filtering) • Use MadaBoost, which keeps example weights bounded by their initial values

  10. An Application of Adaselect (2) • Algorithm • Data: discrete instance vectors with labels • Classification rule: decision stump • 0-1 error measure • Utility function U: average prediction
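
As a small illustration of the model class described on this slide, the sketch below defines a decision stump over discrete attributes for a two-class problem and scores it by the fraction of correct predictions (one minus the 0-1 error). The `Stump` class, the equality test on a single attribute value, and the helper names are assumptions made for this example, not the chapter's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stump:
    """Single-split rule: test one discrete attribute against one value."""
    attribute: int       # index of the attribute that is tested
    value: int           # discrete value the attribute is compared with
    label_if_equal: int  # class (0 or 1) predicted when the test succeeds

    def predict(self, x):
        if x[self.attribute] == self.value:
            return self.label_if_equal
        return 1 - self.label_if_equal


def accuracy(stump, examples):
    """Fraction of (x, y) pairs classified correctly, i.e. 1 - (0-1 error)."""
    correct = sum(stump.predict(x) == y for x, y in examples)
    return correct / len(examples)


def all_stumps(values_per_attribute):
    """Enumerate every stump for attributes with the given numbers of values."""
    return [
        Stump(a, v, c)
        for a, k in enumerate(values_per_attribute)
        for v in range(k)
        for c in (0, 1)
    ]
```

With discrete data the stump set is finite and small, so scoring every stump on a sample is a natural fit for a selection procedure driven by sampled utilities.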

  11. An Application of Adaselect (3) • Experiments • Discretize numeric attributes into 5 intervals and treat a missing value as an additional value • Artificial inflation (100 copies) of the original UCI data • Two-class problems only • 10-fold cross-validation, with results averaged over 10 runs • Computer: Alpha CPU at 600 MHz, 250 MB memory, 4.3 GB hard disk, running Linux • C4.5 and Naïve Bayes classifiers for comparison • Boosting rounds: 10 • Number of all possible decision stumps: (set of weighted majority of ten depth-1 decision tree)
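
The preprocessing step above can be sketched as follows. The slide does not say how the 5 intervals are chosen, so this example assumes equal-width intervals; representing a missing value as `None` and the function name `discretize` are also assumptions.

```python
def discretize(values, num_bins=5):
    """Map numeric values to codes 0..num_bins-1 using equal-width intervals.

    Missing values (None) get the extra code num_bins, so they are treated
    as one more discrete value. Equal-width binning is an assumption here.
    """
    present = [v for v in values if v is not None]
    if not present:
        return [num_bins] * len(values)
    lo, hi = min(present), max(present)
    width = (hi - lo) / num_bins or 1.0  # guard against a constant column
    codes = []
    for v in values:
        if v is None:
            codes.append(num_bins)
        else:
            # The maximum falls on the upper edge, so clamp it to the last bin.
            codes.append(min(int((v - lo) / width), num_bins - 1))
    return codes
```

For example, `discretize([0.0, 2.5, 5.0, 10.0, None])` yields codes in the range 0 to 4 for the numeric entries and the extra code 5 for the missing one.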

  12. An Application of Adaselect (4)

  13. An Application of Adaselect (5) • AdaSel is faster than C4.5 • Faster for large sample sizes

  14. Concluding Remarks • Justification and efficiency analysis • Applied in the design of a base learner for a boosting algorithm
