
Systematic Data Selection to Mine Concept Drifting Data Streams


Presentation Transcript


  1. Systematic Data Selection to Mine Concept Drifting Data Streams Wei Fan, IBM T.J. Watson

  2. About … • Data streams: a continuous stream of new data, generated either in real time or periodically. • Credit card transactions. • Stock trades. • Insurance claim data. • Phone call records. • Our notation is illustrated on the next slide.

  3. Data Streams • [Diagram: a timeline of data chunks t1 through t5, running from old data on the left to new data on the right.]

  4. Data Stream Mining • Characteristics of the stream may change over time. • Main goal of stream mining: • make sure that the constructed model is the most accurate and up-to-date.

  5. Data Sufficiency • Definition: • A dataset is considered “sufficient” if adding more data items will not increase the final accuracy of a trained model significantly. • We normally do not know if a dataset is sufficient or not. • Sufficiency detection: • Expensive “progressive sampling” experiment. • Keep on adding data and stop when accuracy doesn’t increase significantly. • Dependent on both dataset and algorithm • Difficult to make a general claim
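
The “progressive sampling” check can be made concrete. Below is a minimal sketch, assuming a scikit-learn-style classifier; the function name, step size, and plateau tolerance are illustrative choices rather than part of the original experiments.

```python
# Hedged sketch: probe data sufficiency by progressive sampling.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

def looks_sufficient(X, y, step=1000, tol=0.005, seed=0):
    """Keep adding `step` training examples; stop once held-out accuracy no
    longer improves by more than `tol`. Returns (plateaued?, learning curve)."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=seed)
    curve, prev_acc = [], 0.0
    for n in range(step, len(X_tr) + step, step):
        clf = DecisionTreeClassifier(random_state=seed).fit(X_tr[:n], y_tr[:n])
        acc = clf.score(X_val, y_val)
        curve.append((min(n, len(X_tr)), acc))
        if acc - prev_acc <= tol:      # accuracy has plateaued
            return True, curve
        prev_acc = acc
    return False, curve                # ran out of data before any plateau
```

As the slide notes, the outcome depends on both the dataset and the learning algorithm, so the plateau threshold is itself a judgment call.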

  6. Possible changes of data streams • Possible “concept drift”: • for the same feature vector, different class labels are generated at some later time, • or stochastically, with different probabilities. • Possible “data insufficiency”: the new data alone may not be enough. • Other possible changes are not addressed in our paper. • Most important of all: • these are “possibilities”. • There is no “oracle” out there to tell us the truth! • It is dangerous to make assumptions.

  7. How many combinations? • Four combinations: • Sufficient and no drift. • Insufficient and no drift. • Sufficient and drift. • Insufficient and drift • Question: Does the “most accurate model” remain the same under all four situations?

  8. Case 1: Sufficient and no drift • Solution one: • throw away old models and data; • re-train a new model from the new data. • Follows from the definition of data sufficiency. • Solution two: • if the old model was trained from “sufficient data”, just keep using the old model.

  9. Case 2: Sufficient and drift • Solution one: • Train a new model from new data • Same “sufficiency definition”.

  10. Case 3: Insufficient and no drift • Possibility I: if old model is trained from sufficient data, keep the old model. • Possibility II: otherwise, combine new data and old data, and train a new model.

  11. Case 4: Insufficient and drift • Obviously, the new data alone is not enough, by definition. • What are our options? • Use old data? • But how?

  12. A moving hyperplane

  13. A moving hyperplane

  14. See any problems? • Which old data items can we use?

  15. We need to be picky

  16. Inconsistent Examples

  17. Consistent examples

  18. See more problems? • We normally never know which of the four cases a real data stream belongs to. • It may change over time from case to case. • Normally, no ground truth is known a priori, or even later.

  19. Solution • Requirements: • The right solution should not be “one size fits all.” • It should not make any assumptions; any assumption can be wrong. • It should be adaptive. • Let the data speak for itself: • we prefer model A over model B if A is likely to be more accurate than B on the evolving data stream. • No assumptions!

  20. An “Un-biased” Selection framework • Train FN from new data. • Train FN+ from new data and selected consistent old data. • Assume FO is the previous most accurate model. Update FO using the new data. Call it FO+. • Use cross-validation to choose among the four candidate models {FN, FN+, FO, and FO+}.
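
A minimal sketch of this selection step. The helpers `train`, `update`, and `select_consistent` are hypothetical stand-ins for the base learner, the model-update rule, and the old-data filter (one possible `select_consistent` is sketched after the next slide); only the overall flow (build the four candidates and keep the cross-validation winner on the newest data) follows the slide.

```python
# Hedged sketch of the model-selection framework; helper functions are assumptions.
import numpy as np

def cv_accuracy(build_model, X, y, k=10, seed=0):
    """Generic k-fold cross-validation. `build_model(train_X, train_y)` must
    return an object with a .predict(X) method; returns mean held-out accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    accs = []
    for i in range(k):
        test = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = build_model(X[train_idx], y[train_idx])
        accs.append(np.mean(model.predict(X[test]) == y[test]))
    return float(np.mean(accs))

def choose_model(new_X, new_y, old_X, old_y, FO, train, update, select_consistent, k=10):
    """Build FN, FN+, FO, FO+ and keep the candidate with the best
    cross-validated accuracy on the newest data. `update` is assumed to
    return a new model rather than modifying FO in place."""
    FN = train(new_X, new_y)                              # model from new data only
    cons_X, cons_y = select_consistent(old_X, old_y, FN)  # consistent old data
    candidates = {
        "FN":  lambda tx, ty: train(tx, ty),
        "FN+": lambda tx, ty: train(np.vstack([tx, cons_X]),
                                    np.concatenate([ty, cons_y])),
        "FO":  lambda tx, ty: FO,                         # previous best, unchanged
        "FO+": lambda tx, ty: update(FO, tx, ty),         # previous best, updated
    }
    scores = {name: cv_accuracy(build, new_X, new_y, k)
              for name, build in candidates.items()}
    best = max(scores, key=scores.get)
    return best, candidates[best](new_X, new_y), scores
```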

  21. Consistent old data • Theoretically, if we knew the true models, we could use them to choose consistent data. But we don't. • Practically, we have to rely on “optimal models.” • Go back to the hyperplane example.
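
One simple reading of “consistent old data”, sketched under the assumption that the model trained on the new data stands in for the current optimal model: keep only those old examples whose recorded label agrees with that model's prediction. The name matches the `select_consistent` stand-in used in the earlier sketch.

```python
# Hedged sketch: filter old data for consistency with the current concept.
import numpy as np

def select_consistent(old_X, old_y, model_on_new_data):
    """Keep old examples whose labels agree with the predictions of the
    model trained on the new data (assumed to return hard class labels)."""
    pred = model_on_new_data.predict(old_X)
    mask = pred == old_y
    return old_X[mask], old_y[mask]
```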

  22. A moving hyperplane

  23. Their optimal models

  24. True model and optimal models • True model: a perfect model that never makes mistakes. • Not always attainable, due to: • the stochastic nature of the problem, • noise in the training data, • insufficient data. • Optimal model: defined with respect to a given loss function.

  25. Optimal Model • Loss function L(t, y) evaluates performance; t is the true label and y is the prediction. • The optimal decision y* is the label that minimizes the expected loss when x is sampled many times: • 0-1 loss: y* is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud. • Cost-sensitive loss: y* is the label that minimizes the “empirical risk”, e.g., if P(fraud|x) * $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud.
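
The two decision rules above, written out in a few lines; the $1000 recovered amount and $90 investigation cost are the slide's example figures.

```python
def decide_01(p_fraud):
    """0-1 loss: predict the most probable label."""
    return "fraud" if p_fraud > 0.5 else "normal"

def decide_cost_sensitive(p_fraud, gain=1000.0, cost=90.0):
    """Cost-sensitive loss: flag the transaction whenever the expected
    recovered amount exceeds the fixed investigation cost."""
    return "fraud" if p_fraud * gain > cost else "normal"

print(decide_01(0.12))              # normal (0.12 <= 0.5)
print(decide_cost_sensitive(0.12))  # fraud  (0.12 * 1000 = 120 > 90)
```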

  26. Random decision trees • Train multiple trees; details to follow. • Each tree outputs a posterior probability when classifying an example x. • The probability outputs of the many trees are averaged to form the final probability estimate. • The loss function and this probability are used to make the best prediction.

  27. Training • At each node, an unused feature is chosen randomly. • A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node. • A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.

  28. Example • [Diagram: an example random tree. The root tests Gender (M / F); deeper nodes test Age > 30 and Age > 25, and class counts such as P: 1, N: 9 and P: 100, N: 150 are stored at the nodes.]

  29. Training: Continued • We stop when one of the following happens: • A node becomes empty. • Or the total height of the tree exceeds a threshold, currently set as the total number of features. • Each node of the tree keeps the number of examples belonging to each class.

  30. Classification • Each tree outputs a membership probability: • p(fraud|x) = n_fraud / (n_fraud + n_normal). • If a leaf node is empty (likely when a discrete feature is tested near the end of a path): • use the parent node's probability estimate, but never output 0 or NaN. • The membership probabilities from the multiple random trees are averaged to form the final output. • A loss function is required to make a decision: • 0-1 loss: if p(fraud|x) > 0.5, predict fraud. • Cost-sensitive loss: if p(fraud|x) * $1000 > $90, predict fraud.
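
A minimal sketch of random decision trees restricted to binary features, to keep the code short; the actual trees also handle multi-valued discrete features and continuous features with randomly chosen thresholds. The structure is grown first and the class counts are filled in afterwards, which is one way to realize the later remark that the tree structure is independent of the data; all names are illustrative.

```python
# Hedged sketch of random decision trees over binary (0/1) features.
import random

class Node:
    def __init__(self):
        self.feature = None   # index of the feature tested here (None at a leaf)
        self.children = {}    # feature value (0 or 1) -> child Node
        self.counts = {}      # class label -> number of training examples seen here

def build_tree(n_features, depth=0, used=frozenset(), rng=random):
    """Grow the structure only: pick an unused feature at random per node.
    Keep n_features small here, since the full structure is materialized."""
    node = Node()
    unused = [f for f in range(n_features) if f not in used]
    if depth >= n_features or not unused:
        return node                               # leaf
    node.feature = rng.choice(unused)
    for value in (0, 1):
        node.children[value] = build_tree(n_features, depth + 1,
                                          used | {node.feature}, rng)
    return node

def update_counts(node, x, y):
    """Route one training example down the tree, incrementing the class
    counts of every node it passes through."""
    while True:
        node.counts[y] = node.counts.get(y, 0) + 1
        if node.feature is None:
            break
        node = node.children[x[node.feature]]

def posterior(node, x, target):
    """P(target | x) from one tree, using the deepest non-empty node on
    x's path so that an empty leaf never yields 0 or NaN."""
    best = node.counts
    while node.feature is not None:
        node = node.children[x[node.feature]]
        if node.counts:
            best = node.counts
    total = sum(best.values())
    return best.get(target, 0) / total if total else 0.0

def ensemble_posterior(trees, x, target):
    """Average the membership probabilities of all trees."""
    return sum(posterior(t, x, target) for t in trees) / len(trees)

# Usage: 10 random trees over 4 binary features.
rng = random.Random(0)
trees = [build_tree(4, rng=rng) for _ in range(10)]
data = [((0, 1, 1, 0), "fraud"), ((1, 0, 0, 0), "normal"), ((0, 1, 0, 0), "normal")]
for tree in trees:
    for x, y in data:
        update_counts(tree, x, y)
print(ensemble_posterior(trees, (0, 1, 1, 0), "fraud"))   # 1.0: this example sits alone in its leaf
```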

  31. N-fold Cross-validation with Random Decision Trees • The tree structure is independent of the data. • Compensation when computing the probability.

  32. Key advantage • n-fold cross-validation comes cheaply: • same cost as testing the model once on the training data. • Training is efficient, since we do not compute information gain. • It is also very accurate in practice.
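
One possible reading of the “compensation” point, built on the tree sketch above: because the structure never depends on the data, a leave-one-out estimate for a training example can be obtained by subtracting that example's own contribution from the counts along its path, instead of retraining per fold. This is an interpretation of the slide, not necessarily the paper's exact procedure.

```python
# Hedged sketch: leave-one-out probability by count compensation.
def posterior_without_self(node, x, y, target):
    """P(target | x) with x's own training contribution removed along its
    path; x is assumed to have been routed through this tree during training."""
    best, best_total = None, 0
    while True:
        counts = dict(node.counts)
        counts[y] = counts.get(y, 0) - 1          # remove x's own count here
        total = sum(counts.values())
        if total > 0:
            best, best_total = counts, total      # deepest non-empty node so far
        if node.feature is None:
            break
        node = node.children[x[node.feature]]
    return best.get(target, 0) / best_total if best else 0.0
```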

  33. Experiments • A demo is available; please contact me. • In the paper, I report experiments on: • synthetic datasets, • credit card fraud datasets, • donation datasets.

  34. Compare • The new selective framework proposed in this paper. • Last year's hard-coded ensemble framework: • uses K weighted ensembles. • K = 1: train only on the new data. • K = 8: use the new data and the models of the previous 7 periods; • each classifier is weighted against the new data. • Both sufficient and insufficient settings; the concept always drifts.

  35. Data insufficient: new method

  36. Data insufficient: last year's method

  37. Average result (data insufficient)

  38. Data sufficient: new method

  39. Data sufficient: last year’s method

  40. Average result (data sufficient)

  41. Independent studies and implementations of random decision trees • Kai Ming Ting and Tony Liu from Monash University, Australia, on UCI datasets. • Edward Greengrass from the DOD on their datasets: • 100 to 300 features, • both categorical and continuous features, • some features with many values, • 2000 to 3000 examples, • both binary and multi-class problems (16 and 25 classes).

  42. Related publications on random trees • “Is random model better? On its accuracy and efficiency”, ICDM 2003. • “On the optimality of probability estimation by random decision trees”, AAAI 2004. • “Mining concept-drifting data streams using ensemble classifiers”, SIGKDD 2003.
