Bayesian belief networks
The Reverend Bayes orders dinner
Given that I ate meatballs for lunch and that I like spaghetti, what do you think is the probability that I will order the spaghetti with meatballs?
Make-up project: The CoIL 2000 competition (groups of 2 encouraged)
Look at the lecture slides dealing with the CARAVAN CoIL 2000 survey competition. Read the competition description and submit an entry. Hint: start from the scripts in the lecture slides (use the latest version of the code). Carefully read the competition mission and note that it is important to come up with explanatory rules as well as a good predictive lift model!
Bayesian belief networks
• Reflections: looking backwards & looking forwards
  - Methods
  - Applications
  - Tasks
  - Some challenges for data mining
• Final report
  - Evaluation criteria
  - What to include in your report?
• Bayes revisited
  - Naïve Bayesian classification (reread chapter 4.2)
  - DMaK™ operators for Bayesian classification
  - Bayesian networks or belief networks (read Ch. 6.7, pp. 271-283)
• Neural networks (if there is time left)
Deadlines:
• February 1: HW #1 (Web browsing)
• February 4: no class
• February 8: HW #2 (Data Visualization); project topic introductory presentations; 10-minute quiz on chapters 1 & 2 (1-page, 2-sided cribsheet allowed)
• February 18: Project proposal
• March 1: HW #3 (multivariate "regression")
• March 15: 10-minute quiz on chapters 4 and 5
• March 18: Progress report #1 due
• March 29: HW #5: relevant paper discussion
• April 9: HW #6: data surveying (prepare 20-question survey + two-page editorial)
• April 12: 10-minute quiz (pp. 88-98; 271-283) (cribsheet)
• April 19: Progress report #2 due
• April 26: Guest lecture (Prof. Bennett on SVMs)
• April 29: no lecture
http://www.site.uottawa.ca/~nat/Courses/csi5387_2006/nips-criteria.html
Reflections: Looking backwards … and looking forward
• What more could we cover?
  - Web mining and link analysis
  - More on statistics and prediction margins
  - More on outlier and novelty detection techniques
  - How to include domain knowledge (Bayesian networks?) and transfer of domain-specific methods
  - Social networks
  - Gene expression arrays and regulatory networks
• What is/will become hot in data mining?
  - Reinforcement learning strategies
  - Causality/Active learning
  - Bayesian (belief) networks
  - Rough sets
  - Data-driven expert systems
  - Ontologies and linguistics
  - One-pass fraud detection with incremental learning
  - Translation: including translation between streaming media (e.g., movies to text, text to movies)
  - Streaming media and unstructured data (e.g., use of Kolmogorov complexity and multivariate sequence analysis)
General guidelines for final report

As a general rule, your report should be understandable by anyone with a reasonable understanding of machine learning but who doesn't know the particular approaches or the data that you used. Try also, while writing, to imagine that you are conversing with a very interactive reader (who doesn't know anything about your project but who wants to find out everything!). Try to be this reader and to guess all the questions s/he would ask you and all the challenges s/he would have for you. Then incorporate your answers to these questions and challenges in your report so that you have pre-empted many of your (real) readers' questions.

Contents

In addition to the introduction and conclusion (which can be thought of as summaries of your study directed at a general audience, but with more emphasis on your motivations in the case of the introduction and more emphasis on your results and their implications in the case of the conclusion), your report should contain:
- A statement of the problem you are studying.
- A review of the related literature on the topic and a discussion of where your study fits in this previous literature.
- A description of the method you have designed or of the methods you are comparing. Assume that the reader does not know how the systems you have designed and/or used work.
- A description of the data to which you applied your research (this description should include: number of features, values these features can take, size of the data set, size of the training and testing sets, etc.).
- A description of the methodology you used to set the various learning parameters of the systems you tested and a discussion of the optimal settings you found. This is particularly relevant in the context of neural networks, for example, where the number of hidden units, learning rates, momentum, number of RBFs, etc. have to be chosen by the user. The idea here is that your results should be reproducible by anyone reading your paper.
- A description of your testing methodology (e.g., 10-fold cross-validation) and a discussion of why this testing methodology is appropriate.
- A description of your results. Think of the format that would best illustrate the points you are trying to make. Should you list your results in a table? Represent them with a graph? What sort of graph? What results are necessary to report?
- A discussion of your results, i.e., a section that explains why, in your opinion, the results you reported were obtained: why the learners you considered were successful or why they failed. If you want, you can also discuss what you think would happen under conditions different from those you specifically tested.
- A discussion of the relevance of your results: what have you achieved with your study? How do your results support the claims you have made in the earlier parts of your report?
- A section discussing future work. There, you should try to identify sets of experiments that would be interesting to run and to discuss why they would be interesting (i.e., what are the issues that such experiments would test).

http://www.site.uottawa.ca/~nat/Courses/csi5387_2006/nips-criteria.html
Tom Mitchell [1997] Machine Learning, McGraw-Hill International Paperback Edition. Chapter 1: Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression (revised version for possible inclusion in a future edition).
Jiawei Han and M. Kamber [2005] Data Mining: Concepts and Techniques, 2nd Ed., Morgan Kaufmann Publishers.
P. N. Tan, M. Steinbach, and V. Kumar [2006] Introduction to Data Mining, Addison Wesley.
Methods (from machine learning, statistics, and computational intelligence):
Decision Trees, Naïve Bayesian Classifier, Neural Networks, Sequence Alignment, Clustering, Support Vector Machines, Bayesian Networks, Genetic Algorithms, Regulatory Pathways, Association Rules, Logistic Regression, Fuzzy Logic, K-PLS, Ant Colony & Swarm Optimization, Kernel Methods, Multivariate Regression, Rough Sets, Elements of Fractals & Chaos, Gaussian Mixture Models, Ridge Regression, Cellular Automata, More Regularization, Latent Variable Techniques, Kolmogorov Complexity, Immunocomputing, …
Applications:
Finance, Mind Mining, Intrusion Detection, Bioinformatics, Text Mining, Make-You-Rich Machines, Homeland Security, Web Mining, Time Series Analysis, Market Basket Analysis, Virtual Reality Kingdoms, Fraud Detection, Medical Diagnosis, Link Analysis, Predictive Modeling, Pattern Recognition, Molecular Design, Causality, Bond Rating, Translation, Gene Expression Arrays, Social Networks, Utility Load Forecasting, Search, Credit Card Fraud, Forensics, Sequence Alignment, ???
Tasks:
Classification, Multiple-Class Classification, Market Basket Analysis, Recommender Systems, Clustering, Unsupervised, Finding Nuggets, Fraud Detection, Regression, Rule Formulation, Supervised, Causality, Semi-Supervised, Pattern Recognition, Predictive Modeling, Regulatory Networks, Feature Selection, Variable Selection, Outlier & Novelty Detection
From Trevor Hastie, Modern Trends in Data Mining, Stanford University, November 2006
CREDIT CARD DATA REVISITED

dmak credit 3308
REM MAKE CATS (n 7)
dmak credit 121
REM SPLIT (600 2)
dmak credit.txt 20
copy cmatrix.txt credit.pat
copy dmatrix.txt credit.tes
REM DO BAYES
dmak credit.pat 122
dmak credit.tes 123
dmak results.ttt 260
pause
REM DO RULES (70 20)
dmak credit.pat 312

rules:
Married   263    65    19   253
A13        43   285    18   254

q2       Q2       AUC    %Correct   BER(%)   SEN(%)   ARI     RI      MSE     MAE
0.4722   0.5013   0.896  85.556     83.51    85.31    0.498   0.750   0.345   0.16
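For readers unfamiliar with the headers in the results row: SEN, BER, and %Correct follow their standard confusion-matrix definitions. Below is a minimal Python sketch of those definitions, not the DMaK implementation; the counts passed in are hypothetical placeholders, not the numbers from the run above.

def summary_metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)                    # SEN: true-positive rate
    specificity = tn / (tn + fp)                    # true-negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)      # %Correct
    ber = 1.0 - 0.5 * (sensitivity + specificity)   # balanced error rate
    return accuracy, sensitivity, ber

print(summary_metrics(tp=80, fp=10, tn=95, fn=15))  # hypothetical counts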
[Eight figure slides omitted; source: Trevor Hastie, Modern Trends in Data Mining, Stanford University, November 2006]
Data mining challenges
• Extraction techniques for simple plausible rules
• Causality
• Regulatory networks
• Bioinformatics
• Very large datasets
• Multivariate time series with stochastic collinearity
• Intent dynamics (e.g., ordering of a story)
• One-pass searching in streaming media (e.g., search for a dog scene)
• Outlier and novelty detection
• Text mining (e.g., translation)
http://www.ddj.com/development-tools/184406064
Text mining issues
• Ontologies and linguistics
• Machine translation
• Searching for duplicates/plagiarism
• Synthesis
• Finding names
A letter from the late Reverend Mr. Thomas Bayes, F. R. S. to John Canton, M. A. and F. R. S. [1763] Thomas Bayes, Philosophical Transactions (1683-1775), Vol. 53, pp. 269-271
LII. An Essay towards solving a Problem in the Doctrine of Chances. By the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, M. A. and F. R. S.

“The purpose I mean is, to shew what reason we have for believing that there are in the constitution of things fixt laws according to which things happen, and that, therefore, the frame of the world must be the effect of the wisdom and power of an intelligent cause; and thus to confirm the argument taken from final causes for the existence of the Deity. It will be easy to see that the converse problem solved in this essay is more directly applicable to this purpose; for it shews us, with distinctness and precision, in every case of any particular order or recurrency of events, what reason there is to think that such recurrency or order is derived from stable causes or regulations in nature, and not from any irregularities of chance.”

Philosophical Transactions of the Royal Society of London 53 (1763), pp. 370–418
Thomas Bayes: Addresses the Inverse Probability Problem
• Forward problem (Bernoulli trials, which give rise to the binomial distribution):
  - Assuming that an unbiased coin has 50% probability for heads or tails, do 10 coin tosses and generate a random series based on this assumption
  - Coin tosses are independent of each other
  - Assume a series of observations based on a gazillion coin tosses
  - Fair questions now could be:
    What is the average fraction of heads?
    What is the probability of seeing an HTHTHTT pattern?
• Inverse problem:
  - Observe a finite series of coin tosses
    Is this a fair coin? (i.e., what is the heads-or-tails probability?)
    What is the probability that this is a fair coin?
• Corollary:
  - Leaving out evidence or cherry-picking evidence is a crime
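A minimal Python sketch of both directions of the coin problem. The forward part follows the slide directly; the inverse part uses the standard Bayesian treatment (a uniform Beta(1, 1) prior on the heads probability), which is my own illustrative choice, not something stated on the slide.

import random

# Forward problem: assume a fair coin (P(heads) = 0.5) and generate
# a random series of 10 independent tosses.
tosses = ['H' if random.random() < 0.5 else 'T' for _ in range(10)]
print(''.join(tosses))

# Independence makes forward questions easy: the probability of any
# specific 7-toss pattern, such as HTHTHTT, is 0.5 ** 7.
print(0.5 ** 7)  # 0.0078125

# Inverse problem: observe k heads in n tosses and infer the heads
# probability p. With a uniform Beta(1, 1) prior on p, the posterior
# is Beta(k + 1, n - k + 1); its mean is the classic Laplace estimate.
n, k = 10, 7
print((k + 1) / (n + 2))  # ~0.667, vs. the raw frequency 0.7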
Bayes' Theorem

Terminology:
- Hypothesis: H
- Evidence: E
- Prior or a priori probability: P(H)
- Posterior probability: P(H|E)
- Likelihood: P(E|H), the probability of E conditioned on H

P(H|E) = P(E|H) P(H) / P(E)

Trick:
- Estimate P(E|H), P(H), and P(E) from observed data
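A tiny numerical sketch of the theorem in Python. The scenario and all numbers are hypothetical (a test with 99% sensitivity for a condition with 1% prevalence), chosen only to show the prior/likelihood/evidence roles.

def posterior(p_e_given_h, p_h, p_e):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return p_e_given_h * p_h / p_e

p_h = 0.01                        # prior P(H): 1% prevalence
p_e_given_h = 0.99                # likelihood P(E|H): 99% sensitivity
p_e = 0.99 * 0.01 + 0.02 * 0.99   # evidence P(E) via total probability
print(posterior(p_e_given_h, p_h, p_e))  # ~0.333: a positive test is far from certain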
Naïve Bayesian Classifier

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
(i) Let D be a training set of tuples and their associated class labels. Each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn), containing n measurements made on the tuple for the n attributes A1, …, An.
(ii) Suppose that there are k classes (hypotheses): C1, C2, …, Ck. Given a tuple X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ k, j ≠ i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis, which can be calculated according to Bayes' theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Jiawei Han and M. Kamber [2005] Data Mining: Concepts and Techniques, 2nd Ed., Morgan Kaufmann Publishers
Naïve Bayesian Classifier with Maximum Likelihood Estimator

(iii) As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = … = P(Ck), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci) P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of training tuples of class Ci in D.

(iv) Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naïve assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple, i.e., that there are no dependence relationships among the attributes:

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the training tuples. Recall that here xk refers to the value of attribute Ak for tuple X. For each attribute, we also look at whether the attribute is categorical or continuous-valued (see the sketch below).

Jiawei Han and M. Kamber [2005] Data Mining: Concepts and Techniques, 2nd Ed., Morgan Kaufmann Publishers
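A minimal sketch of step (iv)'s per-attribute estimates in Python: relative frequency for a categorical attribute, and, for a continuous attribute, the common choice (used by Han & Kamber) of a Gaussian fitted to the class's sample mean and variance. Function names and the example values are my own illustrations.

import math

def categorical_likelihood(class_values, x):
    """P(x_k = x | C_i): fraction of class-C_i training tuples with value x."""
    return class_values.count(x) / len(class_values)

def gaussian_likelihood(class_values, x):
    """Continuous attribute: Gaussian density with the class's sample
    mean and (unbiased) variance; needs at least two values."""
    n = len(class_values)
    mu = sum(class_values) / n
    var = sum((v - mu) ** 2 for v in class_values) / (n - 1)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(categorical_likelihood(["youth", "senior", "youth"], "youth"))  # 2/3
print(gaussian_likelihood([35.0, 40.0, 45.0], 38.0))  # a density, not a probability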
Naïve Bayesian Classifier with Maximum Likelihood Estimator (ctd.)

(v) In order to predict the class label of X, P(X|Ci) P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ k, j ≠ i.
In other words, the predicted class label is the class Ci for which P(X|Ci) P(Ci) is the maximum.

Jiawei Han and M. Kamber [2005] Data Mining: Concepts and Techniques, 2nd Ed., Morgan Kaufmann Publishers
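A compact Python sketch of steps (i)-(v) for categorical attributes. This is illustrative code, not the DMaK operators; the function names are my own. Note the known caveat: raw relative-frequency estimates give P(xk|Ci) = 0 for any attribute value unseen in class Ci, zeroing out the whole product (Laplace smoothing is the usual fix).

from collections import Counter, defaultdict

def train_naive_bayes(tuples, labels):
    """Estimate the priors P(C_i) and the counts behind P(x_k | C_i)."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)  # (class, attr index) -> value counts
    for x, c in zip(tuples, labels):
        for k, v in enumerate(x):
            cond_counts[(c, k)][v] += 1
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    return priors, cond_counts, class_counts

def predict(x, priors, cond_counts, class_counts):
    """Return the class C_i maximizing P(X|C_i) P(C_i)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for k, v in enumerate(x):
            score *= cond_counts[(c, k)][v] / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class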
Example: AllElectronics Customer Database*

Question: classify the tuple X = {age = youth, income = medium, student = yes, CR (credit rating) = fair}

*J. R. Quinlan [1986] Induction of decision trees. Machine Learning, Vol. 1, pp. 81-106
Example: AllElectronics Customer Database (ctd.)

Question: classify the tuple X = {age = youth, income = medium, student = yes, CR = fair}

Jiawei Han and M. Kamber [2005] Data Mining: Concepts and Techniques, 2nd Ed., Morgan Kaufmann Publishers
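For reference, here is this query run through the train_naive_bayes/predict sketch from the earlier slide (execute the two blocks together). The 14 training tuples are reproduced from Han & Kamber's AllElectronics table, and the hand-computed probabilities match the textbook's worked example.

# The 14 AllElectronics training tuples (age, income, student,
# credit_rating; class label: buys_computer), per Han & Kamber.
data = [
    ("youth",       "high",   "no",  "fair",      "no"),
    ("youth",       "high",   "no",  "excellent", "no"),
    ("middle_aged", "high",   "no",  "fair",      "yes"),
    ("senior",      "medium", "no",  "fair",      "yes"),
    ("senior",      "low",    "yes", "fair",      "yes"),
    ("senior",      "low",    "yes", "excellent", "no"),
    ("middle_aged", "low",    "yes", "excellent", "yes"),
    ("youth",       "medium", "no",  "fair",      "no"),
    ("youth",       "low",    "yes", "fair",      "yes"),
    ("senior",      "medium", "yes", "fair",      "yes"),
    ("youth",       "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no",  "excellent", "yes"),
    ("middle_aged", "high",   "yes", "fair",      "yes"),
    ("senior",      "medium", "no",  "excellent", "no"),
]
tuples = [row[:4] for row in data]
labels = [row[4] for row in data]

priors, cond_counts, class_counts = train_naive_bayes(tuples, labels)
X = ("youth", "medium", "yes", "fair")

# By hand: P(X|yes) P(yes) = 2/9 * 4/9 * 6/9 * 6/9 * 9/14 ~= 0.028
#          P(X|no)  P(no)  = 3/5 * 2/5 * 1/5 * 2/5 * 5/14 ~= 0.007
print(predict(X, priors, cond_counts, class_counts))  # -> "yes"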