
Classification using Decision Trees


marla

Presentation Transcript


  1. Classification using Decision Trees • Data Mining and Information • Data Mining and Machine Learning Techniques • Decision trees and C5 • Applications • Plan for this week

  2. Data Mining and Information • Any result should answer a practical or theoretical question. • For your results to be useful, they must be interpretable in most applications. • Data mining -- the process of finding, interpreting, and evaluating patterns in large sets of data.

  3. Data Mining and Machine Learning Techniques • Machine learning programs adapt their behavior with experience. To “learn” is to be trained by data with a set of well-defined instructions – machine learning algorithms. • Data mining tools are supplements, rather than substitutes, for human knowledge and intuition. • The objective of running the learning algorithm on the data is to find patterns or trends that will aid in understanding the data.

  4. Model Classification by Outcome

  5. Classification Problem • Given a dataset D and a class label C, find a classifier d such that the misclassification rate of d is minimized. • Goal – to produce an accurate classifier and to understand the problem structure • Requirements: high accuracy, interpretable, fast construction for very large training data
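The misclassification rate in slide 5 can be made concrete with a small sketch. The record layout (a dict of attribute values paired with a class label), the toy weather data, and the hand-written classifier `d` are all assumptions made for illustration; they do not come from the slides.

```python
from typing import Callable, Dict, List, Tuple

# Assumed record layout: (attribute values, class label C)
Record = Tuple[Dict[str, str], str]


def misclassification_rate(d: Callable[[Dict[str, str]], str],
                           data: List[Record]) -> float:
    """Fraction of records in `data` whose predicted label differs from the true label."""
    if not data:
        raise ValueError("data must not be empty")
    errors = sum(1 for attributes, label in data if d(attributes) != label)
    return errors / len(data)


# Hypothetical toy dataset D
D: List[Record] = [
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "rain"}, "yes"),
    ({"outlook": "overcast"}, "yes"),
    ({"outlook": "sunny"}, "yes"),
]


def d(attributes: Dict[str, str]) -> str:
    """A hand-written classifier: predict 'no' on sunny days, 'yes' otherwise."""
    return "no" if attributes["outlook"] == "sunny" else "yes"


rate = misclassification_rate(d, D)  # one of four records is misclassified
```

A learning algorithm searches for the `d` that makes this rate small; measuring it on data not used for training gives a more honest estimate.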

  6. Decision Trees • A decision tree T encodes d (a classifier) in the form of a tree • Internal node – binary or k-ary splits • Leaf node – labeled with one class label
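One possible way to represent the tree described in slide 6 is sketched below. The class name `TreeNode`, the function `classify`, and the toy tree are hypothetical; the sketch assumes categorical attributes, so each internal node has one child per attribute value.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TreeNode:
    label: Optional[str] = None        # set only on leaf nodes
    attribute: Optional[str] = None    # splitting attribute, set only on internal nodes
    children: Dict[str, "TreeNode"] = field(default_factory=dict)

    def is_leaf(self) -> bool:
        return self.label is not None


def classify(node: TreeNode, record: Dict[str, str]) -> str:
    """Follow the branch matching the record's attribute value until a leaf is reached."""
    while not node.is_leaf():
        node = node.children[record[node.attribute]]
    return node.label


# Toy tree: a 3-ary split on "outlook", then a binary split on "humidity"
tree = TreeNode(attribute="outlook", children={
    "sunny": TreeNode(attribute="humidity", children={
        "high": TreeNode(label="no"),
        "normal": TreeNode(label="yes"),
    }),
    "overcast": TreeNode(label="yes"),
    "rain": TreeNode(label="yes"),
})

prediction = classify(tree, {"outlook": "sunny", "humidity": "high"})
```

Classifying a record costs one attribute lookup per level of the tree, which is part of why trees are cheap to apply once built.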

  7. Decision Tree Construction • Top-down tree construction schema: • Examine the training data and find the best splitting attribute for the root node • Partition the training data • Recur on each child node

  8. Decision Tree Construction (contd.) BuildTree (Node t, Training data D, Split Selection Method S) • Apply S to D to find the splitting criterion • If (t is not a leaf node) • create children nodes of t • partition D into children partitions • recur on each partition • Endif Three algorithmic components: • Split selection (C5, CART, QUEST, …) • Pruning • Data access
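The BuildTree schema can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the C5, CART, or QUEST procedure: the leaf test (a pure partition, or no attributes left), the nested-dict tree layout, and the placeholder split selection `fewest_errors_split` are all made up for the example, and pruning and data access are omitted.

```python
from collections import Counter


def majority_label(data):
    """Most frequent class label in a list of (record, label) pairs."""
    return Counter(label for _, label in data).most_common(1)[0][0]


def fewest_errors_split(data, attributes):
    """Placeholder split selection S: pick the attribute whose partitions,
    each labeled by majority vote, misclassify the fewest training records."""
    def errors(attr):
        groups = {}
        for record, label in data:
            groups.setdefault(record[attr], []).append(label)
        return sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups.values())
    return min(attributes, key=errors)


def build_tree(data, attributes, select_split):
    """Top-down construction: select a split, partition D, recur on each partition."""
    labels = {label for _, label in data}
    # Assumed leaf test: the partition is pure or no attributes remain
    if len(labels) == 1 or not attributes:
        return {"label": majority_label(data)}
    attr = select_split(data, attributes)          # apply S to D
    partitions = {}
    for record, label in data:                     # partition D by attribute value
        partitions.setdefault(record[attr], []).append((record, label))
    remaining = [a for a in attributes if a != attr]
    return {
        "attribute": attr,
        "children": {value: build_tree(part, remaining, select_split)
                     for value, part in partitions.items()},
    }


# Hypothetical training data
D = [
    ({"outlook": "sunny", "windy": "no"}, "no"),
    ({"outlook": "sunny", "windy": "yes"}, "no"),
    ({"outlook": "rain", "windy": "no"}, "yes"),
    ({"outlook": "rain", "windy": "yes"}, "no"),
    ({"outlook": "overcast", "windy": "no"}, "yes"),
    ({"outlook": "overcast", "windy": "yes"}, "yes"),
]

tree = build_tree(D, ["outlook", "windy"], fewest_errors_split)
```

Passing the split selection in as a function mirrors the slide's point that split selection is a separate component: swapping in a different criterion leaves the recursion unchanged.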

  9. Split Selection Methods • Impurity-based split selection: CART, C5 (most common in today’s data mining tools) • Model-based split selection: QUEST (quick, unbiased, efficient statistical tree; Loh and Shih, 1997; freeware, available at www.stat.wisc.edu/~loh)
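To illustrate what "impurity-based" means, the sketch below computes two widely used impurity measures and the reduction in impurity achieved by a candidate split. The Gini index is commonly associated with CART and entropy with Quinlan's ID3 family; the slides do not state the exact criterion C5 uses, so treat the choice of measures and the function names here as assumptions.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def impurity_reduction(parent, partitions, impurity):
    """Parent impurity minus the size-weighted average impurity of the child partitions."""
    n = len(parent)
    weighted = sum(len(part) / n * impurity(part) for part in partitions)
    return impurity(parent) - weighted


# Hypothetical node with two records of each class
parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # children are pure
useless_split = [["yes", "no"], ["yes", "no"]]   # children look like the parent

gain_perfect = impurity_reduction(parent, perfect_split, entropy)
gain_useless = impurity_reduction(parent, useless_split, entropy)
```

An impurity-based method evaluates each candidate split this way and keeps the one with the largest reduction.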

  10. Decision Trees and C5 • One of the data mining methods most commonly reported in the literature. • C5 is a software package by J. R. Quinlan based on the decision tree method. • One major advantage of decision trees over other machine learning techniques is that they produce models (rules) that can be interpreted by humans. • To learn more about Rule Induction …
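The interpretability point can be shown by reading a tree as a set of IF-THEN rules, one per root-to-leaf path. This is only a path enumeration over a hypothetical nested-dict tree; it is not C5's rule-generation procedure, which the slides do not describe.

```python
def tree_to_rules(node, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path of a nested-dict tree."""
    if "label" in node:  # leaf: emit the rule for the path walked so far
        antecedent = " AND ".join(f"{attr} = {value}" for attr, value in conditions) or "TRUE"
        return [f"IF {antecedent} THEN class = {node['label']}"]
    rules = []
    for value, child in node["children"].items():
        rules.extend(tree_to_rules(child, conditions + ((node["attribute"], value),)))
    return rules


# Hypothetical tree; the layout (attribute/children/label keys) is an assumption
tree = {
    "attribute": "outlook",
    "children": {
        "sunny": {"label": "no"},
        "rain": {
            "attribute": "windy",
            "children": {"no": {"label": "yes"}, "yes": {"label": "no"}},
        },
        "overcast": {"label": "yes"},
    },
}

rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```

Each printed rule can be checked by a domain expert without any knowledge of how the tree was built.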

  11. CSUS Access to C5 • Log in to quad • Change directory to /opt/C50Release1 • Read the “ReadMe” file for an example and the format requirements • You are ready to use C5 • An example of a C5 application

  12. Extracting Knowledge from Gene Expression Data: A Case Study of Batten Disease (S. M. Lin) • Duke University Medical Center proposed a prototype KDD system to enable scientists to analyze the massive microarray data, form hypotheses, and draw insights directly into underlying mechanisms of diseases. • Data → Microarray database → data mining → patterns → human experts → Genomics knowledge base discoveries

  13. Plan for this week • Monday (Lu, Dunham part II) • DT-based: 1R, ID3, C5, CART • Rule-generating: Prism • Wednesday (Han ch. 7, Dunham part II) • Statistics-based: Regression (D), Naïve Bayes • Distance-based: KNN (D) • ANN
