Data Mining


  1. Data Mining Group Members: Yue Ma Patrick Michel Alecia C. Campbell Paul Romain Yijia Guo

  2. What is Data Mining? • Structural patterns: explicit descriptions of what the data is about, extracted from the given information • Programs: help us detect patterns and regularities in the data • Strong patterns: can be used to make predictions • Problem 1: most patterns are not interesting • Problem 2: patterns may be inexact if the data is garbled or incomplete

  3. What is Data Mining? • Machine learning: algorithmic techniques for finding and describing structural patterns in the data • Structural descriptions: represent patterns explicitly • Predict the outcome in a new situation • Explain how the prediction is derived • Methods draw on artificial intelligence, statistics, and database research • Consequently, this helps us formulate predictions from data in science, medicine, and business

  4. Data

  5. Data: Classification vs. Association Rules • Classification rule: predicts the value of a pre-specified attribute (a single If … Then) • Association rule: predicts the value of an arbitrary attribute, or a combination of attributes (multiple If … Then rules may apply)
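  Illustrative examples (assuming the weather data used in the demo later): a classification rule might read "If outlook = sunny and humidity = high then play = no", predicting the pre-specified class attribute, while an association rule might read "If temperature = cool then humidity = normal", predicting an arbitrary attribute.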

  6. Data: Association Rules

  7. Decision Tree (Figure 1.2: Decision trees)

  8. Decision Tree Introduction • What is a decision tree? • A decision tree is a predictive model: a mapping from observations about an item to conclusions about its target value. • Each branch node represents a choice among a number of alternatives, and each leaf node represents a classification or decision. A leaf specifies the expected value of the goal attribute for the records described by the path from root to leaf.

  9. Decision Tree Representation • Decision trees can also be represented as sets of if-then rules, which improves human readability. See the example sketched below:
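  As a minimal sketch of this representation (Java; class and method names are illustrative), here is the weather tree from the J48 slide later in this deck, written out directly as if-then rules:

    // A decision tree expressed as if-then rules (illustrative sketch).
    // The rules mirror the J48 weather tree shown later in this deck:
    // (outlook = sunny AND humidity = normal) OR (outlook = overcast)
    // OR (outlook = rain AND wind = weak)  =>  play = yes
    public class PlayTennisRules {

        static String classify(String outlook, String humidity, String wind) {
            if (outlook.equals("sunny")) {
                // Sunny days: play only when humidity is normal
                return humidity.equals("normal") ? "yes" : "no";
            }
            if (outlook.equals("overcast")) {
                // Overcast days: always play
                return "yes";
            }
            // Rainy days: play only when the wind is weak
            return wind.equals("weak") ? "yes" : "no";
        }

        public static void main(String[] args) {
            System.out.println(classify("sunny", "high", "weak"));      // no
            System.out.println(classify("overcast", "high", "strong")); // yes
            System.out.println(classify("rain", "normal", "weak"));     // yes
        }
    }

  Classifying an instance with this method traces exactly one path from the root test (outlook) down to a leaf, which is how the tree form works as well.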

  10. Decision Tree Classification • A decision tree classifies instances by sorting them down the tree from the root to a leaf node, which provides the classification of the instance. • Each internal node specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values of that attribute. • A decision node specifies a test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test; a leaf node indicates the value of the target attribute (class) for the examples that reach it.

  11. A Look at Decision Trees • The ID3 learning algorithm • The algorithm is guided by Occam's razor: it prefers smaller decision trees over larger ones. However, it does not always produce the smallest tree, and is therefore a heuristic. • The ID3 algorithm can be summarized as follows: • Take all unused attributes and compute their entropy with respect to the training samples • Choose the attribute for which the (conditional) entropy is minimum • Make a node containing that attribute, then repeat on each resulting subset (a sketch of the selection step follows below)
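  A small, self-contained sketch of this selection step (Java; the three training days are taken from the demo slides below, everything else is illustrative):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ID3Step {

        // Shannon entropy (base 2) of a list of class labels
        static double entropy(List<String> labels) {
            Map<String, Integer> counts = new HashMap<>();
            for (String l : labels) counts.merge(l, 1, Integer::sum);
            double h = 0.0, n = labels.size();
            for (int c : counts.values()) {
                double p = c / n;
                h -= p * Math.log(p) / Math.log(2);
            }
            return h;
        }

        // Conditional entropy H(labels | attribute): weighted entropy of the
        // label subsets produced by splitting on the attribute's values
        static double conditionalEntropy(List<String> attr, List<String> labels) {
            Map<String, List<String>> parts = new HashMap<>();
            for (int i = 0; i < labels.size(); i++)
                parts.computeIfAbsent(attr.get(i), k -> new ArrayList<>()).add(labels.get(i));
            double h = 0.0, n = labels.size();
            for (List<String> part : parts.values())
                h += (part.size() / n) * entropy(part);
            return h;
        }

        public static void main(String[] args) {
            // Three training days from the demo slides (Outlook vs. Play)
            List<String> outlook = List.of("Rain", "Sunny", "Overcast");
            List<String> play    = List.of("No",   "No",    "Yes");

            // ID3 picks the attribute with the minimum conditional entropy
            System.out.printf("H(Play)           = %.3f%n", entropy(play));                     // 0.918
            System.out.printf("H(Play | Outlook) = %.3f%n", conditionalEntropy(outlook, play)); // 0.000
        }
    }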

  12. Information Gain • Definition: In general terms, the expected information gain is the change in information entropy from a prior state to a state that takes some information as given: • IG(Ex, a) = H(Ex) − H(Ex | a)
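  Worked on the three demo days shown later (Play = No, No, Yes; splitting on Outlook; log base 2), the formula gives: • H(Ex) = −(2/3) log2(2/3) − (1/3) log2(1/3) ≈ 0.918 • H(Ex | Outlook) = 0, since each Outlook value occurs exactly once and every resulting subset is pure • IG(Ex, Outlook) = 0.918 − 0 = 0.918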

  13. A Look at Decision Trees • Information gain as it relates to decision trees • In particular, the information gain about a random variable X obtained from an observation that a random variable A takes the value A = a is the Kullback-Leibler divergence D_KL( p(x|a) || p(x|I) ) of the prior distribution p(x|I) for x from the posterior distribution p(x|a) for x given a.

  14. Drawbacks of Information Gain in Decision Tree Algorithms • Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect. A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values. • For example, suppose that we are building a decision tree from data describing a business's customers. Information gain is often used to decide which attributes are the most relevant, so that they can be tested near the root of the tree. One of the input attributes might be the customer's credit card number. • This attribute has high information gain, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before.
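  In terms of the formula above: an attribute a that is unique to each record, like a credit card number, splits the data into single-record subsets that are all perfectly pure, so H(Ex | a) = 0 and IG(Ex, a) = H(Ex), the maximum possible value, even though such an attribute is useless for prediction.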

  15. Demo: I want to play tennis today.

  16. Data of weather • Outlook: sunny, overcast, rainy • Temperature: hot, mild, cool • Humidity: high, normal • Windy: true, false • Play: Yes, No

  17. Day 1 • Outlook: Rain • Windy: True • Play: No

  18. Day 2 • Outlook: Sunny • Humidity: High • Play: No

  19. Day 3 • Outlook: Overcast • Windy: False • Humidity: Normal • Play: Yes

  20. Weka : Data Mining Software • Weka is a collection of machine learning algorithms for data mining tasks. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. • Weka is open source software issued under the GNU General Public License.

  21. Decision Tree: J48 • A new example can be classified simply by tracing a path from the root of the tree to a leaf, with the path taken determined by the input attribute values of the example. • J48 is Weka's implementation of the C4.5 decision tree learning algorithm invented by Ross Quinlan of the University of Sydney.
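  A minimal sketch of calling J48 programmatically (assumes weka.jar on the classpath; the file name is illustrative, though a weather.nominal.arff ships with Weka's sample data):

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Demo {
        public static void main(String[] args) throws Exception {
            // Load the nominal weather data (path is illustrative)
            Instances data = new DataSource("weather.nominal.arff").getDataSet();

            // The last attribute ("play") is the class to predict
            data.setClassIndex(data.numAttributes() - 1);

            // Build the C4.5-style tree and print its if-then structure
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);
        }
    }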

  22. Decision Tree: J48 (outlook = “sunny” ∧ humidity = “normal”) ∨ (outlook = “overcast”) ∨ (outlook = “rain” ∧ wind = “weak”)

  23. Decision Tree: ID3 • How does ID3 decide which attribute is best? A statistical property, called information gain, is used. Gain measures how well a given attribute separates the training examples into the target classes.

  24. Decision Tree: ID3

  25. The future of data mining?
