

  1. Data Mining

  2. Outline
  • What is data mining?
  • Data Mining Tasks
    • Association
    • Classification
    • Clustering
  • Data Mining Algorithms
  • Are all the patterns interesting?

  3. What is Data Mining:
  • The huge number of databases and web pages makes manual information extraction next to impossible (remember the favored statement: "I will bury them in data!")
  • Many other disciplines (statistics, AI, information retrieval) lack algorithms scalable enough to extract information and/or rules from databases
  • The necessity to find relationships among data

  4. What is Data Mining:
  • Discovery of useful, possibly unexpected data patterns
  • Subsidiary issues:
    • Data cleansing
    • Visualization
    • Warehousing

  5. Examples
  • A big objection to large-scale intelligence data mining was that it looked for so many vague connections that it was sure to find bogus patterns and thus violate innocents' privacy.
  • The Rhine Paradox: a great example of how not to conduct scientific research.

  6. Rhine Paradox --- (1)
  • Joseph Rhine was a parapsychologist in the 1950's who hypothesized that some people had Extra-Sensory Perception.
  • He devised an experiment where subjects were asked to guess 10 hidden cards --- red or blue.
  • He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right! (By pure chance, a subject guesses all 10 correctly with probability (1/2)^10 = 1/1024, i.e., roughly 1 in 1000.)

  7. Rhine Paradox --- (2) • He told these people they had ESP and called them in for another test of the same type. • Alas, he discovered that almost all of them had lost their ESP. • What did he conclude? • Answer on next slide.

  8. Rhine Paradox --- (3) • He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.

  9. A Concrete Example • This example illustrates a problem with intelligence-gathering. • Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil. • We want to find people who at least twice have stayed at the same hotel on the same day.

  10. The Details
  • 10^9 people being tracked.
  • 1,000 days.
  • Each person stays in a hotel 1% of the time (10 days out of 1,000).
  • Hotels hold 100 people, so there are 10^5 hotels (10^9 × 1% = 10^7 guests per day, filling 10^7 / 100 = 10^5 hotels).
  • If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?

  11. Calculations --- (1)
  • Probability that persons p and q are at the same hotel on a given day d (both are in some hotel, and it is the same one out of 10^5):
    1/100 × 1/100 × 10^-5 = 10^-9
  • Probability that p and q are at the same hotel on two given days:
    10^-9 × 10^-9 = 10^-18
  • Pairs of days: C(1000, 2) ≈ 5×10^5.

  12. Calculations --- (2)
  • Probability that p and q are at the same hotel on some two days:
    5×10^5 × 10^-18 = 5×10^-13
  • Pairs of people: C(10^9, 2) ≈ 5×10^17.
  • Expected number of suspicious pairs of people:
    5×10^17 × 5×10^-13 = 250,000
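
A quick way to check these numbers is to redo the arithmetic in code. The sketch below simply re-expresses the slide's assumptions (10^9 people, 10^5 hotels, 1,000 days, 1% hotel occupancy) in Python:

    # Expected number of "suspicious" pairs under purely random behavior.
    n_people = 10**9       # people being tracked
    n_hotels = 10**5       # hotels, each holding 100 people
    n_days = 1000          # observation period
    p_hotel = 0.01         # probability a person is in some hotel on a day

    # P(p and q are at the same hotel on one given day):
    # both are in a hotel, and it is the same one of the 10^5 hotels.
    p_same_day = p_hotel * p_hotel / n_hotels         # = 1e-9

    # P(same hotel on two *given* days):
    p_two_days = p_same_day ** 2                      # = 1e-18

    pairs_of_days = n_days * (n_days - 1) // 2        # ≈ 5e5
    pairs_of_people = n_people * (n_people - 1) // 2  # ≈ 5e17

    expected = pairs_of_people * pairs_of_days * p_two_days
    print(f"{expected:,.0f}")                         # ≈ 250,000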

  13. Conclusion • Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice. • Analysts have to sift through 250,010 candidates to find the 10 real cases. • Not gonna happen. • But how can we improve the scheme?

  14. Appetizer
  • Consider a file consisting of 24,471 records. The file contains at least two condition attributes, A and D.

  15. Appetizer (cont.)
  • Probability that a person has A: P(A) = 0.6
  • Probability that a person has D: P(D) = 0.02
  • Conditional probability that a person has D given that they have A: P(D|A) = P(AD)/P(A) = (272/24471)/0.6 ≈ 0.02
  • P(A|D) = P(AD)/P(D) = (272/24471)/0.02 ≈ 0.56
  • What can we say about dependencies between A and D?
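
These conditional probabilities are easy to verify mechanically. A minimal sketch, using the counts from the slide (272 is the number of records having both A and D):

    # Verify the appetizer probabilities.
    n_records = 24471
    p_a = 0.6                    # P(A)
    p_d = 0.02                   # P(D)
    p_ad = 272 / n_records       # P(A and D) ≈ 0.0111

    print(p_ad / p_a)            # P(D|A) ≈ 0.02 -- essentially equal to P(D)
    print(p_ad / p_d)            # P(A|D) ≈ 0.56 -- close to P(A) = 0.6

    # Under independence we would expect P(AD) = P(A) * P(D):
    print(p_a * p_d)             # 0.012 vs. the observed 0.0111

Since P(D|A) ≈ P(D) and P(A|D) ≈ P(A), the numbers hint that A and D are close to independent.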

  16. Appetizer (3)
  • So far we have not asked anything that statistics would not have asked. So is data mining just another word for statistics?
  • We hope that the response will be a resounding NO.
  • The major difference is that statistical methods work with random data samples, whereas the data in databases is not necessarily random.
  • The second difference is the size of the data set.
  • The third difference is that statistical samples do not contain "dirty" data.

  17. Architecture of a Typical Data Mining System
  Layered, from top to bottom:
  • Graphical user interface
  • Pattern evaluation
  • Data mining engine
    (the pattern-evaluation and mining layers consult a knowledge base)
  • Database or data warehouse server (with filtering)
  • Data cleaning & data integration
  • Databases and data warehouse

  18. Data Mining Tasks
  • Association (correlation and causality)
    • Multi-dimensional vs. single-dimensional association
    • age(X, "20..29") ^ income(X, "20..29K") -> buys(X, "PC") [support = 2%, confidence = 60%]
    • contains(T, "computer") -> contains(T, "software") [1%, 75%]
    • What is support? The percentage of tuples in the database in which age is between 20 and 29, income is between 20K and 29K, and a PC is bought.
    • What is confidence? The probability that a person aged 20..29 with income 20K..29K buys a PC.
  • Clustering: getting data that are close together into the same cluster.
    • What does "close together" mean?
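
Support and confidence are just frequency counts over the transaction database. A minimal sketch, with a made-up toy transaction list (the item names are illustrative):

    # Support and confidence of an association rule X -> Y.
    transactions = [
        {"computer", "software", "printer"},
        {"computer", "software"},
        {"computer"},
        {"printer"},
    ]

    def support(itemset, db):
        """Fraction of transactions that contain every item in `itemset`."""
        return sum(itemset <= t for t in db) / len(db)

    def confidence(x, y, db):
        """P(Y | X): fraction of transactions with X that also contain Y."""
        return support(x | y, db) / support(x, db)

    x, y = {"computer"}, {"software"}
    print(support(x | y, transactions))    # 0.5   -- support of X -> Y
    print(confidence(x, y, transactions))  # 0.667 -- confidence of X -> Y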

  19. Distances between data
  • Distance is a measure of dissimilarity between data items. It satisfies:
    d(i,j) >= 0; d(i,j) = d(j,i); d(i,j) <= d(i,k) + d(k,j)
  • Euclidean distance between <x1, x2, ..., xk> and <y1, y2, ..., yk>:
    d(x,y) = sqrt((x1-y1)^2 + ... + (xk-yk)^2)
  • Standardize variables by dividing each xi by the standard deviation of X
  • Covariance(X,Y) = (1/k) * Σi (xi - mean(X)) * (yi - mean(Y))
  • Boolean variables and their distances
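
A minimal sketch of these three quantities in plain Python (nothing beyond the standard math module):

    import math

    def euclidean(x, y):
        """Euclidean distance between two k-dimensional points."""
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def standardize(xs):
        """Divide each value by the standard deviation of the variable."""
        mean = sum(xs) / len(xs)
        sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
        return [x / sd for x in xs]

    def covariance(xs, ys):
        """Cov(X, Y) = (1/k) * sum_i (x_i - mean(X)) * (y_i - mean(Y))."""
        k = len(xs)
        mx, my = sum(xs) / k, sum(ys) / k
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / k

    print(euclidean([0, 0], [3, 4]))   # 5.0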

  20. Data Mining Tasks
  • Outlier analysis
    • Outlier: a data object that does not comply with the general behavior of the data
    • It can be treated as noise or an exception, but it is quite useful in fraud detection and rare-event analysis
  • Trend and evolution analysis
    • Trend and deviation: regression analysis
    • Sequential pattern mining, periodicity analysis
    • Similarity-based analysis
  • Other pattern-directed or statistical analyses

  21. Are All the "Discovered" Patterns Interesting?
  • A data mining system/query may generate thousands of patterns; not all of them are interesting.
    • Suggested approach: human-centered, query-based, focused mining
  • Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates a hypothesis that the user seeks to confirm
  • Objective vs. subjective interestingness measures:
    • Objective: based on statistics and the structure of patterns, e.g., support, confidence, etc.
    • Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.

  22. Are All the "Discovered" Patterns Interesting? - Example

              coffee   no coffee   total
    tea         20         5         25
    no tea      70         5         75
    total       90        10        100

  • The conditional probability that one who buys coffee also buys tea is 20/90 = 2/9.
  • The conditional probability that one who buys tea also buys coffee is 20/25 = 0.8.
  • However, the (unconditional) probability of buying coffee is 90/100 = 0.9.
  • So, is the inference "if a customer buys tea, she also buys coffee" significant? Are buying tea and buying coffee independent activities?
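
One standard way to answer the independence question is the lift of the rule tea -> coffee, computed from the table's counts (a minimal sketch):

    # Lift of tea -> coffee: P(tea and coffee) / (P(tea) * P(coffee)).
    n = 100
    p_coffee = 90 / n           # 0.9
    p_tea = 25 / n              # 0.25
    p_tea_coffee = 20 / n       # 0.2

    lift = p_tea_coffee / (p_tea * p_coffee)
    print(lift)  # ≈ 0.89 < 1: tea buyers are slightly *less* likely than
                 # average to buy coffee, so the 0.8 confidence is misleading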

  23. How to Measure Interestingness
  • Rule interest: RI = |X ∩ Y| - |X||Y|/N
  • Support and confidence of X -> Y: support = |X ∩ Y|/N, confidence = |X ∩ Y|/|X|
  • Chi-square: (|XY| - E(|XY|))^2 / E(|XY|)
  • J-measure: J(X->Y) = P(Y) * ( P(X|Y)*log(P(X|Y)/P(X)) + (1-P(X|Y))*log((1-P(X|Y))/(1-P(X))) )
  • Sufficiency(X->Y) = P(X|Y)/P(X|¬Y); Necessity(X->Y) = P(¬X|Y)/P(¬X|¬Y)
  • The interestingness of Y -> X is NC = 1 - Necessity(X->Y)*P(Y) when Necessity(X->Y) < 1, and 0 otherwise
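
As a sketch, two of these measures implemented directly from the formulas above (the counts |X|, |Y|, |XY| and the total N are assumed given; the example values reuse the coffee/tea table):

    import math

    def rule_interest(n_xy, n_x, n_y, n):
        """RI = |XY| - |X||Y|/N; zero when X and Y are independent."""
        return n_xy - n_x * n_y / n

    def j_measure(p_y, p_x, p_x_given_y):
        """J(X->Y) = P(Y) * ( P(X|Y) log(P(X|Y)/P(X))
                            + (1-P(X|Y)) log((1-P(X|Y))/(1-P(X))) )"""
        t1 = p_x_given_y * math.log(p_x_given_y / p_x)
        t2 = (1 - p_x_given_y) * math.log((1 - p_x_given_y) / (1 - p_x))
        return p_y * (t1 + t2)

    # tea -> coffee on the table above: |XY|=20, |tea|=25, |coffee|=90, N=100
    print(rule_interest(20, 25, 90, 100))  # -2.5 (negative correlation)
    # J for X=coffee, Y=tea: P(Y)=0.25, P(X)=0.9, P(X|Y)=0.8
    print(j_measure(0.25, 0.9, 0.8))       # ≈ 0.011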

  24. Can We Find All and Only Interesting Patterns?
  • Find all the interesting patterns: completeness
    • Can a data mining system find all the interesting patterns?
    • Association vs. classification vs. clustering
  • Search for only the interesting patterns: optimization
    • Can a data mining system find only the interesting patterns?
    • Approaches
      • First generate all the patterns and then filter out the uninteresting ones.
      • Generate only the interesting patterns (mining query optimization)

  25. Clustering
  • Partition the data set into clusters, so that only the cluster representations need be stored
  • Can be very effective if the data is clustered, but not if the data is "smeared"
  • Clustering can be hierarchical, and clusters can be stored in multi-dimensional index tree structures
  • There are many choices of clustering definitions and clustering algorithms.

  26. Example: Clusters and Outliers
  [Figure: a scatter plot of points forming several clusters, with a few outlier points lying far from any cluster.]

  27. Sampling
  • Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
  • Choose a representative subset of the data
    • Simple random sampling may perform very poorly in the presence of skew
  • Develop adaptive sampling methods
    • Stratified sampling:
      • Approximate the percentage of each class (or subpopulation of interest) in the overall database
      • Used in conjunction with skewed data
  • Sampling may not reduce database I/Os (pages are read one at a time).

  28. Sampling from raw data: SRSWOR (simple random sampling without replacement) vs. SRSWR (simple random sampling with replacement).
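
Both schemes map directly onto Python's standard library; a minimal sketch (the data list is a stand-in for the raw data):

    import random

    data = list(range(1000))   # stand-in for the raw data

    # SRSWOR: each record can appear at most once in the sample.
    srswor = random.sample(data, k=100)

    # SRSWR: the same record may be drawn more than once.
    srswr = random.choices(data, k=100)

    print(len(set(srswor)), len(set(srswr)))  # 100, usually < 100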

  29. Cluster/stratified sampling: the raw data is divided into clusters or strata, and a sample is drawn from each group.

  30. Discretization
  • Three types of attributes:
    • Nominal: values from an unordered set
    • Ordinal: values from an ordered set
    • Continuous: real numbers
  • Discretization: divide the range of a continuous attribute into intervals
    • Some classification algorithms only accept categorical attributes
    • Reduces data size
    • Prepares for further analysis

  31. Discretization
  • Discretization reduces the number of values of a given continuous attribute by dividing its range into intervals. Interval labels can then be used to replace actual data values.

  32. Discretization (generic loop)
  1. Sort the attribute values.
  2. Select a cut point.
  3. Split (or merge) intervals at that point.
  4. Evaluate the quality measure.
  5. If the measure is satisfied or the stopping condition holds, stop (DONE); otherwise go back to step 2.

  33. Discretization • Dynamic vs Static • Local vs Global • Top-Down vs Bottom-Up • Direct vs Incremental

  34. Discretization – Quality Evaluation • Total number of Intervals • The Number of Inconsistencies • Predictive Accuracy • Complexity

  35. Discretization - Binning
  • Equal-width: the range between the minimum and maximum values is split into intervals of equal width
  • Equal-frequency: each bin contains approximately the same number of data points
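
A minimal sketch of both binning schemes (the data values are made up):

    def equal_width_bins(values, k):
        """Split [min, max] into k intervals of equal width."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / k
        # clamp so the maximum value falls into the last bin
        return [min(int((v - lo) / width), k - 1) for v in values]

    def equal_frequency_bins(values, k):
        """Give each bin approximately the same number of data points."""
        order = sorted(range(len(values)), key=lambda i: values[i])
        bins = [0] * len(values)
        for rank, i in enumerate(order):
            bins[i] = rank * k // len(values)
        return bins

    data = [1, 2, 2, 3, 5, 8, 13, 40]
    print(equal_width_bins(data, 3))      # [0, 0, 0, 0, 0, 0, 0, 2]
    print(equal_frequency_bins(data, 3))  # [0, 0, 0, 1, 1, 1, 2, 2]

Note how the single extreme value 40 pushes almost everything into the first equal-width bin, which is the classic weakness of that scheme.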

  36. Entropy-Based Discretization
  • Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
    E(S,T) = (|S1|/|S|) * Ent(S1) + (|S2|/|S|) * Ent(S2)
  • The boundary T that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
  • The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., until the information gain Ent(S) - E(S,T) falls below a threshold δ.
  • Experiments show that it may reduce data size and improve classification accuracy.
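
A sketch of one binary split step (the values and class labels below are illustrative; the recursion and the stopping test are omitted):

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def best_split(values, labels):
        """Boundary T minimizing (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2)."""
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best_t, best_e = None, float("inf")
        for i in range(1, n):
            t = (pairs[i - 1][0] + pairs[i][0]) / 2   # candidate midpoint
            left = [lab for _, lab in pairs[:i]]
            right = [lab for _, lab in pairs[i:]]
            e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if e < best_e:
                best_t, best_e = t, e
        return best_t, best_e

    vals = [1, 2, 3, 10, 11, 12]
    labs = ["a", "a", "a", "b", "b", "b"]
    print(best_split(vals, labs))   # (6.5, 0.0) -- a perfect binary split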

  37. Data Mining Primitives, Languages, and System Architectures • Data mining primitives: What defines a data mining task? • A data mining query language • Design graphical user interfaces based on a data mining query language • Architecture of data mining systems

  38. Why Data Mining Primitives and Languages?
  • Data mining should be an interactive process
    • The user directs what is to be mined
  • Users must be provided with a set of primitives to be used to communicate with the data mining system
  • Incorporating these primitives in a data mining query language
    • More flexible user interaction
    • Foundation for the design of graphical user interfaces
    • Standardization of data mining industry and practice

  39. What Defines a Data Mining Task ? • Task-relevant data • Type of knowledge to be mined • Background knowledge • Pattern interestingness measurements • Visualization of discovered patterns

  40. Task-Relevant Data (Minable View) • Database or data warehouse name • Database tables or data warehouse cubes • Condition for data selection • Relevant attributes or dimensions • Data grouping criteria

  41. Types of knowledge to be mined • Characterization • Discrimination • Association • Classification/prediction • Clustering • Outlier analysis • Other data mining tasks

  42. A Data Mining Query Language (DMQL)
  • Motivation
    • A DMQL can provide the ability to support ad-hoc and interactive data mining
    • By providing a standardized language like SQL
      • Hope to achieve an effect similar to the one SQL had on relational databases
      • Foundation for system development and evolution
      • Facilitates information exchange, technology transfer, commercialization, and wide acceptance
  • Design
    • DMQL is designed with the primitives described earlier

  43. Syntax for DMQL • Syntax for specification of • task-relevant data • the kind of knowledge to be mined • concept hierarchy specification • interestingness measure • pattern presentation and visualization • Putting it all together — a DMQL query

  44. Syntax for task-relevant data specification
  • use database database_name, or use data warehouse data_warehouse_name
  • from relation(s)/cube(s) [where condition]
  • in relevance to att_or_dim_list
  • order by order_list
  • group by grouping_list
  • having condition

  45. Specification of task-relevant data, e.g. (illustrative database, table, and attribute names):
    use database AllElectronics_db
    in relevance to C.age, C.income, I.name, I.price
    from customer C, item I, purchases P
    where I.item_ID = P.item_ID and C.cust_ID = P.cust_ID

  46. Syntax for specifying the kind of knowledge to be mined
  • Characterization
    Mine_Knowledge_Specification ::= mine characteristics [as pattern_name] analyze measure(s)
  • Discrimination
    Mine_Knowledge_Specification ::= mine comparison [as pattern_name] for target_class where target_condition {versus contrast_class_i where contrast_condition_i} analyze measure(s)
  • Association
    Mine_Knowledge_Specification ::= mine associations [as pattern_name]

  47. Syntax for specifying the kind of knowledge to be mined (cont.)
  • Classification
    Mine_Knowledge_Specification ::= mine classification [as pattern_name] analyze classifying_attribute_or_dimension
  • Prediction
    Mine_Knowledge_Specification ::= mine prediction [as pattern_name] analyze prediction_attribute_or_dimension {set {attribute_or_dimension_i = value_i}}

  48. Syntax for concept hierarchy specification
  • To specify which concept hierarchies to use:
    use hierarchy <hierarchy> for <attribute_or_dimension>
  • We use different syntax to define different types of hierarchies:
    • Schema hierarchies:
      define hierarchy time_hierarchy on date as [date, month, quarter, year]
    • Set-grouping hierarchies:
      define hierarchy age_hierarchy for age on customer as
        level1: {young, middle_aged, senior} < level0: all
        level2: {20, ..., 39} < level1: young
        level2: {40, ..., 59} < level1: middle_aged
        level2: {60, ..., 89} < level1: senior

  49. Syntax for concept hierarchy specification (cont.)
  • Operation-derived hierarchies:
    define hierarchy age_hierarchy for age on customer as
      {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
  • Rule-based hierarchies:
    define hierarchy profit_margin_hierarchy on item as
      level_1: low_profit_margin < level_0: all
        if (price - cost) < $50
      level_1: medium_profit_margin < level_0: all
        if ((price - cost) > $50) and ((price - cost) <= $250)
      level_1: high_profit_margin < level_0: all
        if (price - cost) > $250

  50. Syntax for interestingness measure specification
  • Interestingness measures and thresholds can be specified by the user with the statement:
    with <interest_measure_name> threshold = threshold_value
  • Example:
    with support threshold = 0.05
    with confidence threshold = 0.7
