
Association Rule Mining



Presentation Transcript


  1. Association Rule Mining ARM http://www.cs.ndsu.nodak.edu/~rahal/765/lectures/

  2. Lecture Outline • Data Mining and Knowledge Discovery • Market Basket Research Models • Association Rule Mining • Apriori • Rule Generation • Methods To Improve Apriori’s Efficiency • Vertical Data Representation

  3. What is Data Mining? • Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns and knowledge. • Valid: the patterns hold in general. • “Fargo is in Minnesota!” fails this test (Fargo is in North Dakota). • Novel: we did not know the pattern beforehand. • (lives in Fargo) → (lives in ND) is valid but not novel. • Useful: we can devise actions from the patterns (actionable). • Understandable: we can interpret and comprehend the patterns.

  4. What Motivated Data Mining? • As an evolution in the path of IT • 1-Data Collection and Database Creation • Primitive File Processing • 1960s and earlier • 2-Database Management Systems: • Hierarchical/Network/Relational database system • ERDs • SQL • Recovery and concurrency control in DBMSs • OLTP • 1970s-early 1980s

  5. 3.1-Advanced Database Systems • Object-oriented/object-relational databases • Application-oriented databases • Spatial, multimedia, scientific, etc. • Mid-1980s-present • 3.2-Web-based Database Systems • XML-based database systems • Web analysis and mining • Semantic Web (the whole web as a single XML database) • Mid-1990s-present

  6. 3.3-Data Warehousing and Data Mining • Multi-dimensional Data warehouse and OLAP technology • Data Mining and Knowledge Discovery • tools to assist people in their decision-making processes • Late 1980s-present

  7. Why Use Data Mining Today? • Market Competition Pressure! • “The secret of success is to know something that nobody else knows.” Aristotle Onassis • Wal-Mart vs. K-Mart • Right products, right place, right time, and right quantities • Personalization, CRM • Security, homeland defense • Analysis of important application data • Bioinformatics • Stock market data

  8. Human analysis skills are inadequate: • Volume and dimensionality of the data • High data growth rate • Storage • Computational power • Off-the-shelf software • Other factors

  9. Where Could All Of This Data Be Coming From? • Supermarket scanners • Preferred customer cards • Sunmart’s MoreCards • Credit card transactions • Call center records • ATM machines • Demographic data • Sensor networks • Cameras • Web server logs • Customer web site trails • Biological data (e.g. MicroArray Experiments for expression levels) • Image data

  10. Types Of Data/Information Repositories For Data Mining • By definition, data mining should be applicable to any kind of information repository • Flat files • Relational databases • data warehouses • transactional databases • Advanced database systems • object-oriented • Object-relational

  11. Application-oriented databases • Multimedia • Text • Image • Video • Audio • Heterogeneous databases • Appear as centralized • Independent components managing different parts of the data

  12. How Could We Describe Data? • Numerical: domain is ordered and can be represented on the continuous real line (e.g. age, income) • Continuous? • Nominal or categorical: domain is a finite set without any natural ordering (e.g. occupation, marital status, race) • Ordinal: domain is finite and ordered (e.g. grade scale, months in a year)

  13. The Knowledge Discovery Process • Broader than Data Mining • Steps: • Identify the problem • Data mining • Action • Evaluation and measurement • Deployment and integration into real-life processes and/or applications

  14. The Data Mining Step in More Detail • Cleaning and integration of various data sources • Remove noise and outliers • Missing values (e.g. null values) • Noisy data (errors) • Inconsistent data from integration (e.g. FirstName vs. F_Name) • Selection and transformation of relevant data into appropriate forms • Focus on fields of interest (e.g. the effect of education on salary) • Create common units (e.g. height recorded in both cm and inches) • Generate new fields • Discovery of interesting patterns from the data • Pattern evaluation to identify the interesting patterns based on some predefined measures • Knowledge presentation to communicate the mined knowledge to the user, mostly through visualization techniques, to provide a better view

  15. This process can be repeated as needed • Data mining systems are expected to handle large amounts of data • Analysis of small datasets is sometimes called • machine learning • SDA – Statistical data analysis. • In other words, data mining must be scalable to large data sets • Scalability and efficiency

  16. [Diagram of the knowledge discovery process: Original Data → cleaning and integration → Preprocessed Data → selection and transformation → Target Data → discovery (data mining) → Patterns → pattern evaluation → Knowledge → knowledge presentation.]

  17. Data Mining Tasks • Characterization • The process of summarizing the general characteristics and features of a specific class of data (usually referred to as the target class) • E.g. characterizing the items in a store whose sales have decreased by 50% over a certain period of time. • There may be some common characteristics shared by all those items which we would like to uncover • E.g. all produced by a no-longer-trusted producer

  18. Discrimination • Discrimination is very similar to characterization in that it reveals the characteristics of a target class in comparison to the characteristics of one or more other classes. • The target and contrasting classes are specified by the user, and their data is retrieved from the database before the discrimination process starts. • As an example, a user might want to discriminate between the characteristics of the items in a store whose • sales have increased by 10% over a certain period of time this year • sales have increased by 10% over the same period of time last year.

  19. Association Rule Mining • The process of discovering association rules among attribute values that exist in a given set of data. • Market basket research (MBR), where users are usually interested in mining associations between items in a store by using daily transactions. • An example of a rule might be diapers → beer, meaning that customers buying diapers are very likely to buy beer. • This gives us a good reason to place diapers next to beer so as to increase sales • Sometimes people wonder about the strange placement of products in large stores • E.g. maternity products placed near infant products

  20. Classification • The process of using a set of training data with known class labels to come up with a model (or function) that predicts the unknown class label of new samples. • An example of classification can be found in the banking industry • Customer characteristics like age, annual income, marital status, etc. are used to predict the possibility of approving loan applications (the loan status is the class label). • In an initial step, a dataset containing a certain number of customers with known class labels is used to create a classifier (e.g. an ANN) which can then be used to predict the class label of a new application • Classification is very similar to regression except that the latter is applicable to numerical data while the former is applicable to categorical and numerical data.

  21. Clustering • This is the process of grouping data objects into clusters such that • intra-cluster similarity is maximized • inter-cluster similarity is minimized • In other words, objects within the same cluster are very similar and objects in different clusters are not. • E.g. studying collective properties of people at different income levels • Cluster people based on income • Study common properties within clusters (e.g. lower income correlated with lower education)

  22. Outlier detection • Through clustering, we can find groups of objects that behave similarly • sometimes, we are only interested in those objects that lie scattered around without behaving similarly to any pattern existing in the data. • Those objects are known as outliers as they do not adhere to the patterns defined by the rest of the objects in the dataset. • Outlier detection is usually desired in applications where abnormal behavior is • of interest such as intrusion detection in networks or terrorist detection in ports of entry • not of interest, such as when we clean a dataset from noise

  23. Similarity searches • Given a database of objects and a “query” object • Find all similar objects (neighbours) • E.g. Google search • Given a query, which is a small document • Find all similar documents • Rank-order them

  24. Final Notes on Data Mining • Forms the center of a set of research fields and applications dealing with data analysis: • databases, statistics, machine learning, artificial intelligence, information sciences/technology and the like • At the same time it introduces many new features, rendering it a separate science of its own • E.g. scalability to large datasets

  25. Not all types of patterns mined by data mining systems are interesting. • Subjective and objective interestingness measures are used to filter them.

  26. Market Basket Research • We will mainly use the Market Basket Research (MBR) application in our ARM description • A large set of items, e.g. products sold in a supermarket. • A large set of transactions or baskets, each of which contains a small set of the items (called an itemset) bought by a customer during a single visit to a store.

  27. The Set Model • Data is organized as a "TRANSACTION TABLE" with 2 attributes: TT(Tid, Itemset) • Example TT: Tid 1: {a,b,c}; 2: {a,b,d}; 3: {a,b,e}; 4: {a,c,d}; 5: {a,c,e}; 6: {a,d,e}; 7: {b,c,d}; 8: {b,c,e}; 9: {b,d,e}; 10: {c,d,e} • A transaction is a customer transaction at a cash register. • Each customer is given an identifier, Tid, for every transaction made • Itemset is the set of items in the customer's "basket". • Note that tuples in TT are not "flat" (each itemset is a "set") • i.e. not relational (why?) • A transformation can be made to equivalent but normalized models

  28. The Normalized Set Model • Data is organized as a "NORMALIZED TRANSACTION TABLE" with 2 attributes: NTT(Tid, Iid) • An itemset is the group of items belonging to the same transaction • The TT(Tid, Itemset) can be "transformed" to NTT(Tid, Iid) and vice versa • Could be stored in a relational database • Very deep: each transaction expands to one tuple per item (e.g. 10 to 30 tuples per basket)

  29. The Boolean Model: "Boolean Transaction Table" • BTT(Tid, Item-1, Item-2, ..., Item-n) • Tid is a transaction identifier • Each column is a particular item (1 column for each item) • a 1 → the item is in the basket • a 0 → the item is not in the basket • TT, NTT and BTT are equivalent • This is the model most commonly chosen for ARM
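
The three models carry the same information; only the layout differs. Below is a minimal plain-Python sketch (variable names are ours, not from the lecture) converting a small TT into its NTT and BTT equivalents.

    # Set model: TT(Tid, Itemset) -- each row stores a whole (non-flat) itemset.
    tt = {1: {"a", "b", "c"}, 2: {"a", "b", "d"}, 3: {"a", "b", "e"}}

    # Normalized set model: NTT(Tid, Iid) -- one (transaction, item) pair per row.
    ntt = [(tid, item) for tid, itemset in tt.items() for item in sorted(itemset)]

    # Boolean model: BTT(Tid, Item-1, ..., Item-n) -- one 0/1 column per item.
    items = sorted({item for itemset in tt.values() for item in itemset})
    btt = {tid: {item: int(item in itemset) for item in items}
           for tid, itemset in tt.items()}

    print(ntt[:4])  # [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a')]
    print(btt[1])   # {'a': 1, 'b': 1, 'c': 1, 'd': 0, 'e': 0}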

  30. Association Rule Mining • Association Rule Mining (ARM) finds interesting associations and/or correlation relationships among large sets of data items. • Association rules provide information in the form of "if-then" statements. • These rules are • computed from the data • unlike the if-then rules of logic, association rules are probabilistic in nature • strength could be measured

  31. An association rule defines a relationship of the form: • A → C (if A then C) • Read as "A implies C", where A and C are sets of items in a data set. • A is called the antecedent and C the consequent • Given a database D, ARM finds all such association rules

  32. D = a data set comprising n records (transactions) and m Boolean-valued attributes (BTT model) • I = the set of m attributes, {i1, i2, …, im}, represented in D • Itemset = some subset of I; each record in D is an itemset • For all rules A → C: A ⊆ I, C ⊆ I, and A ∩ C = ∅ (A and C are disjoint).

  33. An Example DB • Items: m = 5, I = {a,b,c,d,e} • Transactions: n = 10 • D = {{a,b,c}, {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {a,d,e}, {b,c,d}, {b,c,e}, {b,d,e}, {c,d,e}} (Tids 1 through 10, in order)

  34. Support of an Itemset • Support of an itemset IS is the number of transactions in D containing all items in IS (e.g. the support of IS = {a,b} in the example DB is 3: transactions 1, 2, and 3) • Given a support threshold s, sets of items that appear in at least s transactions are called frequent itemsets • The process is called frequent itemset mining

  35. Items={m=milk, c=cheese, p=pepsi, b=bread, j=juice}. • Support threshold = 3 transactions. T1 = {m, c, b} T2 = {m, p, j} T3 = {m, b} T4 = {c, j} T5 = {m, p, b} T6 = {m, c, b, j} T7 = {c, b, j} T8 = {b, c} • Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
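
As a sanity check of the list above, the following sketch brute-forces every itemset of the eight baskets and keeps those appearing in at least 3 transactions (brute force is only feasible here because there are just 5 items; Apriori, introduced later, avoids this full enumeration).

    from itertools import combinations

    transactions = [
        {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
        {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
    ]
    min_support = 3  # absolute count: itemset must appear in at least 3 baskets

    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                frequent.append((candidate, support))

    print(frequent)
    # only {m}, {c}, {b}, {j}, {m,b}, {c,b}, {j,c} (with their counts) survive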

  36. Support and Confidence of a Rule A → C • Support of an itemset IS is the number of transactions containing all items in IS • Itemsets are used to derive rules • Support of a rule R: A → C is the number of transactions in D containing all items in A ∪ C • A rule whose support meets the threshold is a frequent rule; support measures the significance of a rule • Confidence of a rule is Support(R) / Support(A) • A rule whose confidence meets the threshold is a confident rule; confidence measures the strength of a rule • Out of the transactions containing A, how many also contain C? • Frequent + Confident → Strong

  37. Example B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c} • An association rule: {m, b} → c • What is the confidence? • support(m, b, c) = 2 • support(m, b) = 4 • Confidence = 2/4 = 50% • And what does that mean? • 50% of the baskets that contain {m, b} also contain c
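
A short computation reproduces the slide's numbers; support and confidence are the only ingredients (the helper name below is ours).

    baskets = [
        {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
        {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
    ]

    def support(itemset):
        """Number of baskets containing every item of the itemset."""
        return sum(1 for b in baskets if set(itemset) <= b)

    antecedent, consequent = {"m", "b"}, {"c"}
    rule_support = support(antecedent | consequent)   # support(m, b, c) = 2
    confidence = rule_support / support(antecedent)   # 2 / 4 = 0.5
    print(rule_support, confidence)                   # 2 0.5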

  38. More On The Problem Definition • ARM is a two-step process: • Find all frequent itemsets: by definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support threshold • Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy the minimum support and minimum confidence thresholds • A typical question: “find all strong association rules with support ≥ s and confidence ≥ c” • Given a database D • Find all frequent itemsets (F) using s • Produce all strong association rules using c

  39. Finding F is the most computationally expensive part; once we have the frequent itemsets, generating the association rules is straightforward

  40. The Anti-Monotonicity (downward closure) of Support • Naïve approach: generate all subset itemsets of I and test each • The number of potential itemsets is 2^m • If m = 5, #potential itemsets = 32 • If m = 20, #potential itemsets = 1,048,576 • Imagine what supermarkets have: m = 10,000 or more • Conclusion? • The naïve approach is infeasible • Breakthrough: if an itemset A has support greater than s then all its subsets must also have support greater than s • E.g. if {a,b} is frequent then both {a} and {b} must be frequent • Alternatively, if an itemset A is not frequent then none of its supersets can be frequent • Proposed by Agrawal in 1993 at the IBM Almaden Research Center; it started ARM and the field of data mining
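
The property is easy to verify mechanically. The sketch below checks it over every itemset of the 10-transaction example database (a demonstration on this one dataset, not a proof).

    from itertools import combinations

    # The example database from slide 33.
    D = [set("abc"), set("abd"), set("abe"), set("acd"), set("ace"),
         set("ade"), set("bcd"), set("bce"), set("bde"), set("cde")]

    def support(itemset):
        return sum(1 for t in D if set(itemset) <= t)

    # Every (k-1)-subset of a k-itemset is at least as frequent as the itemset.
    for size in range(2, 6):
        for itemset in combinations("abcde", size):
            for subset in combinations(itemset, size - 1):
                assert support(subset) >= support(itemset)
    print("downward closure holds on D")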

  41. Apriori • Proposed by Agrawal • Uses the downward closure of support to reduce the number of itemsets that need to be counted (called candidate frequent itemsets C) • Works on a level-by-level basis (i.e. uses the frequent itemsets L from the previous level to generate the frequent itemsets at this level) • Ck and Lk • At every level k it generates Ck from Lk-1 and counts their frequency in the database

  42. Two steps are performed to generate Ck • Join step: Ck is generated by joining Lk-1 with itself • Prune step: all itemsets in Ck whose (k-1)-subsets are not ALL frequent (i.e. present in Lk-1) are removed • How many subsets does an itemset of size k have? • 2^k (e.g. k = 3 gives 8 subsets) • How many subsets of size k-1 does an itemset of size k have? • k

  43. The Apriori Algorithm • Pseudo-code (Ck: candidate frequent itemsets of size k; Lk: frequent itemsets of size k):
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  remove any itemset from Ck+1 that has at least one infrequent k-subset;
  for each transaction t in the database do
    increment the counts of all candidates in Ck+1 that are contained in t
    (count the frequency of each itemset in Ck+1);
  Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
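
A compact runnable rendering of this pseudo-code in Python (our own sketch, using absolute support counts; function and variable names are not from the lecture):

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return a dict mapping each frequent itemset (frozenset) to its support."""
        # L1: count single items.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent = dict(frequent)

        k = 1
        while frequent:
            # Join step: merge frequent k-itemsets whose union has k+1 items.
            prev = list(frequent)
            candidates = {prev[i] | prev[j]
                          for i in range(len(prev)) for j in range(i + 1, len(prev))
                          if len(prev[i] | prev[j]) == k + 1}
            # Prune step: drop candidates with an infrequent k-subset.
            candidates = {c for c in candidates
                          if all(frozenset(s) in frequent for s in combinations(c, k))}
            # One database scan per level to count the surviving candidates.
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            frequent = {c: n for c, n in counts.items() if n >= min_support}
            all_frequent.update(frequent)
            k += 1
        return all_frequent

    # On the 8 baskets of slide 35 with min_support = 3 this returns
    # {m}, {c}, {b}, {j}, {m,b}, {c,b}, {c,j} together with their counts.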

  44. Example of Generating Candidates • Suppose the items in all itemsets are listed in some order • L3 = {abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • Combine any two itemsets in Lk if they only differ by the last item • abcd from abc and abd • acde from acd and ace • C4 = {abcd, acde} • Pruning: • abcd: its 3-subsets abc, abd, acd, bcd are all in L3, so it is kept • acde: its 3-subsets are acd, ace, ade, cde; ade and cde are not in L3, so acde is pruned • C4 = {abcd}
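
The same join and prune steps can be traced in a few lines of Python (a sketch reproducing just this example; tuple ordering stands in for the "listed in some order" assumption):

    from itertools import combinations

    L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
          ("a", "c", "e"), ("b", "c", "d")]

    # Join: merge two ordered itemsets that agree on all but the last item.
    joined = [p[:-1] + (p[-1], q[-1]) for p in L3 for q in L3
              if p[:-1] == q[:-1] and p[-1] < q[-1]]
    print(joined)  # [('a', 'b', 'c', 'd'), ('a', 'c', 'd', 'e')]

    # Prune: keep a candidate only if every 3-subset of it is in L3.
    C4 = [c for c in joined if all(s in L3 for s in combinations(c, 3))]
    print(C4)      # [('a', 'b', 'c', 'd')] -- acde is dropped (ade, cde not in L3)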

  45. How To Generate Candidates? (Lk → Ck+1) • Step 1: self-joining Lk
insert into Ck+1
select p.item1, p.item2, …, p.itemk, q.itemk
from Lk p, Lk q
where p.item1 = q.item1, …, p.itemk-1 = q.itemk-1, p.itemk < q.itemk
• Step 2: pruning
forall itemsets c in Ck+1 do
  forall k-subsets s of c do
    if (s is not in Lk) then delete c from Ck+1

  46. An Example – Support 2 • [Diagram: starting from database D, Apriori scans D to count the candidates C1 and keeps the frequent itemsets L1; it joins L1 to form C2, scans D, and keeps L2; it then joins L2 to form C3, scans D, and keeps L3.]

  47. Generation of Association Rules • Given all frequent itemsets • Every frequent itemset I of size ≥ 2 is divided into a candidate head Y and a body X • such that X ∩ Y = ∅ • This process starts with Y = {}, resulting in the rule I → {} • always holds with 100% confidence (why?) • After that, the algorithm iteratively generates candidate heads of size k + 1, starting with k = 0
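
A minimal sketch of this step, assuming the support counts of all frequent itemsets have already been computed (here hard-coded from the milk/bread example with minimum support 3; min_confidence and the variable names are our own):

    from itertools import combinations

    support = {
        frozenset("m"): 5, frozenset("c"): 5, frozenset("b"): 6, frozenset("j"): 4,
        frozenset("mb"): 4, frozenset("cb"): 4, frozenset("cj"): 3,
    }
    min_confidence = 0.5

    rules = []
    for itemset in [s for s in support if len(s) >= 2]:
        # Split I into a non-empty body X and head Y = I \ X; the trivial split
        # Y = {} (the rule I -> {}) is skipped since it always holds.
        for size in range(1, len(itemset)):
            for body in combinations(itemset, size):
                body = frozenset(body)
                head = itemset - body
                confidence = support[itemset] / support[body]
                if confidence >= min_confidence:
                    rules.append((set(body), set(head), round(confidence, 2)))

    print(rules)
    # e.g. ({'m'}, {'b'}, 0.8), ({'b'}, {'m'}, 0.67), ({'c'}, {'j'}, 0.6), ...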

  48. Is Apriori Fast Enough? Performance Bottlenecks • The core of the Apriori algorithm: • Uses frequent (k – 1)-itemsets to generate candidate frequent k-itemsets • Uses database scans to collect counts for the candidate itemsets – 1 scan per level • The bottleneck of Apriori: candidate generation • Huge candidate sets: • 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets • To discover a frequent pattern of size 100, e.g. {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates • Multiple scans of the database: • Needs n scans, where n is the length of the longest pattern • One scan per level
