Association Rules Mining with SQL

Association Rules Mining with SQL Kirsten Nelson Deepen Manek November 24, 2003

Organization of Presentation • Overview – Data Mining and RDBMS • Loosely-coupled data and programs • Tightly-coupled data and programs • Architectural approaches • Methods of writing efficient SQL • Candidate generation, pruning, support counting • K-way join, SubQuery, GatherJoin, Vertical, Hybrid • Integrating taxonomies • Mining sequential patterns

Early data mining applications • Most early mining systems were developed largely on file systems, with specialized data structures and buffer management strategies devised for each • All data was read into memory before beginning computation • This limits the amount of data that can be mined

Advantage of SQL and RDBMS • Make use of database indexing and query processing capabilities • More than a decade spent on making these systems robust, portable, scalable, and concurrent • Exploit underlying SQL parallelization • For long-running algorithms, use checkpointing and space management

Use of Database in Data Mining • “Loose coupling” of application and data • How would you write an Apriori program? • Use SQL statements in an application • Use a cursor interface to read through records sequentially for each pass • Still two major performance problems: • Copying of record from database to memory • Process context switching for each record retrieved

Tightly-coupled applications • Push computations into the database system to avoid performance degradation • Take advantage of user-defined functions (UDFs) • Does not require changes to database software • Two types of UDFs we will use: • Ones that are executed only a few times, regardless of the number of rows • Ones that are executed once for each selected row

Tight-coupling using UDFs Procedure TightlyCoupledApriori(): begin exec sql connect to database; exec sql select allocSpace() into :blob from onerecord; exec sql select * from sales where GenL1(:blob, TID, ITEMID) = 1; notDone := true;

Tight-coupling using UDFs while notDone do { exec sql select aprioriGen(:blob) into :blob from onerecord; exec sql select * from sales where itemCount(:blob, TID, ITEMID)=1; exec sql select GenLk(:blob) into :notDone from onerecord }

Tight-coupling using UDFs exec sql select getResult(:blob) into :resultBlob from onerecord; exec sql select deallocSpace(:blob) from onerecord; compute Answer using resultBlob; end

Methodology • Comparison done with Association Rules against IBM DB2 • Only consider generation of frequent itemsets using Apriori algorithm • Five alternatives considered: • Loose-coupling through SQL cursor interface – as described earlier • UDF tight-coupling – as described earlier • Stored-procedure to encapsulate mining algorithm • Cache-mine – caching data and mining on the fly • SQL implementations to force processing in the database • Consider two classes of implementations • SQL-92 – four different implementations • SQL-OR (with object relational extensions) – six implementations

Architectural Options • Stored procedure • Apriori algorithm encapsulated as a stored procedure • Implication: runs in the same address space as the DBMS • Mined results stored back into the DBMS. • Cache-mine • Variation of stored-procedure • Read entire data once from DBMS, temporarily cache data in a lookaside buffer on a local disk • Cached data is discarded when execution completes • Disadvantage – requires additional disk space for caching • Use Intelligent Miner’s “space” option

Terminology • Use the following terminology • T: table of items • {tid,item} pairs • Data is normally sorted by transaction id • Ck: candidate k-itemsets • Obtained from joining and pruning frequent itemsets from previous iteration • Fk: frequent items sets of length k • Obtained from Ck and T

Candidate Generation in SQL – join step • Generate Ck from Fk-1 by joining Fk-1 with itself insert into Ck select I1.item1,…,I1.itemk-1,I2.itemk-1 from Fk-1 I1,Fk-1 I2 where I1.item1 = I2.item1 and … I1.itemk-2 = I2.itemk-2 and I1.itemk-1 < I2.itemk-1

Candidate Generation Example • F3 is {{1,2,3},{1,2,4},{1,3,4},{1,3,5},{2,3,4}} • C4 is {{1,2,3,4},{1,3,4,5}} Table F3 (I1) Table F3 (I2)

Pruning • Modify candidate generation algorithm to ensure all k subsets of Ck of length (k-1) are in Fk-1 • Do a k-way join, skipping itemn-2 when joining with the nth table (2<n≤k) • Create primary index (item1, …, itemk-1) on Fk-1 to efficiently process k-way join • For k=4, this becomes insert into C4 select I1.item1, I1.item2, I1.item3,I2.item3 from F3 I1,F3 I2, F3 I3, F3 I4where I1.item1 = I2.item1 … and I1.item3 < I2.item3 and I1.item2 = I3.item1 and I1.item3 = I3.item2 and I2.item3 = I3.item3 and I1.item1 = I4.item1 and I1.item3 = I4.item2 and I2.item3 = I4.item3

Pruning Example • Evaluate join with I3 using previous example • C4 is {1,2,3,4} Table F3 (I1) Table F3 (I2) Table F3 (I3)

Support counting using SQL • Two different approaches • Use the SQL-92 standard • Use ‘standard’ SQL syntax such as joins and subqueries to find support of itemsets • Use object-relational extensions of SQL (SQL-OR) • User Defined Functions (UDFs) & table functions • Binary Large Objects (BLOBs)

Support Counting using SQL-92 • 4 different methods, two of which detailed in the papers • K-way Joins • SubQuery • Other methods not discussed because of unacceptable performance • 3-way join • 2 Group-Bys

SQL-92: K-way join • Obtain Fk by joining Ck with table T of (tid,item) • Perform group by on the itemset insert into Fk select item1,…,itemk,count(*) from Ck, T t1, …, T tk, where t1.item = Ck.item1, … , and tk.item = Ck.itemk and t1.tid = t2.tid … and tk-1.tid = tk.tid group by item1,…,itemk having count(*) > :minsup

K-way join example • C3={B,C,E} and minimum support required is 2 • Insert into F3 {B,C,E,2}

K-way join: Pass-2 optimization • When calculating C2, no pruning is required after we join F1 with itself • Don’t calculate and materialize C2- replace C2 in 2-way join algorithm with join of F1 with itself insert into F2 select I1.item1, I2.item1,count(*) from F1 I1, F1 I2, T t1, T t2 where I1.item1 < I2.item1 and t1.item = I1.item1 and t2.item = I2.item1 and t1.tid = t2.tid group by I1.item1,I2.item1 having count(*) > :minsup

SQL-92: SubQuery based • Split support counting into cascade of k subqueries • nth subquery Qn finds all tids that match the distinct itemsets formed by the first n items of Ck insert into Fk select item1, …, itemk, count(*) from (Subquery Qk) t Group by item1, item2 … , itemk having count(*) > :minsup Subquery Qn (for any n between 1 and k): select item1, …, itemn, tid from T tn, (Subquery Qn-1) as rn-1 (select distinct item1, …, itemn from CK) as dn where rn-1.item1 = dn.item1 and … and rn-1.itemn-1 = dn.itemn and rn-1.tid = tn.tid and tn.item = dn.itemn

Example of SubQuery based • Using previous example from class • C3 = {B,C,E}, minimum support = 2 • Q0: No subquery Q0 • Q1 in this case becomes select item1, tid From T t1, (select distinct item1from C3) as d1 where t1.item = d1.item1

Example of SubQuery based cnt’d • Q2 becomes select item1, item2, tid from T t2, (Subquery Q1) as r1, (select distinct item1, item2 from C3) as d2 where r1.item1 = d2.item1 and r1.tid = t2.tid and t2.item = d2.item2

Example of SubQuery based cnt’d • Q3 becomes select item1,item2,item3, tid from T t3, (Subquery Q2) as r2, (select distinct item1,item2,item3 from C3) as d3 where r2.item1 = d3.item1 and r2.item2 = d3.item2 and r2.tid = t3.tid and t3.item = d3.item3

Example of SubQuery based cnt’d • Output of Q3 is • Insert statement becomes insert into F3 select item1, item2, item3, count(*) from (Subquery Q3) t group by item1, item2 ,item3 having count(*) > :minsup • Insert the row {B,C,E,2} • For Q2, pass-2 optimization can be used

Performance Comparisons of SQL-92 approaches • Used Version 5 of DB2 UDB and RS/6000 Model 140 • 200 Mhz CPU, 256 MB main memory, 9 GB of disk space, Transfer rate of 8 MB/sec • Used 4 different item sets based on real-world data • Built the following indexes, which are not included in any cost calculations • Composite index (item1, …, itemk) on Ck • k different indices on each of the k items in Ck • (item,tid) and (tid,item) indexes on the data table T

Performance Comparisons of SQL-92 approaches • Best performance obtained by SubQuery approach • SubQuery was only comparable to loose-coupling in some cases, failing to complete in other cases • DataSet C, for support of 2%, SubQuery outperforms loose-coupling but decreasing support to 1%, SubQuery takes 10 times as long to complete • Lower support will increase the size of Ck and Fk at each step, causing the join to process more rows

Support Counting using SQL with object-relational extensions • 6 different methods, four of which detailed in the papers • GatherJoin • GatherCount • GatherPrune • Vertical • Other methods not discussed because of unacceptable performance • Horizontal • SBF

SQL Object-Relational Extension: GatherJoin • Generates all possible k-item combinations of items contained in a transaction and joins them with Ck • An index is created on all items of Ck • Uses the following table functions • Gather: Outputs records {tid,item-list}, with item-list being a BLOB or VARCHAR containing all items associated with the tid • Comb-K: returns all k-item combinations from the transaction • Output has k attributes T_itm1, …, T_itmk

GatherJoin insert into Fk select item1,…, itemk, count(*) from Ck, (select t2.T_itm1,…,t2.itmk from T, table(Gather(T.tid,T.item)) as t1, table(Comb-K(t1.tid,t1.item-list)) as t2) where t2.T_itm1 = Ck.item1 and … and t2.T_itmk = Ck.itemk group by Ck.item1,…,Ck.itemk having count(*) > :minsup

Example of GatherJoin • t1 (output from Gather) looks like: • t2 (generated by Comb-K from t1) will be joined with C3 to obtain F3 • 1 row from Tid 10 • 1 row from Tid 20 • 4 rows from Tid 30 • Insert {B,C,E,2}

GatherJoin: Pass 2 optimization • When calculating C2, no pruning is required after we join F1 with itself • Don’t calculate and materialize C2 - replace C2 with a join to F1 before the table function • Gather is only passed frequent 1-itemset rows insert into F2 select I1.item1, I2.item1, count(*) from F1 I1, (select t2.T_itm1,t2.T_itm2 from T, table(Gather(T.tid,T.item)) as t1, table(Comb-K(t1.tid,t1.item-list)) as t2 where T.item = I1.item1) group by t2.T_itm1,t2.T_itm2 having count(*) > :minsup

Variations of GatherJoin - GatherCount • Perform the GROUP BY inside the table function Comb-K for pass 2 optimization • Output of the table function Comb-K • Not the candidate frequent itemsets (Ck) • But the actual frequent itemsets (Fk) along with the corresponding support • Use a 2-dimensional array to store possible frequent itemsets in Comb-K • May lead to excessive memory use

Variations of GatherJoin - GatherPrune • Push the join with Ck into the table function Comb-K • Ck is converted into a BLOB and passed as an argument to the table function. • Will have to pass the BLOB for each invocation of Comb-K - # of rows in table T

SQL Object-Relational Extension: Vertical • For each item, create a BLOB containing the tids the item belongs to • Use function Gather to generate {item,tid-list} pairs, storing results in table TidTable • Tid-list are all in the same sorted order • Use function Intersect to compare two different tid-lists and extract common values • Pass-2 optimization can be used for Vertical • Similar to K-way join method • Upcoming example does not show optimization

Vertical insert into Fk select item1, …, itemk, count(tid-list) as cnt from (Subquery Qk) t where cnt > :minsup Subquery Qn (for any n between 2 and k) Select item1, …, itemn, Intersect(rn-1.tid-list, tn.tid-list) as tid-list from TidTable tn, (Subquery Qn-1) as rn-1 (select distinct item1, …, itemn from CK) as dn where rn-1.item1 = dn.item1 and … and rn-1.itemn-1 = dn.itemn-1 and tn.item = dn.itemn Subquery Q1: (select * from TidTable)

Example of Vertical • Using previous example from class • C3 = {B,C,E}, minimum support = 2 • Q1 is TidTable

Example of Vertical cnt’d • Q2 becomes Select item1, item2, Intersect(r1.tid-list, t2.tid-list) as tid-list from TidTable t2, (Subquery Q1) as r1 (select distinct item1, item2 from C3) as d2 where r1.item1 = d2.item1 and t2.item = d2.item2

Example of Vertical cnt’d • Q3 becomes select item1, item2, item3, intersect(r2.tid-list, t3.tid-list) as tid-list from TidTable t3, (Subquery Q2) as r2 (select distinct item1, item2, item3 from C3) as d3 where r2.item1 = d3.item1 and r2.item2 = d3.item2 and t3.item = d3.item3

Performance Comparisons using SQL-OR

Performance comparison of SQL object-relational approaches • Vertical has best overall performance, sometimes an order of magnitude better than other 3 approaches • Majority of time is transforming the data in {item,tid-list} pairs • Vertical spends too much time on the second pass • Pass-2 optimization has huge impact on performance of GatherJoin • For Dataset-B with support of 0.1 %, running time for Pass 2 went from 5.2 hours to 10 minutes • Comb-K in GatherJoin generates large number of potential frequent itemsets we must work with

Hybrid approach • Previous charts and algorithm analysis show • Vertical spends too much time on pass 2 compared to other algorithms, especially when the support is decreased • GatherJoin degrades when the # of frequent items per transaction increases • To improve performance, use a hybrid algorithm • Use Vertical for most cases • When size of candidate itemset is too large, GatherJoin is a good option if number of frequent items per transaction (Nf) is not too large • When Nf is large, GatherCount may be the only good option

Architecture Comparisons • Compare five alternatives • Loose-Coupling, Stored-procedure • Basically the same except for address space program is being run in • Because of limited difference in performance, focus solely on stored procedure in following charts • Cache-Mine • UDF tight-coupling • Best SQL approach (Hybrid)

Performance Comparisons of Architectures

Association Rules Mining with SQL

Association Rules Mining with SQL

Presentation Transcript

Data Mining Association Rules

Mining Association Rules

Mining Association Rules

DATA MINING - ASSOCIATION RULES-

Mining Association Rules

Chapter 2: Mining Association Rules

Mining Association Rules in Large Databases

Mining Causal Association Rules

Data Mining Association Rules

Association Rules Mining

Mining Association Rules with Constraints

Incremental Mining Association Rules

Incremental Mining of Association Rules

Mining Association Rules from Stars

Mining Non-Derivable Association Rules

Mining Generalized Association Rules

Algorithms for Mining Association Rules

Chapter 5 Mining Association Rules with FP Tree

Data Mining-Association Rules and Clustering

Mining Negative Association Rules

Introduction to Data Mining Mining Association Rules

Incremental Mining of Association Rules