Algorithms for Mining Maximal Frequent Itemsets -- A Survey

Chaojun Lu

Introduction • Frequent Itemset Extension Tree • Common Techniques • Some MFI-Mining Algorithms • Concluding Remarks

Introduction • Terminology and Notations • Problem • Solution

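Terminology and Notations
• set of items: I = {i1, i2, …, in}
• set of transactions: DB = {T1, T2, …, Tm}, Ti ⊆ I
• (k-)itemset: N ⊆ I (|N| = k)
• support of itemset N: supp(N), the number of transactions in DB that contain N
• frequent itemset (fi): an itemset N with supp(N) ≥ minsup
• maximal frequent itemset (mfi): a frequent itemset with no frequent proper superset
• set of all frequent (k-)itemsets: FI, FIk
• set of all mfi: MFI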

Problem
• Discover all maximal frequent itemsets in a given transaction database.
Solution
• Traverse the search space (the subset lattice of I) and count the support of itemsets in DB.
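
As a baseline for the problem statement above, here is a minimal brute-force sketch. The toy database `DB`, the absolute threshold `MINSUP` and the function names are assumptions made for illustration; they do not come from the slides.

```python
from itertools import combinations

# Hypothetical toy database (not from the slides); MINSUP is an absolute count.
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {5}]
MINSUP = 2

def supp(itemset, db=DB):
    """Number of transactions that contain every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in db if s <= t)

def brute_force_mfi(db=DB, minsup=MINSUP):
    """Enumerate the whole subset lattice; feasible only for tiny item sets."""
    items = sorted(set().union(*db))
    frequent = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if supp(c, db) >= minsup]
    # An mfi is a frequent itemset with no frequent proper superset.
    return {f for f in frequent if not any(f < g for g in frequent)}

print(sorted(sorted(m) for m in brute_force_mfi()))  # [[1, 2, 4], [2, 3]]
```

Every algorithm in this survey should return the same MFI as this baseline while visiting far fewer itemsets.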

Solution (cont.)
• Traversing the search space by:
  • Brute force: 2^|I| itemsets
  • Clever use of the basic property of itemsets: A ⊆ B ⇒ supp(A) ≥ supp(B)
    • BP1: All subsets of a known frequent itemset are also frequent.
    • BP2: All supersets of a known infrequent itemset are also infrequent.
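
The basic property can be checked exhaustively on a small example. The sketch below uses an assumed toy database and only verifies the inequality; it is not a mining algorithm.

```python
from itertools import combinations

DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {5}]  # assumed toy data

def supp(itemset):
    s = set(itemset)
    return sum(1 for t in DB if s <= t)

items = sorted(set().union(*DB))
lattice = [frozenset(c) for k in range(len(items) + 1)
           for c in combinations(items, k)]

# The brute-force search space has 2^|I| itemsets (including the empty set).
print(len(lattice) == 2 ** len(items))

# Basic property: A is a subset of B implies supp(A) >= supp(B).
violations = [(a, b) for a in lattice for b in lattice
              if a <= b and supp(a) < supp(b)]
print(violations)
```

BP1 and BP2 are the two directions of this inequality relative to minsup.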

Frequent Itemset eXtension Tree • Purpose • Idea • Description • Problem Re-formulated

Purpose
• To provide a general framework for analyzing and comparing existing MFI-mining algorithms.
Idea
• Larger frequent itemsets are generated by extending known smaller frequent itemsets with suitable items.
• The FIXTree captures and illustrates this extension process.

Description of FIXTree
• Root: ∅
• Nodes: frequent itemsets
• Each node N is associated with its candidate extensions CX(N) and frequent extensions FX(N), defined as:
  • CX(N) = {x | x ∈ I and N ∪ {x} may be frequent}
  • FX(N) = {x | x ∈ CX(N) and N ∪ {x} is frequent}
• Parent-child relation P → C: C is a frequent extension of P, i.e. C = P ∪ {x} for some x ∈ FX(P).

Example (each node is written as N (CX(N)/FX(N)))
∅ ({1,2,3,4,5}/{1,2,3,4})
  1 ({2,3,4}/{2,4})
    12 ({4}/{4})
      124 (∅/∅)
    14 (∅/∅)
  2 ({3,4}/{3,4})
    23 ({4}/∅)
    24 (∅/∅)
  3 …
  4 …
Problem Re-formulated
• Generate as small a FIXTree containing MFI as possible while searching the subset lattice of I.
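
A minimal sketch of FIXTree generation, assuming a toy database chosen so that it reproduces the CX/FX sets of the example (the database itself is not given in the slides). Candidate extensions of a child are taken from the later frequent extensions of its parent.

```python
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {5}]  # assumed toy data
MINSUP = 2                                                # assumed threshold

def supp(itemset):
    return sum(1 for t in DB if itemset <= t)

def fixtree(node, cx, out):
    """Record (N, CX(N), FX(N)) for the subtree rooted at `node`, depth-first."""
    fx = [x for x in cx if supp(node | {x}) >= MINSUP]
    out.append((sorted(node), list(cx), fx))
    for i, x in enumerate(fx):
        # Candidates of the child are the later frequent extensions of the parent.
        fixtree(node | {x}, fx[i + 1:], out)
    return out

nodes = fixtree(frozenset(), [1, 2, 3, 4, 5], [])
for n, cx, fx in nodes:
    print(n, cx, fx)
```

On this data the full tree has 10 nodes, including 1 ({2,3,4}/{2,4}) and 23 ({4}/∅) as in the example.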

Common Techniques • Search Strategies • Pruning Strategies • Dynamic Reordering • Data Representation for Fast Support Counting • Frequency Determination

Search Strategies
• We can generate the FIXTree via:
  • Breadth-first search
  • Depth-first search
  • Hybrid search
• For MFI mining, it is unnecessary to generate and count all nodes. Instead, we try to generate as few nodes of the FIXTree as possible, so long as MFI can still be identified.

Pruning Strategies
• BasicPS1: Prune the subtrees rooted at node N’s infrequent extensions.
  1 ({2,3,4}/{2,4})
    12 ({4}/{4})
    13 (infrequent, pruned)
    14 (∅/∅)
• Note: This strategy greatly improves a PURE DFS algorithm for mining long patterns.

Pruning Strategies (cont.)
• BasicPS2: Node N’s CX(N) comes from its parent node P’s FX(P). Let N = P ∪ {x} with x ∈ FX(P); then CX(N) = {y | y ∈ FX(P) and y > x}.
  1 ({2,3,4}/{2,4})
    12 ({4}/…)
    14 (∅/…)

Pruning Strategies (cont.)
• MaxPS1: At node N, if N ∪ CX(N) ⊆ M for a known fi/mfi M, then the N-subtree may be pruned.
• MaxPS2 (look-ahead): At node N, if N ∪ CX(N) is found to be frequent by support counting, then all of N’s children may be pruned (and a possible new mfi is produced).
  1 ({2,3,4}/…)
    12
      123
        1234
      124
    13
    14
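
MaxPS1 and MaxPS2 can be combined in a small depth-first miner. This is a sketch under assumptions (toy database, absolute `MINSUP`, plain support counting), not the code of any surveyed algorithm. It relies on left-to-right depth-first order: a leaf that is not covered by an already known mfi is itself maximal.

```python
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {5}]  # assumed toy data
MINSUP = 2                                                # assumed threshold

def supp(itemset):
    return sum(1 for t in DB if itemset <= t)

def mine(node, cx, mfi, visited):
    """Depth-first search with MaxPS1 (superset check) and MaxPS2 (look-ahead)."""
    visited.append(sorted(node))
    head_and_tail = node | set(cx)
    # MaxPS1: everything below N is contained in a known mfi.
    if any(head_and_tail <= m for m in mfi):
        return
    # MaxPS2: N ∪ CX(N) is itself frequent, so all children of N are skipped.
    if cx and supp(head_and_tail) >= MINSUP:
        mfi.append(head_and_tail)
        return
    fx = [x for x in cx if supp(node | {x}) >= MINSUP]
    for i, x in enumerate(fx):
        mine(node | {x}, fx[i + 1:], mfi, visited)
    # A leaf (no frequent extension) is maximal unless a known mfi covers it.
    if not fx and node and not any(node <= m for m in mfi):
        mfi.append(node)

mfi, visited = [], []
mine(frozenset(), [1, 2, 3, 4, 5], mfi, visited)
print([sorted(m) for m in mfi])  # [[1, 2, 4], [2, 3]]
print(len(visited))              # 9
```

Node 124 is never visited: the look-ahead at node 12 already finds 12 ∪ CX(12) = 124 frequent.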

Pruning Strategies (cont.)
• MaxPS3: At node N, if N ∪ CX(N) is frequent, then all of N’s right-hand-side siblings may be pruned. (Those branches won’t produce new mfi.)
  ∅ ({1,2,3,4,5}/{1,2,3,4})
    1 …
    2 ({3,4}/…)
    3 …
    4 …

Pruning Strategies (cont.)
• DFMaxPS: In DFS, AFTER the recursive call DFS(Ni), check whether the leftmost path N ∪ {i, …, n} is frequent. If yes, then Ni’s right-hand-side siblings may be pruned. (These won’t produce new mfi.)
  N (…/{1,2,…,n})
    N1
    …
    Ni ({i+1,…,n})
    N(i+1)
    …
    Nn

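Pruning Strategies (cont.)
• EquivPS: At node N, if supp(N ∪ {x}) = supp(N) for some x ∈ CX(N), then N can be replaced by N ∪ {x}, with CX(N ∪ {x}) = CX(N) - {x}.
• Reason: every transaction that contains N also contains x, so itemsets containing N but not x cannot be mfi.
  Before: N ({x,y,z}/…) with children Nx …, Ny …, Nz … and grandchildren Nxy …, Nxz …
  After: Nx ({y,z}/…) with children Nxy …, Nxz …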
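
EquivPS as a small helper, assuming a toy database and a hypothetical function name `equiv_prune`. All candidates with unchanged support are absorbed at once, which is safe because each of them occurs in every transaction that contains N.

```python
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {5}]  # assumed toy data

def supp(itemset):
    return sum(1 for t in DB if itemset <= t)

def equiv_prune(node, cx):
    """Move every x in CX(N) with supp(N ∪ {x}) = supp(N) into N."""
    base = supp(node)
    absorbed = [x for x in cx if supp(node | {x}) == base]
    remaining = [x for x in cx if x not in absorbed]
    return node | set(absorbed), remaining

new_node, new_cx = equiv_prune(frozenset({1}), [2, 3, 4])
print(sorted(new_node), new_cx)  # [1, 2, 4] [3]
```

Here item 1 only occurs together with 2 and 4, so node 1 is replaced by 124 in one step.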

Dynamic Reordering
• The order in which items are used to extend itemsets greatly affects MFI-mining algorithms.
• Two heuristics:
  • DR1: At node N, reorder all x ∈ FX(N) in increasing order of supp(N ∪ {x}).
  1 {2,3,4}
    12 {4,3}
      124 {3}
        1243
      123
    13 {4}
      134
    14
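
DR1 amounts to a sort keyed on the support of each extension. The sketch assumes a toy database and a hypothetical helper name. Putting the most frequent items last makes them appear in most candidate sets, which tends to make the look-ahead test on N ∪ CX(N) succeed more often.

```python
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {5}]  # assumed toy data

def supp(itemset):
    s = set(itemset)
    return sum(1 for t in DB if s <= t)

def reorder_dr1(node, fx):
    """Sort FX(N) by increasing supp(N ∪ {x}); ties keep their original order."""
    return sorted(fx, key=lambda x: supp(set(node) | {x}))

print(reorder_dr1(set(), [1, 2, 3, 4]))  # [1, 3, 4, 2]
```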

Dynamic Reordering (cont.)
• DR2: Reorder the items of FX(∅) (i.e. FI1) in decreasing order of |IF(x)| for x ∈ FI1, where IF(x) = {y | y ∈ FI1 and xy is infrequent}.
• Notes:
  • |M(x)| ≤ |FI1| - |IF(x)|, where M(x) is the longest mfi containing x.
  • Use DR2 together with DR1 to order FI1.
  • Compute FI1 and FI2 before using DR2.
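
A sketch of DR2 on an assumed toy database. Using DR1 (increasing support) as the tie-breaker is one reading of "DR2 + DR1" and is an assumption here, as are the helper names.

```python
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {5}]  # assumed toy data
MINSUP = 2                                                # assumed threshold

def supp(itemset):
    s = set(itemset)
    return sum(1 for t in DB if s <= t)

items = sorted(set().union(*DB))
FI1 = [x for x in items if supp({x}) >= MINSUP]

def infrequent_partners(x):
    """IF(x): frequent items y such that the 2-itemset xy is infrequent."""
    return {y for y in FI1 if y != x and supp({x, y}) < MINSUP}

# DR2: decreasing |IF(x)|; ties broken by increasing support (assumed DR1 rule).
order = sorted(FI1, key=lambda x: (-len(infrequent_partners(x)), supp({x})))

# Upper bound on the length of any mfi containing x: |FI1| - |IF(x)|.
bound = {x: len(FI1) - len(infrequent_partners(x)) for x in FI1}

print(order)  # [3, 1, 4, 2]
print(bound)  # {1: 3, 2: 4, 3: 2, 4: 3}
```

On this data the mfi are 124 and 23, so the bounds for items 1, 3 and 4 are tight.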

Data Representation
• Data representation
  • transaction: set of items or bitstring
  • tid-list for each item(set)
  • FP-tree
  • vertical bitmap for each item(set)
  • diffset
• Count support on the entire DB or on a sub-DB?
• Counting techniques
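
Two of the vertical representations, tid-lists and diffsets, can be sketched on an assumed toy database. The diffset of PX relative to prefix P is d(PX) = t(P) - t(PX), and the recurrences used are d(PXY) = d(PY) - d(PX) and supp(PXY) = supp(PX) - |d(PXY)|.

```python
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {5}]  # assumed toy data

# tid-list: for each item, the set of ids of the transactions that contain it.
tidlist = {}
for tid, t in enumerate(DB):
    for x in t:
        tidlist.setdefault(x, set()).add(tid)

# Support by tid-list intersection.
supp_23 = len(tidlist[2] & tidlist[3])

# Diffsets relative to the prefix P = {2}.
d_23 = tidlist[2] - tidlist[3]          # transactions with 2 but not 3
d_24 = tidlist[2] - tidlist[4]          # transactions with 2 but not 4
d_234 = d_24 - d_23                     # d(PXY) = d(PY) - d(PX)
supp_234 = (len(tidlist[2]) - len(d_23)) - len(d_234)

print(supp_23, supp_234)  # 2 0
```

Diffsets store only what changes between a prefix and its extension, which is why they tend to be much smaller than tid-lists on dense data.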

Frequency Determination
• We can determine that an itemset N is frequent via:
  • Directly counting supp(N) in DB
  • A known frequent superset of N
  • A lower bound on supp(N) that reaches minsup

Lower Bound Technique
• Obtain a lower bound on supp(N) based on support information of N’s subsets.
• Define drop(N, x) = supp(N) - supp(N ∪ {x}). Then:
  • supp(N ∪ {x}) = supp(N) - drop(N, x) ≥ supp(N) - drop(M, x), where M ⊆ N.
  • supp(N ∪ X) ≥ supp(N) - Σ_{x ∈ X} drop(M, x), where M ⊆ N.

Lower Bound Technique (cont.)
• LB-PS
• We already have supp(N), supp(N1), supp(N2), supp(N3), so we can compute the lower bound
  supp(N123) ≥ supp(N) - drop(N,1) - drop(N,2) - drop(N,3)
  and check whether it is ≥ minsup.
• If yes, then prune the N2 and N3 branches (cf. MaxPS3).
  N (…/{1,2,3})
    N1 ({2,3}/…)
    N2 ({3}/…)
    N3
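
The lower bound can be sketched as follows. The database, the node N = {0} and the helper names are assumptions for illustration; the bound only uses supp(N) and the supports of the one-item extensions.

```python
# Assumed toy data: item 0 plays the role of node N, items 1..3 are its extensions.
DB = [{0, 1, 2, 3}] * 8 + [{0, 1, 2}, {0, 1, 3}, {0, 2, 3}, {1, 2, 3}]

def supp(itemset):
    return sum(1 for t in DB if itemset <= t)

def drop(n, x):
    """drop(N, x) = supp(N) - supp(N ∪ {x})."""
    return supp(n) - supp(n | {x})

def lower_bound(n, xs, m=None):
    """supp(N) minus the sum of drop(M, x) over x in X, for a subset M of N."""
    m = n if m is None else m
    return supp(n) - sum(drop(m, x) for x in xs)

N = frozenset({0})
lb = lower_bound(N, [1, 2, 3])
actual = supp(N | {1, 2, 3})
print(lb, actual)  # 8 8
```

With minsup = 8, N123 is known to be frequent without counting it, so the N2 and N3 branches could be skipped.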

Some MFI-Mining Algorithms • Apriori • Pincer- Search • FP-growth • Max-Miner • DepthProject • MAFIA • GenMax

Apriori
• Breadth-first
• Key steps: given FIk,
  • Generate Ck+1
    • Join (extending FIk using BasicPS2)
    • Prune (BP2)
  • Count the support of Ck+1 to obtain FIk+1
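
The key steps can be sketched as a level-wise loop. Itemsets are kept as sorted tuples; the toy database and function names are assumptions, and support is counted by a plain scan rather than the data structures a real implementation would use.

```python
from itertools import combinations

DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {5}]  # assumed toy data

def apriori_gen(fik):
    """Join step + prune step: build C(k+1) from the frequent k-itemsets."""
    fik = sorted(tuple(sorted(f)) for f in fik)
    known = set(fik)
    candidates = []
    for a, b in combinations(fik, 2):
        if a[:-1] == b[:-1]:                      # join: same (k-1)-prefix
            c = a + (b[-1],)
            # prune (BP2): every k-subset of c must be frequent
            if all(s in known for s in combinations(c, len(c) - 1)):
                candidates.append(c)
    return candidates

def apriori(db, minsup):
    def supp(c):
        return sum(1 for t in db if set(c) <= t)
    items = sorted(set().union(*db))
    level = [(x,) for x in items if supp((x,)) >= minsup]
    frequent = []
    while level:                                   # one pass per itemset size
        frequent.extend(level)
        level = [c for c in apriori_gen(level) if supp(c) >= minsup]
    return frequent

print(apriori_gen([(1, 2), (1, 4), (2, 3), (2, 4)]))  # [(1, 2, 4)]
print(len(apriori(DB, 2)))                            # 9
```

Candidate 234 is joined from 23 and 24 but pruned without counting, because its subset 34 is not frequent.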

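Apriori (cont.)
• Symmetry of the FI-mining problem
• (Figure: the subset lattice between ∅ and {1,2,…,n}. Bottom-up, FIk is extended to Ck+1, which is counted to give FIk+1; top-down, IFk+1 is reduced to Ck, which is counted to give IFk.)
• Extension-based vs reduction-based
• Frequent vs infrequent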

Pincer-Search
• Hybrid search (top-down + bottom-up)
• Key steps (initially CMFI = {I}): given FIk-1, Ck, CMFI and MFI,
  • Count Ck ∪ CMFI to obtain FIk, IFIk and new MFI
  • Use MFI to prune FIk (BP1, MaxPS)
  • Use IFIk to update CMFI
  • Generate Ck+1
    • Join (extending FIk using BasicPS2)
    • Recover missing candidates
    • Prune (BP2)

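Pincer-Search (cont.)
(Figure: the subset lattice between ∅ and 12345. The bottom-up search covers 1, 2, 3, 4, 5 and 12, 13, 14, 23, 24, 34; the top-down search goes from 12345 to 1234; two regions of the lattice are marked as pruned.)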

FP-Growth
• FP-tree: a compact form of DB/sub-DB
• Key steps:
  FP-growth(N, N-tree)
    if N-tree is a single path (x, y, z) then
      a possible mfi N ∪ {x,y,z} is found
    else {
      extend N with x ∈ FX(N)
      construct Nx-tree
      FP-growth(N ∪ {x}, Nx-tree)
    }

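FP-Growth (cont.)
(Figure: an FP-tree with header items f, c, a, b, m, p and nodes f:4, c:3, a:3, m:2, p:2, m:1, c:1, p:1 and three nodes b:1, shown next to the FIXTree nodes p (mbacf/c) and m (bacf/acf); node cp is pruned.)
• p’s subDB: fcam, fcam, cb; p’s FP-tree: c
• m’s subDB: fca, fca, fcab; m’s FP-tree: fca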

FP-Growth (cont.)
• Depth-first
• MaxPS (if used for MFI mining)
• Dynamic reordering
• Projected subDB
• Without candidate generation?
  • Constructing the subDB for N corresponds to CX(N)
  • Single path corresponds to MaxPS
  • Mining frequent 1-itemsets in the subDB corresponds to FX(N)

Max-Miner
• Breadth-first + pruning
• Key steps: at node N with CX(N),
  • Count N ∪ CX(N) and N ∪ {x} for x ∈ CX(N) to get FX(N)
  • If N ∪ CX(N) is frequent, prune using MaxPS2
  • Reorder FX(N) using DR1
  • Generate N’s children N ∪ {x} for x ∈ FX(N), with CX(N ∪ {x}) = {y | y ∈ FX(N) and y > x}
  • MaxPS3 + LB-PS

DepthProject
• Depth-first + pruning
• Key steps: at node N with CX(N), call DP(N, DB)
  • Count N ∪ {x} in DB to obtain FX(N)
  • Prune using DFMaxPS, MaxPS1
  • Project DB to obtain subDB (if necessary)
  • Reorder FX(N) using DR1
  • For each x ∈ FX(N): DP(N ∪ {x}, subDB)
• Output: a superset of MFI

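DepthProject (cont.)
• Projected DB
(Figure: a database DB with transactions abc, acd, abe, bd and its projection for node a ({b,c}), in which the transactions containing a are reduced to bc, c, b and shown as bitstrings.)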
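
Database projection can be sketched in a few lines. The toy database and the function name `project` are assumptions; transactions are kept as Python sets here rather than the bitstrings mentioned for DepthProject.

```python
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {5}]  # assumed toy data

def project(db, node, cx):
    """Keep the transactions that contain `node`, restricted to the items in `cx`."""
    keep = set(cx)
    return [t & keep for t in db if node <= t]

sub = project(DB, {2}, [3, 4])

# Counting item x in the projected DB gives supp(N ∪ {x}) in the full DB.
counts = {x: sum(1 for t in sub if x in t) for x in [3, 4]}
print(sub)     # [{4}, {4}, {3}, {3}]
print(counts)  # {3: 2, 4: 2}
```

Deeper nodes then count against ever smaller projected databases instead of the full DB.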

DepthProject (cont.)
• Project DB for some nodes on a path
• Bitstring representation
• Byte counting
• Bucket counting

MAFIA
• Depth-first + pruning
• Key steps: at node N, call MAFIA(N, MFI)
  • If N ∪ CX(N) ⊆ M for some M ∈ MFI, then prune using MaxPS1
  • Count N ∪ {x} to obtain FX(N), using EquivPS
  • Reorder FX(N) using DR1
  • For each x ∈ FX(N): MAFIA(N ∪ {x}, MFI)
  • If on the leftmost path, prune using DFMaxPS

MAFIA (cont.)
• Data representation: vertical bitmap and byte counting
• Bitmap of item(set) N: bmp(N), where bit j is 1 if transaction j contains N and 0 otherwise
• t(N ∪ {x}) = t(N) ∩ t(x), computed as bmp(N) AND bmp(x)
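
A sketch of vertical bitmaps with byte counting, using Python integers as bit vectors. The toy database and function names are assumptions; the lookup table holds the number of 1 bits for each of the 256 byte values.

```python
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {5}]  # assumed toy data

def bitmap(item):
    """Vertical bitmap of an item: bit j is set if transaction j contains it."""
    bits = 0
    for tid, t in enumerate(DB):
        if item in t:
            bits |= 1 << tid
    return bits

# Byte counting: precomputed number of 1 bits for every possible byte value.
POPCOUNT = [bin(b).count("1") for b in range(256)]

def count_bits(bits):
    """Support of a bitmap, summed one byte at a time via the lookup table."""
    total = 0
    while bits:
        total += POPCOUNT[bits & 0xFF]
        bits >>= 8
    return total

bmp_12 = bitmap(1) & bitmap(2)      # bmp(N ∪ {x}) = bmp(N) AND bmp(x)
bmp_124 = bmp_12 & bitmap(4)
print(count_bits(bmp_12), count_bits(bmp_124))  # 2 2
```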

GenMax
• Depth-first + pruning
• Key steps:
  • Compute FI1 and FI2
  • Reorder FI1 using DR2 + DR1
  • MFI = ∅ (used for MaxPS1)
  • LMFI(∅, FI1, MFI) // use diffsets
  • Return MFI

GenMax (cont.)
• MFI-subset check: progressive focusing
  LMFI(N, FX(N), LMFI)
    For each x ∈ FX(N)
      Generate N ∪ {x} with CX(Nx)
      If Nx ∪ CX(Nx) ⊆ M for some M ∈ LMFI then return // MaxPS1
      Count CX(Nx) to obtain FX(Nx)
      Update LMFI to obtain newLMFI
      LMFI(Nx, FX(Nx), newLMFI)
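
The recursion with progressive focusing can be sketched as below. This is a simplified reading, not the GenMax implementation: it counts support by scanning an assumed toy database instead of using diffsets, skips the DR2 reordering, and the function name is hypothetical. The list passed down to a child holds only the known mfi that contain the new item, and any mfi found below are merged back into the parent's list.

```python
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {5}]  # assumed toy data
MINSUP = 2                                                # assumed threshold

def supp(itemset):
    return sum(1 for t in DB if itemset <= t)

def lmfi_backtrack(node, cands, lmfi):
    """`lmfi` holds the known mfi that contain `node`; it is extended in place."""
    for i, x in enumerate(cands):
        child = node | {x}
        tail = cands[i + 1:]
        # MaxPS1: later siblings are covered by the same mfi, so return.
        if any(child | set(tail) <= m for m in lmfi):
            return
        fx = [y for y in tail if supp(child | {y}) >= MINSUP]
        if not fx:
            if not any(child <= m for m in lmfi):
                lmfi.append(child)                  # new mfi found at a leaf
        else:
            local = [m for m in lmfi if x in m]     # progressive focusing
            known = len(local)
            lmfi_backtrack(child, fx, local)
            lmfi.extend(local[known:])              # merge newly found mfi

items = sorted(set().union(*DB))
fi1 = [x for x in items if supp(frozenset({x})) >= MINSUP]
mfi = []
lmfi_backtrack(frozenset(), fi1, mfi)
print([sorted(m) for m in mfi])  # [[1, 2, 4], [2, 3]]
```

The subset checks at deeper nodes run against ever shorter local lists instead of the whole MFI.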

GenMax (cont.)
• MFI-subset check optimization: check for local MFI
• DR2
• Data representation: diffsets

Concluding Remarks • Independent components can fit together nicely • Search strategy: hybrid • Pruning strategy and dynamic reordering • Data projection, bitmap representation, fast counting, compression • Different algorithms perform well under different MFI distributions • MAFIA and GenMax: current state-of-the-art

References
• R. C. Agarwal, et al. Depth first generation of long patterns.
• R. J. Bayardo. Efficiently mining long patterns from databases.
• D. Burdick, et al. MAFIA: a maximal frequent itemset algorithm for transactional databases.
• K. Gouda, et al. Efficiently mining maximal frequent itemsets.
• J. Han, et al. Mining frequent patterns without candidate generation.
• D-I Lin, et al. Pincer-search: an efficient algorithm for discovering the maximum frequent set.

Thank You!
