
A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming
Chuang-Kai Chiou, Judy C. R. Tseng
Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007



Presentation Transcript


  1. A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming • Chuang-Kai Chiou, Judy C. R. Tseng • Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007

  2. Outline • Introduction • Apriori Algorithm • DHP Algorithm • MPIP Algorithm • SIT Algorithm • Experiment and Evaluation • Conclusion and Future works

  3. Introduction • Apriori algorithm • A large number of candidate itemsets is generated. • Several hash-based algorithms use hash functions to filter out potential-less candidate itemsets: • DHP algorithm • MPIP algorithm • SIT algorithm (this work) • Uses sorting, indexing, and trimming techniques to reduce the number of itemsets to be considered. • Combines the advantages of the Apriori and MPIP algorithms.

  4. Apriori Algorithm • [Figure: the Apriori flow on database D — scan D to count candidates C1 and keep frequent 1-itemsets L1; generate C2, scan D to obtain L2; generate C3, scan D to obtain L3.]
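The flow in the figure can be sketched in a few lines of Python. This is a minimal illustration of the generate-and-scan loop, not the authors' implementation; the function name and structure are my own.

```python
def apriori(transactions, min_support):
    """Plain Apriori sketch: generate candidates Ck, then scan the
    database to keep the frequent itemsets Lk, until no Lk remains."""
    # First scan of D: count single items to get L1
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    Lk = {frozenset([i]) for i, c in counts.items() if c >= min_support}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Candidate generation: join L(k-1) with itself
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Scan D: count how many transactions contain each candidate
        cand_counts = {c: 0 for c in Ck}
        for t in transactions:
            ts = set(t)
            for c in Ck:
                if c <= ts:
                    cand_counts[c] += 1
        Lk = {c for c, n in cand_counts.items() if n >= min_support}
        frequent |= Lk
        k += 1
    return frequent
```

Every level requires a full scan of D, and Ck can grow large before support counting prunes it; this is the cost the hash-based and SIT approaches aim to reduce.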

  5. DHP Algorithm • [Figure: DHP example — transaction database and hash-bucket counts.]

  6. MPIP Algorithm (1/2) • MPIP employs a minimal perfect hashing function for mining L1 and L2. • It avoids the collision problem that occurs in DHP. • The time needed for scanning and searching data items is reduced. • It employs the Apriori algorithm to find the frequent k-itemsets for k > 2.
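The hash-based idea behind DHP, which MPIP refines, can be sketched as follows. While scanning for L1, every 2-itemset of each transaction is hashed into a bucket; a pair can only be frequent if its bucket count reaches the minimum support, so low-count buckets prune C2. This sketch uses an ordinary (collision-prone) hash as DHP does; MPIP's contribution is to replace it with a minimal perfect hash so each pair gets its own slot. Function name and bucket count are my own choices.

```python
from itertools import combinations

def dhp_style_c2_filter(transactions, min_support, n_buckets=7):
    """One scan: count single items for L1 and hash all 2-itemsets
    into buckets; keep only C2 pairs whose bucket passes the threshold."""
    item_counts, buckets = {}, [0] * n_buckets
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    L1 = {i for i, c in item_counts.items() if c >= min_support}
    # A bucket's count is at least the true count of any pair it holds,
    # so no genuinely frequent pair is ever filtered out.
    C2 = {pair for pair in combinations(sorted(L1), 2)
          if buckets[hash(pair) % n_buckets] >= min_support}
    return L1, C2
```

Because of collisions, an infrequent pair sharing a bucket with frequent ones may survive the filter (a false positive, removed by the later support count); with a minimal perfect hash, as in MPIP, each bucket holds exactly one pair and its count is exact.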

  7. MPIP Algorithm (2/2) • [Figure: MPIP example.]

  8. SIT Algorithm (1/5) • For mining association rules, we propose a revised algorithm, the Sorting-Indexing-Trimming (SIT) approach. • The SIT approach avoids generating potential-less candidate itemsets and enhances performance via sorting, indexing and trimming.

  9. SIT Algorithm (2/5) • Sorting: (1) Start from the original transaction database. (2) Count each item's occurrence frequency. (3) Sort the items by count in increasing order and build a mapping table. (4) Translate the items into their mapping numbers. (5) Re-sort the items within each transaction.
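The five sorting steps above can be sketched directly in Python. This is an illustrative rendering under my own naming, assuming mapping numbers are assigned 1, 2, ... in increasing order of frequency as the slide describes.

```python
def sort_and_map(transactions):
    """Steps (1)-(5): count frequencies, sort items by count in
    increasing order, build the mapping table, translate items to
    mapping numbers, and re-sort each transaction."""
    # (2) Count each item's occurrence frequency
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    # (3) Sort by count, increasing, and build the mapping table
    ordered = sorted(counts, key=lambda i: counts[i])
    mapping = {item: n for n, item in enumerate(ordered, start=1)}
    # (4) + (5) Translate to mapping numbers and re-sort each transaction
    return mapping, [sorted(mapping[i] for i in t) for t in transactions]
```

After this pass, the rarest items carry the smallest numbers and sit at the front of every transaction, which is what makes the trimming step a simple starting-position offset.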

  10. SIT Algorithm (3/5) • Indexing • [Figure: plain Apriori search versus indexed search with an index table; comparing count = 69.]

  11. SIT Algorithm (4/5) • Trimming • If the minimum support is 3, all items with frequency less than 3 are trimmed. • To preserve the data, physical trimming is avoided: we only record the starting position in each transaction and generate the hash table from that position.
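Logical trimming can be sketched as below. Because each transaction is already sorted so that low-frequency items come first, trimming reduces to finding the index of the first item whose count reaches the minimum support; the data before that index is kept but skipped. Function and variable names are my own.

```python
def trim_positions(sorted_transactions, counts, min_support):
    """For each (sorted) transaction, record the starting position of
    the first frequent item; nothing is physically deleted, and later
    passes simply begin reading from that position."""
    starts = []
    for t in sorted_transactions:
        pos = 0
        while pos < len(t) and counts[t[pos]] < min_support:
            pos += 1
        starts.append(pos)  # original data preserved; reads skip t[:pos]
    return starts
```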

  12. SIT Algorithm (5/5) • The process of the SIT algorithm: • To find L1 and L2: • Apply the sorting, indexing and trimming techniques to the original database. • Apply the MPIP algorithm to find L1 and L2. • To find the k-itemsets for k > 2: • Apply the Apriori algorithm to the database that has been sorted, indexed and trimmed. • Report the frequent itemsets found.

  13. Experiment and Evaluation (1/2) • The experiments focus on two parts: • Performance of Apriori, SI+Apriori, MPIP, and SIT. • Performance of SIT and MPIP under different transaction quantities and lengths. • [Figure: performance of Apriori, SI+Apriori, MPIP, and SIT.]

  14. Experiment and Evaluation (2/2) • [Figure: performance of SIT and MPIP under different transaction quantities and lengths.] • The time for pre-sorting and pre-indexing is included in SIT2.

  15. Conclusion and Future Works • SIT reduces the number of candidate itemsets and avoids generating potential-less candidate itemsets. • The performance of SIT is better than Apriori, DHP and MPIP. • Some problems still need to be dealt with: • When the data set grows, we need to sort and index again before mining association rules. • Mapping items into their corresponding index numbers is time-consuming for long transactions.
