
A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming
Chuang-Kai Chiou, Judy C. R. Tseng
Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007



Presentation Transcript


  1. A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming • Chuang-Kai Chiou, Judy C. R. Tseng • Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007

  2. Outline • Introduction • Apriori Algorithm • DHP Algorithm • MPIP Algorithm • SIT Algorithm • Experiment and Evaluation • Conclusion and Future works

  3. Introduction • Apriori algorithm • A large number of candidate itemsets is generated. • Several hash-based algorithms use hash functions to filter out potential-less candidate itemsets: • DHP algorithm • MPIP algorithm • SIT algorithm (this work) • Uses sorting, indexing, and trimming techniques to reduce the number of itemsets to be considered. • Combines the advantages of the Apriori and MPIP algorithms.

  4. Apriori Algorithm • [Figure: the Apriori flow on database D — scan D to count candidates C1 and keep frequent 1-itemsets L1; generate C2, scan D to obtain L2; generate C3, scan D to obtain L3.]
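The flow in the figure can be sketched in a few lines of Python. This is a minimal illustration of the generate-and-scan loop, not the authors' implementation; the function name and structure are my own.

```python
def apriori(transactions, min_support):
    """Plain Apriori sketch: generate candidates Ck, then scan the
    database to keep the frequent itemsets Lk, until no Lk remains."""
    # First scan of D: count single items to get L1
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    Lk = {frozenset([i]) for i, c in counts.items() if c >= min_support}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Candidate generation: join L(k-1) with itself
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Scan D: count how many transactions contain each candidate
        cand_counts = {c: 0 for c in Ck}
        for t in transactions:
            ts = set(t)
            for c in Ck:
                if c <= ts:
                    cand_counts[c] += 1
        Lk = {c for c, n in cand_counts.items() if n >= min_support}
        frequent |= Lk
        k += 1
    return frequent
```

Every level requires a full scan of D, and Ck can grow large before support counting prunes it; this is the cost the hash-based and SIT approaches aim to reduce.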

  5. DHP Algorithm • [Figure: DHP example — transaction database and hash-bucket counts.]

  6. MPIP Algorithm (1/2) • MPIP employs a minimal perfect hashing function for mining L1 and L2. • It avoids the collision problem that occurs in DHP. • The time needed for scanning and searching data items is reduced. • It employs the Apriori algorithm to find the frequent k-itemsets for k > 2.
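The hash-based idea behind DHP, which MPIP refines, can be sketched as follows. While scanning for L1, every 2-itemset of each transaction is hashed into a bucket; a pair can only be frequent if its bucket count reaches the minimum support, so low-count buckets prune C2. This sketch uses an ordinary (collision-prone) hash as DHP does; MPIP's contribution is to replace it with a minimal perfect hash so each pair gets its own slot. Function name and bucket count are my own choices.

```python
from itertools import combinations

def dhp_style_c2_filter(transactions, min_support, n_buckets=7):
    """One scan: count single items for L1 and hash all 2-itemsets
    into buckets; keep only C2 pairs whose bucket passes the threshold."""
    item_counts, buckets = {}, [0] * n_buckets
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    L1 = {i for i, c in item_counts.items() if c >= min_support}
    # A bucket's count is at least the true count of any pair it holds,
    # so no genuinely frequent pair is ever filtered out.
    C2 = {pair for pair in combinations(sorted(L1), 2)
          if buckets[hash(pair) % n_buckets] >= min_support}
    return L1, C2
```

Because of collisions, an infrequent pair sharing a bucket with frequent ones may survive the filter (a false positive, removed by the later support count); with a minimal perfect hash, as in MPIP, each bucket holds exactly one pair and its count is exact.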

  7. MPIP Algorithm (2/2) • [Figure: MPIP example.]

  8. SIT Algorithm (1/5) • For mining association rules, we propose a revised algorithm, the Sorting-Indexing-Trimming (SIT) approach. • The SIT approach avoids generating potential-less candidate itemsets and enhances performance via sorting, indexing and trimming.

  9. SIT Algorithm (2/5) • Sorting: (1) Start from the original transaction database. (2) Count each item's occurrence frequency. (3) Sort the items by count in increasing order and build a mapping table. (4) Translate the items into their mapping numbers. (5) Re-sort the items within each transaction.
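The five sorting steps above can be sketched directly in Python. This is an illustrative rendering under my own naming, assuming mapping numbers are assigned 1, 2, ... in increasing order of frequency as the slide describes.

```python
def sort_and_map(transactions):
    """Steps (1)-(5): count frequencies, sort items by count in
    increasing order, build the mapping table, translate items to
    mapping numbers, and re-sort each transaction."""
    # (2) Count each item's occurrence frequency
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    # (3) Sort by count, increasing, and build the mapping table
    ordered = sorted(counts, key=lambda i: counts[i])
    mapping = {item: n for n, item in enumerate(ordered, start=1)}
    # (4) + (5) Translate to mapping numbers and re-sort each transaction
    return mapping, [sorted(mapping[i] for i in t) for t in transactions]
```

After this pass, the rarest items carry the smallest numbers and sit at the front of every transaction, which is what makes the trimming step a simple starting-position offset.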

  10. SIT Algorithm (3/5) • Indexing • [Figure: plain Apriori search versus indexed search with an index table; comparing count = 69.]

  11. SIT Algorithm (4/5) • Trimming • If the minimum support is 3, all items with frequency less than 3 are trimmed. • To preserve the data, physical trimming is avoided: we only record the starting position in each transaction and generate the hash table from that position.
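Logical trimming can be sketched as below. Because each transaction is already sorted so that low-frequency items come first, trimming reduces to finding the index of the first item whose count reaches the minimum support; the data before that index is kept but skipped. Function and variable names are my own.

```python
def trim_positions(sorted_transactions, counts, min_support):
    """For each (sorted) transaction, record the starting position of
    the first frequent item; nothing is physically deleted, and later
    passes simply begin reading from that position."""
    starts = []
    for t in sorted_transactions:
        pos = 0
        while pos < len(t) and counts[t[pos]] < min_support:
            pos += 1
        starts.append(pos)  # original data preserved; reads skip t[:pos]
    return starts
```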

  12. SIT Algorithm (5/5) • The process of the SIT algorithm: • To find L1 and L2: • Apply the sorting, indexing and trimming techniques to the original database. • Apply the MPIP algorithm to find L1 and L2. • To find the k-itemsets for k > 2: • Apply the Apriori algorithm to the database that has been sorted, indexed and trimmed. • Report the frequent itemsets found.

  13. Experiment and Evaluation (1/2) • The experiments focus on two parts: • Performance of Apriori, SI+Apriori, MPIP, and SIT. • Performance of SIT and MPIP under different transaction quantities and lengths. • [Figure: performance of Apriori, SI+Apriori, MPIP, and SIT.]

  14. Experiment and Evaluation (2/2) • [Figure: performance of SIT and MPIP under different transaction quantities and lengths.] • The time for pre-sorting and pre-indexing is included in SIT2.

  15. Conclusion and Future Works • SIT reduces the number of candidate itemsets and avoids generating potential-less candidate itemsets. • The performance of SIT is better than Apriori, DHP and MPIP. • Some problems still need to be dealt with: • When the data set grows, we need to sort and index again before mining association rules. • Mapping items into their corresponding index numbers is time-consuming for long transactions.
