
Parallel Mining of Closed Sequential Patterns

Parallel Mining of Closed Sequential Patterns. Shengnan Cong, Jiawei Han, David Padua. Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, Illinois, USA, 2005. Advisor: Jia-Ling Koh


Presentation Transcript


  1. Parallel Mining of Closed Sequential Patterns • Shengnan Cong, Jiawei Han, David Padua • Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, Illinois, USA, 2005 • Advisor: Jia-Ling Koh • Speaker: Chun-Wei Hsieh

  2. Introduction • Numerous applications: • DNA sequences, web log analysis, customer shopping sequences, XML query access patterns, … • Closed sequential patterns • carry all the information of the complete set of frequent sequential patterns • are more compact • Many applications are time-critical and involve huge volumes of data.

  3. Sequential Algorithm-BIDE • Step 1: Identify the frequent 1-sequences • Step 2: Project the dataset along each frequent 1-sequence • Step 3: Mine each resulting projected dataset

  4. Sequential Algorithm-BIDE • The projected dataset for sequence AB is {C, CB, C, BCA}.
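
The projection step can be sketched as follows. This is a minimal illustration, not the BIDE implementation: the input dataset `["CAABC", "ABCB", "CABC", "ABBCA"]` is an assumption (the slide's figure is not in the transcript), chosen because it reproduces the projected dataset stated above, and the function name `project` is hypothetical. Each sequence is a string of single-character items.

```python
def project(dataset, prefix):
    """Return the suffix of each sequence after the leftmost match of prefix.

    The prefix is matched as a subsequence (items need not be adjacent).
    Sequences that do not contain the prefix are dropped.
    """
    projected = []
    for seq in dataset:
        pos = 0
        for item in prefix:
            pos = seq.find(item, pos)  # leftmost occurrence at or after pos
            if pos == -1:
                break                  # prefix not contained in this sequence
            pos += 1                   # continue searching after the match
        else:
            projected.append(seq[pos:])
    return projected


# Assumed example dataset (not given in the transcript).
dataset = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(project(dataset, "AB"))
```

Mining then recurses on each projected dataset, which is Step 3 of the sequential algorithm.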

  5. Task Decomposition • 1. Each processor counts the occurrences of 1-sequences in a different part of the dataset. A global add reduction is executed to obtain the overall counts. • 2. Build pseudo-projections. This is done in parallel by assigning a different part of the dataset to each processor. The pseudo-projections are communicated to all processors via an all-to-all broadcast. • 3. Use dynamic scheduling to distribute the processing of the projections across processors.
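
Step 1 can be simulated in a single process as a rough sketch. The assumptions here are loud ones: the "processors" are just list partitions, the global add reduction is modelled by summing `Counter` objects rather than by a message-passing call, each item is counted once per sequence (support counting), and the dataset and function names are hypothetical.

```python
from collections import Counter
from functools import reduce


def local_counts(partition):
    """Count, for one processor's partition, how many sequences contain each item."""
    counts = Counter()
    for seq in partition:
        counts.update(set(seq))  # count an item at most once per sequence
    return counts


def global_add_reduction(per_processor_counts):
    """Stand-in for a global add reduction: element-wise sum of all local counts."""
    return reduce(lambda a, b: a + b, per_processor_counts, Counter())


# Assumed example dataset, split round-robin across two simulated processors.
dataset = ["CAABC", "ABCB", "CABC", "ABBCA"]
num_processors = 2
partitions = [dataset[p::num_processors] for p in range(num_processors)]
totals = global_add_reduction([local_counts(part) for part in partitions])
print(totals)
```

After the reduction every processor holds the same overall counts, so each can decide locally which 1-sequences are frequent.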

  6. Task Decomposition • In the second step, it is more efficient to implement the broadcast using a virtual ring structure. • Assume there are N processors. Processor K: • receives packages only from Processor ((K-1) mod N) • sends packages only to Processor ((K+1) mod N) • The broadcast needs (N-1) send-receive steps and consumes no more than 0.5% of the mining time.
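
The ring scheme can be checked with a small single-process simulation. This is only a sketch of the communication pattern: the function name `ring_all_to_all` and the string "packages" are hypothetical, and real sends and receives are replaced by list operations.

```python
def ring_all_to_all(local_data):
    """Simulate an all-to-all broadcast over a virtual ring of N processors.

    local_data[k] is the package initially owned by processor k.
    Returns (gathered, steps), where gathered[k] lists every package
    processor k holds at the end.
    """
    n = len(local_data)
    gathered = [[block] for block in local_data]
    in_flight = list(local_data)  # package each processor forwards next
    steps = 0
    for _ in range(n - 1):
        # Processor k receives only from processor (k - 1) mod n ...
        received = [in_flight[(k - 1) % n] for k in range(n)]
        for k in range(n):
            gathered[k].append(received[k])
        # ... and forwards what it just received to (k + 1) mod n next step.
        in_flight = received
        steps += 1
    return gathered, steps


gathered, steps = ring_all_to_all(["p0", "p1", "p2", "p3"])
print(steps, gathered[0])
```

With four simulated processors, each one ends up holding all four packages after three send-receive steps, matching the (N-1) count above.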

  7. Task Scheduling • 1. A master processor maintains a queue of pseudo-projection identifiers. Each of the other processors is initially assigned a projection. • 2. After mining a projection, a processor sends a request to the master processor for another projection. • 3. This process continues until the queue of projections is empty.
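
A thread-based sketch of this master/worker scheme follows. It is an illustration under stated assumptions, not the authors' implementation: a shared `queue.Queue` stands in for the master processor and its request messages, threads stand in for processors, and `run_dynamic_schedule` and the `mine` callback are hypothetical names.

```python
import queue
import threading


def run_dynamic_schedule(projection_ids, num_workers, mine):
    """Hand out projections to workers on demand until the queue is empty.

    Returns a dict mapping worker id to the list of projections it mined.
    """
    pending = queue.Queue()  # plays the role of the master's queue
    for pid in projection_ids:
        pending.put(pid)
    assignments = {w: [] for w in range(num_workers)}

    def worker(worker_id):
        while True:
            try:
                pid = pending.get_nowait()  # "request another projection"
            except queue.Empty:
                return                      # queue exhausted: worker stops
            mine(pid)
            assignments[worker_id].append(pid)

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return assignments


result = run_dynamic_schedule(list("ABCDEFG"), 3, lambda pid: None)
print(result)
```

Which worker mines which projection varies from run to run; the only guarantee is that every projection is mined exactly once.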

  8. Task Scheduling • If the largest subtask takes 25% of the total mining time, the best possible speedup is only 4, regardless of the number of processors available. • To improve dynamic scheduling, the approach is to identify the projections that require long mining times and to decompose them.
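
The speedup limit can be illustrated with a small scheduling simulation. The task times below are made up for the illustration (one subtask of 25 time units plus 75 subtasks of 1 unit, so the largest is 25% of the total), and `dynamic_schedule_speedup` is a hypothetical helper that assigns each task to whichever processor becomes idle first.

```python
import heapq


def dynamic_schedule_speedup(task_times, num_processors):
    """Speedup of on-demand scheduling versus running all tasks on one processor."""
    finish = [0.0] * num_processors  # time at which each processor becomes idle
    heapq.heapify(finish)
    for t in task_times:
        earliest = heapq.heappop(finish)      # next idle processor takes the task
        heapq.heappush(finish, earliest + t)
    makespan = max(finish)
    return sum(task_times) / makespan


# Assumed workload: the largest subtask is 25% of the total time.
tasks = [25.0] + [1.0] * 75
for p in (2, 4, 16, 64):
    print(p, dynamic_schedule_speedup(tasks, p))
```

In this workload the speedup reaches 4 at four processors and stays there at 16 and 64, because the parallel run can never finish before the 25-unit subtask does. Splitting that subtask is what lifts the ceiling.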

  9. Relative Mining Time Estimation • Random sampling • selects a random subset of the projections • is not accurate if the overhead is kept small • Selective sampling • uses every sequence of the projections • discards infrequent 1-sequences and the last L frequent 1-sequences (L = a given fraction t × the average length of the sequences in the dataset)

  10. Selective sampling • For example, • assume (A : 4), (B : 4), (C : 4), (D : 3), (E : 3), (F : 3), (G : 1) are the 1-sequences with their counts • the support threshold = 4, so only A, B, and C are frequent • the average length of the sequences in the dataset = 4 • Suppose t = 75% • L = 4 × 0.75 = 3 • Given the sequence AABCACDCFDB, removing the infrequent items D and F leaves AABCACCB, and dropping the last 3 items reduces the sequence to AABCA
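
The worked example can be reproduced with a short sketch. The function name `selective_sample` and its parameters are hypothetical, and truncating L to an integer is an assumption, since the slides do not say how a fractional L is rounded.

```python
def selective_sample(sequence, counts, support_threshold, avg_length, t):
    """Reduce one sequence by selective sampling.

    1. Discard items whose count is below the support threshold.
    2. Discard the last L remaining items, with L = t * average length.
    """
    frequent = {item for item, c in counts.items() if c >= support_threshold}
    kept = [item for item in sequence if item in frequent]
    drop = int(avg_length * t)  # assumption: truncate a fractional L
    return "".join(kept[:max(len(kept) - drop, 0)])


counts = {"A": 4, "B": 4, "C": 4, "D": 3, "E": 3, "F": 3, "G": 1}
print(selective_sample("AABCACDCFDB", counts,
                       support_threshold=4, avg_length=4, t=0.75))
```

Run on the slide's inputs, this prints `AABCA`.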

  11. Relative Mining Time Estimation

  12. Par-CSP Algorithm

  13. Experiments • 64 nodes • OS: Red Hat Linux 7.2 • CPU: 1 GHz Intel Pentium III • RAM: 1 GB • Compiler: GNU g++ 2.96

  14. Experiments • Synthetic Dataset: IBM dataset generator • Real Dataset: Gazelle, Web click-stream

  15. Experiments

  16. Experiments

  17. Experiments

  18. Experiments
