
Mining Sequential Patterns

Mining Sequential Patterns. Dimitrios Gunopulos, UCR. Finding Frequent Sequential Patterns. The problem: Given a sequence of discrete events that may repeat: A B A C D A C E B A B C… Find patterns that repeat frequently. For example:


Presentation Transcript


  1. Mining Sequential Patterns Dimitrios Gunopulos, UCR

  2. Finding Frequent Sequential Patterns
     • The problem: Given a sequence of discrete events that may repeat: A B A C D A C E B A B C…
     • Find patterns that repeat frequently. For example: A followed by B (A->B), or A followed by C (A->C).
     • The patterns should occur within a window W.
     • Applications in telecommunication data, networks, biology
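As a rough illustration of the problem statement, the sketch below counts how often one event is followed by another within a window. The slide does not define how W is measured, so the sketch assumes it is a number of subsequent events; the function name `count_followed_by` is hypothetical.

```python
def count_followed_by(events, first, second, window):
    """Count occurrences of `first` that are followed by `second`
    somewhere in the next `window` events (assumed meaning of W)."""
    count = 0
    for i, event in enumerate(events):
        if event != first:
            continue
        # Inspect only the `window` events that come right after position i.
        if second in events[i + 1 : i + 1 + window]:
            count += 1
    return count


# The event sequence from the slide, without the trailing ellipsis.
events = list("ABACDACEBABC")
print(count_followed_by(events, "A", "B", window=2))  # prints 2
print(count_followed_by(events, "A", "C", window=2))  # prints 3
```

With a window of two events, A->C occurs more often than A->B in this short sequence, which is the kind of comparison a frequency threshold would act on.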

  3. Sequences
     [Figure: a sequence annotated with its parts: item (attribute, value), itemset, time t1, later time t2]
     • Sequence: ((T=90F) → (H=60%, P=1.1atm))
     • k-sequence: a sequence with k items
     • T1H2P1T3P2 and P1T2H4P2T5 are 5-sequences
     • S1 is a subsequence of S2 (S1 ⊆ S2)
     • T1P1T2 ⊆ H1T1P2H2P1T2 (T1H1T1 , P1T2H2P1T2)
     • H1P1T2  H1T1P2H2P1T2

  4. Sequential Patterns: The Problem
     • Support (or frequency) of a sequence S, σ(S): the total number of times sequence S is encountered
     • User-specified minimum threshold: min_sup
     • S is frequent ⇔ σ(S) ≥ min_sup
     • S is a maximal frequent sequence ⇔ S is frequent and all of its supersequences are non-frequent
     • S is a minimal non-frequent sequence ⇔ S is non-frequent and all of its subsequences are frequent
     • The problem:
       • Given: database D and min_sup
       • Find all frequent sequences in D
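The definitions of support and frequency can be sketched in a few lines. This is a simplification under stated assumptions: each database entry is a flat string of single items (no itemsets), and support counts each database sequence at most once. The names `is_subsequence`, `support`, and `is_frequent` are hypothetical.

```python
def is_subsequence(s, t):
    """True if the items of s appear in t in the same order (gaps allowed)."""
    remaining = iter(t)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in s)


def support(s, database):
    """Number of database sequences that contain s (assumed counting rule)."""
    return sum(1 for t in database if is_subsequence(s, t))


def is_frequent(s, database, min_sup):
    """A sequence is frequent when its support reaches min_sup."""
    return support(s, database) >= min_sup


# Toy database, invented for illustration only.
database = ["ABCD", "ACD", "ABD", "BCD"]
print(support("AD", database))          # prints 3
print(is_frequent("ABC", database, 2))  # prints False
```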

  5. Example Database

  6. Algorithms for Sequential Patterns
     • Apriori, GSP [Srikant, Agrawal, EDBT 1996] [Mannila, Toivonen, Verkamo, DMKD 1997]
     • SPADE, Parallel SPADE [Zaki, 2001]
     • FreeSpan, PrefixSpan [Han et al, SIGKDD 2000], [Pei et al, ICDE 2001]
     • Sequential Patterns with constraints [Garofalakis et al, VLDB 99]
     • DFS-Mine [Tsoukatos and Gunopulos, SSTD 2001]

  7. The Lattice Structure
     • Lemma: All subsequences of a frequent sequence are frequent
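One common use of this lemma is candidate pruning: a (k+1)-sequence cannot be frequent if any of its k-item subsequences is known to be non-frequent. The sketch below shows that check, assuming sequences are flat strings of single items and that the set of frequent k-sequences is already known; `k_subsequences` and `can_prune` are hypothetical names.

```python
def k_subsequences(seq):
    """All subsequences obtained by deleting exactly one item from seq."""
    return {seq[:i] + seq[i + 1:] for i in range(len(seq))}


def can_prune(candidate, frequent_k):
    """True if some one-item-shorter subsequence is not frequent,
    which by the lemma rules out the candidate without a database scan."""
    return any(sub not in frequent_k for sub in k_subsequences(candidate))


# Frequent 2-sequences, invented for illustration only.
frequent_2 = {"AB", "AC", "BC", "BD"}
print(can_prune("ABC", frequent_2))  # prints False: AB, AC, BC are all frequent
print(can_prune("ABD", frequent_2))  # prints True: AD is not frequent
```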

  8. SPADE ([Zaki, 2001])
     • Lattice-based approach
     • Vertical id-list format
     • Enumerates all frequent sequences
     • Equivalence classes to decompose the problem:
       • two k-sequences belong to the same []i class if they have the same i-length prefix
       • each class fits in main memory
     • Generates a (k+1)-sequence by intersecting two k-sequences that have a common (k-1)-length prefix
     • Minimizes I/O cost with 2 database scans: frequent 1-sequences, frequent 2-sequences
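The vertical id-list idea can be sketched as follows. This is a simplified illustration, not Zaki's implementation: it assumes each event holds a single item, stores one (sid, eid) pair per occurrence, and joins a prefix with a single item rather than two k-sequences. The names `vertical_format`, `temporal_join`, and `idlist_support` are hypothetical.

```python
from collections import defaultdict


def vertical_format(database):
    """Turn {sid: [(eid, item), ...]} into {item: [(sid, eid), ...]}."""
    idlists = defaultdict(list)
    for sid, events in database.items():
        for eid, item in events:
            idlists[item].append((sid, eid))
    return idlists


def temporal_join(prefix_idlist, item_idlist):
    """Id-list of 'prefix followed later by item' within the same sequence."""
    earliest = {}
    for sid, eid in prefix_idlist:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    # Keep item occurrences that come after the earliest prefix occurrence.
    return [(sid, eid) for sid, eid in item_idlist
            if sid in earliest and eid > earliest[sid]]


def idlist_support(idlist):
    """Support is the number of distinct sequence ids in the id-list."""
    return len({sid for sid, _ in idlist})


# Toy database, invented for illustration only.
database = {
    1: [(10, "A"), (20, "B"), (30, "A")],
    2: [(10, "B"), (20, "A")],
    3: [(10, "A"), (20, "C"), (30, "B")],
}
idlists = vertical_format(database)
a_then_b = temporal_join(idlists["A"], idlists["B"])
print(a_then_b)                  # prints [(1, 20), (3, 30)]
print(idlist_support(a_then_b))  # prints 2
```

Once the id-lists are built, the support of a longer sequence comes from joining lists in memory instead of rescanning the database.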

  9. SPADE

  10. DFS_MINE
     • Depth-first-search approach
     • Fast discovery of long maximal frequent patterns
     • Uses a minimal amount of memory
     • Some frequent sequences are deduced to be frequent from the lattice
     • Candidate (k+1)-sequence: intersect a k-sequence with all frequent items (FreqItems)
     • In main memory:
       • S.Useless: set of items that sequence S must not be intersected with
       • MaxFreqList: list of maximal frequent sequences
       • MinNonFreqList: list of minimal non-frequent sequences
     • Scan the database to determine the support of candidate sequences

  11. MaxFreqList - MinNonFreqList
     [Figure: ABCDE in MaxFreqList with candidate BCD; CD in MinNonFreqList with candidate BCD]
     • Lemma: All subsequences of a frequent sequence are also frequent
     • S is inserted in MaxFreqList if:
       • S is not in MaxFreqList
       • S is not a subsequence of a sequence in MaxFreqList
       • S was scanned in the database and was found to be frequent
       • Subsequences of S in MaxFreqList are removed.
     • S is inserted in MinNonFreqList if:
       • S is not in MinNonFreqList
       • S is not a supersequence of a sequence in MinNonFreqList
       • S was scanned in the database and was found to be non-frequent
       • Supersequences of S in MinNonFreqList are removed.
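The insertion rules for the two lists can be sketched as below. This is an illustration under assumptions, not the authors' code: sequences are flat strings of single items, the lists are plain Python lists, and the caller is responsible for having scanned the database to establish whether S is frequent. The function names are hypothetical.

```python
def is_subsequence(s, t):
    """True if the items of s appear in t in the same order (gaps allowed)."""
    remaining = iter(t)
    return all(item in remaining for item in s)


def insert_max_frequent(s, max_freq):
    """Return MaxFreqList after offering a sequence s known to be frequent."""
    if any(is_subsequence(s, m) for m in max_freq):
        return max_freq  # s is already present or covered by a supersequence
    # Drop entries that s now covers, then add s.
    return [m for m in max_freq if not is_subsequence(m, s)] + [s]


def insert_min_non_frequent(s, min_non_freq):
    """Return MinNonFreqList after offering a sequence s known to be non-frequent."""
    if any(is_subsequence(m, s) for m in min_non_freq):
        return min_non_freq  # s is already present or extends a listed sequence
    # Drop supersequences of s, which are no longer minimal, then add s.
    return [m for m in min_non_freq if not is_subsequence(s, m)] + [s]


print(insert_max_frequent("ABCDE", ["BCD", "XY"]))  # prints ['XY', 'ABCDE']
print(insert_max_frequent("CD", ["XY", "ABCDE"]))   # prints ['XY', 'ABCDE']
print(insert_min_non_frequent("BC", ["ABCD"]))      # prints ['BC']
```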

  12. Examining Candidate Sequences
     • k-sequence S is intersected with all items Ij in FreqItems - S.Useless
     • Resulting sets: SET(S+Ij) for all Ij
     • For each resulting sequence:
       • check MinNonFreqList
       • check MaxFreqList
     • Scan the database for all unknown sequences (if any) in SET(S+Ij) for all Ij (1 pass)
     • Update MaxFreqList, MinNonFreqList
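The two list checks decide whether a candidate needs a database scan at all. A minimal sketch, assuming flat string sequences and plain Python lists for MaxFreqList and MinNonFreqList; the function `classify` and its return labels are hypothetical.

```python
def is_subsequence(s, t):
    """True if the items of s appear in t in the same order (gaps allowed)."""
    remaining = iter(t)
    return all(item in remaining for item in s)


def classify(candidate, max_freq, min_non_freq):
    """Decide a candidate's status from the two lists, if possible."""
    # A supersequence of a non-frequent sequence cannot be frequent.
    if any(is_subsequence(m, candidate) for m in min_non_freq):
        return "non-frequent"
    # A subsequence of a frequent sequence is frequent.
    if any(is_subsequence(candidate, m) for m in max_freq):
        return "frequent"
    # Otherwise the candidate must be counted in the database pass.
    return "unknown"


# List contents invented for illustration only.
max_freq = ["ABCDE"]
min_non_freq = ["DA"]
print(classify("BCD", max_freq, min_non_freq))  # prints frequent
print(classify("DAB", max_freq, min_non_freq))  # prints non-frequent
print(classify("ABF", max_freq, min_non_freq))  # prints unknown
```

Only the candidates classified as unknown would be collected and counted together in the single database pass.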

  13. D AAA D ADAA AAAD ADAAD ADAAD   D D Generating sequences • k-sequence S + Ijin FreqItems-S.Useless = candidate (k+1)-sequences ABCD + E 1. EABCD  2. AEBCD  3. AEBCD  4. ABECD  5. ABECD  6. ABCED  7. ABCED 8. ABCDE  9. ABCDE  ABCD + D 1. DABCD  2. ADBCD  3. ADBCD  4. ABDCD  5. ABDCD 6. ABCDD  7. ABCDD  8. ABCDD  9. ABCDD • insert item Ij in all possible positions that follow its rightmost occurrence is a k-sequence S. If the item does not occur at all in the sequence, then it is inserted in all positions.

  14. Useless Set of a sequence S
     [Figure: sequence S with SET(S+A) and SET(S+B), leading to SET(S+A,A), SET(S+A,B)=SET(S+B,A) and SET(S+B,B); sequence S with SET(S+D) and SET(S+E), leading to SET(S+D,E)]
     • After intersecting S with item Ij, Ij is inserted in S.Useless
     • When intersecting S with item Ij, all items Ik (k<j) are in S.Useless
     • S.Useless is 'inherited' by the (k+1)-sequences produced
     DAB + E: EDAB DEAB DEAB DAEB DAEB DABE DABE. Bound to be not frequent.
     Scenario 1
     AB + D: DAB ADB ADB ABD ABD. Not frequent.
     AB + E: EAB AEB AEB ABE ABE
     Scenario 2

  15. Open Problems
     • Output-subexponential algorithms for maximal sequential patterns
     • Efficient algorithms for finding episodes (approximate sequential patterns, edit distance)

  16. Spatiotemporal Datasets
     [Figure panels: Temperature Map US; Snow-ice-rain radar US; Snow-ice-rain radar NE; Precipitation radar Bay Area; Precipitation radar Lakes]

  17. Mining Spatiotemporal Data
     • CONQUEST [Stolorz et al, KDD 1995]
       • Patterns in global climate change
     • SKICAT [Fayyad et al, 1996]
       • Image processing techniques and classification techniques to identify objects in satellite pictures
     • GeoMiner [Han et al, 1997]
     • MultiMediaMiner [Zaiane et al, 1998]
       • Data Cube structure. Mining of association and classification rules.
     • DFS-Mine [Tsoukatos et al, 2001]
       • Discovery of spatiotemporal patterns

  18. Open Problems
     • Similarity models and indexing techniques for higher-dimensional time series
     • Efficient trend detection/subsequence matching algorithms
     • Algorithms to capture the data distribution when it changes over time
     • New models for capturing the evolution of spatial phenomena over time
