Mining Sequential Patterns

Dimitrios Gunopulos, UCR

Finding Frequent Sequential Patterns
• The problem: given a sequence of discrete events that may repeat:
  A B A C D A C E B A B C …
  find patterns that repeat frequently.
• For example: A followed by B (A->B), or A followed by C (A->C)
• The patterns should occur within a window W.
• Applications in telecommunication data, networks, biology
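The slide does not define how occurrences inside the window W are counted. As a minimal sketch, assuming one possible reading (slide a window of W consecutive events over the sequence and count the windows in which A occurs before B), the check could look like this. The function name `window_frequency` and the counting rule are assumptions made for illustration.

```python
from typing import List, Tuple


def window_frequency(events: List[str], a: str, b: str, w: int) -> Tuple[int, int]:
    """Return (windows containing a followed by b, total windows).

    Assumption: a window is w consecutive events, and the pattern a->b
    holds in a window if some occurrence of b comes after some occurrence of a.
    """
    starts = range(len(events) - w + 1)
    hits = 0
    for start in starts:
        window = events[start:start + w]
        # If any a precedes a b, then the first a in the window also precedes it.
        if a in window and b in window[window.index(a) + 1:]:
            hits += 1
    return hits, len(starts)


events = "A B A C D A C E B A B C".split()
print(window_frequency(events, "A", "B", 3))  # (3, 10)
print(window_frequency(events, "A", "C", 3))  # (5, 10)
```

With W = 3, the example sequence from the slide has 10 windows; A->B holds in 3 of them and A->C in 5.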

Sequences
• Sequence: ((T=90F) → (H=60%, P=1.1atm))
  • (T=90F) occurs at time t1; (H=60%, P=1.1atm) occurs at a later time t2
  • item: an attribute with a value, e.g. T=90F
  • itemset: items that occur together, e.g. (H=60%, P=1.1atm)
• k-sequence: sequence with k items
• T1H2P1T3P2, P1T2H4P2T5: 5-sequences
• S1 is a subsequence of S2 (S1 ⊆ S2)
• T1P1T2 ⊆ H1T1P2H2P1T2 (T1H1T1 , P1T2H2P1T2)
• H1P1T2 ⊆ H1T1P2H2P1T2
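The subsequence relation can be sketched in a few lines. This assumes the usual order-preserving definition: every itemset of S1 must be contained in a distinct itemset of S2, in the same order, with gaps allowed. The helper `seq` and the list-of-frozensets representation are illustrative choices, not something the slides prescribe.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]


def seq(*events: str) -> Sequence:
    """Build a sequence; each argument is one itemset written as 'a,b,c'."""
    return [frozenset(e.split(",")) for e in events]


def is_subsequence(s1: Sequence, s2: Sequence) -> bool:
    """True if the itemsets of s1 embed, in order, into itemsets of s2."""
    pos = 0
    for itemset in s1:
        # Greedily advance to the earliest itemset of s2 that contains this one.
        while pos < len(s2) and not itemset <= s2[pos]:
            pos += 1
        if pos == len(s2):
            return False
        pos += 1  # the next itemset of s1 must match strictly later
    return True


big = seq("H1", "T1", "P2", "H2", "P1", "T2")
print(is_subsequence(seq("T1", "P1", "T2"), big))  # True
print(is_subsequence(seq("P1", "T1"), big))        # False: order matters
```

Greedy matching is sufficient here because taking the earliest possible match never rules out a later one.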

Sequential Patterns: The Problem
• support or frequency of a sequence S (σ(S)):
  • = the total number of times sequence S is encountered
• user-specified minimum threshold min_sup
• S is frequent ⇔ σ(S) ≥ min_sup
• S: maximal frequent sequence ⇔ S is frequent and all of its proper supersequences are non-frequent
• S: minimal non-frequent sequence ⇔ S is non-frequent and all of its proper subsequences are frequent
• The problem
  • Given: database D and min_sup
  • Find: all frequent sequences in D
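A brute-force sketch of the problem statement may help fix the definitions. It assumes that D is a list of sequences, that each event is a single one-character item, and that support counts the database sequences containing S (the slide's "number of times encountered" could also be read as counting occurrences). All function names are hypothetical.

```python
from typing import List


def contains(big: str, small: str) -> bool:
    """True if small is an order-preserving subsequence of big."""
    it = iter(big)
    return all(ch in it for ch in small)  # each `in` consumes the iterator


def support(db: List[str], s: str) -> int:
    """Assumed definition: number of database sequences that contain s."""
    return sum(contains(row, s) for row in db)


def frequent_sequences(db: List[str], min_sup: int) -> List[str]:
    """Level-wise search: extend each frequent k-sequence by one item."""
    items = sorted(set("".join(db)))
    level = [i for i in items if support(db, i) >= min_sup]
    frequent: List[str] = []
    while level:
        frequent.extend(level)
        level = [s + i for s in level for i in items
                 if support(db, s + i) >= min_sup]
    return frequent


def maximal(frequent: List[str]) -> List[str]:
    """Frequent sequences with no frequent proper supersequence."""
    return [s for s in frequent
            if not any(s != t and contains(t, s) for t in frequent)]


db = ["ABC", "ABD", "ACB"]
freq = frequent_sequences(db, min_sup=2)
print(freq)           # ['A', 'B', 'C', 'AB', 'AC']
print(maximal(freq))  # ['AB', 'AC']
```

Extending only frequent k-sequences loses nothing, because every frequent (k+1)-sequence has a frequent k-length prefix.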

Example Database

Algorithms for Sequential Patterns
• Apriori, GSP [Srikant, Agrawal, EDBT 1996] [Mannila, Toivonen, Verkamo, DMKD 1997]
• SPADE, Parallel SPADE [Zaki, 2001]
• FreeSpan, PrefixSpan [Han et al, SIGKDD 2000], [Pei et al, ICDE 2001]
• Sequential Patterns with constraints [Garofalakis et al, VLDB 99]
• DFS-Mine [Tsoukatos and Gunopulos, SSTD 2001]

The Lattice Structure
• Lemma: All subsequences of a frequent sequence are frequent
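Read in the contrapositive, the lemma says that a sequence with a non-frequent subsequence cannot be frequent, which lets a miner discard candidates without touching the database. A minimal sketch, assuming single-item events written as characters and hypothetical helper names:

```python
from typing import Set


def one_shorter(s: str) -> Set[str]:
    """All subsequences of s obtained by deleting exactly one item."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}


def survives_pruning(candidate: str, frequent_k: Set[str]) -> bool:
    """A (k+1)-candidate can be frequent only if all its k-subsequences are."""
    return one_shorter(candidate) <= frequent_k


frequent_2 = {"AB", "AC", "BC"}
print(survives_pruning("ABC", frequent_2))  # True: AB, AC, BC are all frequent
print(survives_pruning("ABD", frequent_2))  # False: AD and BD are not
```

A candidate that survives still has to be counted; the check only rules sequences out.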

SPADE ([Zaki, 2001])
• Lattice-based approach
• vertical id-list format
• enumerates all frequent sequences
• equivalence classes to decompose the problem:
  • two k-sequences belong to the same []i class if they have the same i-length prefix
  • each class fits in main memory
• generates a (k+1)-sequence by intersecting the id-lists of two k-sequences that have a common (k-1)-length prefix
• minimizes I/O cost: 2 database scans
  • frequent 1-sequences, frequent 2-sequences
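The id-list idea can be illustrated with a small sketch. It assumes an id-list is a list of (sequence id, event id) pairs recording where a pattern's last item occurs, and that every event holds a single item. The names `vertical`, `temporal_join`, and `support` are chosen for illustration and are not taken from the SPADE paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

IdList = List[Tuple[int, int]]  # (sequence id, event id of the last item)


def vertical(db: Dict[int, str]) -> Dict[str, IdList]:
    """Convert a horizontal database into one id-list per item."""
    idlists: Dict[str, IdList] = defaultdict(list)
    for sid, events in db.items():
        for eid, item in enumerate(events):
            idlists[item].append((sid, eid))
    return idlists


def temporal_join(left: IdList, right: IdList) -> IdList:
    """Keep the entries of `right` that occur after some entry of `left`
    in the same sequence."""
    earliest: Dict[int, int] = {}
    for sid, eid in left:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return [(sid, eid) for sid, eid in right
            if sid in earliest and eid > earliest[sid]]


def support(idlist: IdList) -> int:
    """Number of distinct sequences in which the pattern occurs."""
    return len({sid for sid, _ in idlist})


db = {1: "ABAC", 2: "BAC", 3: "CAB"}
ids = vertical(db)
ab = temporal_join(ids["A"], ids["B"])  # id-list of A->B
ac = temporal_join(ids["A"], ids["C"])  # id-list of A->C
abc = temporal_join(ab, ac)             # A->B and A->C share the prefix A
print(support(ab), support(ac), support(abc))  # 2 2 1
```

After the id-lists are built, supports are computed from the lists alone, which is what keeps the number of database scans small.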

SPADE

DFS_MINE
• Depth-First-Search approach
• fast discovery of long maximal frequent patterns
• uses a minimal amount of memory
• some sequences are deduced to be frequent from the lattice
• candidate (k+1)-sequence: intersect a k-sequence with all frequent items (FreqItems)
• in main memory:
  • S.Useless: set of items that sequence S must not be intersected with
  • MaxFreqList: list of maximal frequent sequences
  • MinNonFreqList: list of minimal non-frequent sequences
• scan the database to determine the support of candidate sequences

MaxFreqList - MinNonFreqList
• Lemma: All subsequences of a frequent sequence are also frequent
• Example: ABCDE is in MaxFreqList, so candidate BCD is known to be frequent
• Example: CD is in MinNonFreqList, so candidate BCD is known to be non-frequent
• S is inserted in MaxFreqList if:
  • S is not in MaxFreqList
  • S is not a subsequence of a sequence in MaxFreqList
  • S was scanned in the database and was found to be frequent
  • Subsequences of S in MaxFreqList are removed.
• S is inserted in MinNonFreqList if:
  • S is not in MinNonFreqList
  • S is not a supersequence of a sequence in MinNonFreqList
  • S was scanned in the database and was found to be non-frequent
  • Supersequences of S in MinNonFreqList are removed.
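The insertion rules for the two lists can be sketched as follows, assuming single-item events written as characters and assuming the caller has already scanned the database for S. The function names are hypothetical; only the rules come from the slide.

```python
from typing import List


def contains(big: str, small: str) -> bool:
    """True if small is an order-preserving subsequence of big (or equal)."""
    it = iter(big)
    return all(ch in it for ch in small)


def insert_max_frequent(max_freq: List[str], s: str) -> None:
    """s was found frequent by a scan; keep only maximal entries."""
    if any(contains(m, s) for m in max_freq):
        return  # s is already in the list or covered by a longer entry
    max_freq[:] = [m for m in max_freq if not contains(s, m)]
    max_freq.append(s)


def insert_min_non_frequent(min_non_freq: List[str], s: str) -> None:
    """s was found non-frequent by a scan; keep only minimal entries."""
    if any(contains(s, m) for m in min_non_freq):
        return  # s is already in the list or contains a shorter entry
    min_non_freq[:] = [m for m in min_non_freq if not contains(m, s)]
    min_non_freq.append(s)


max_freq = ["BCD"]
insert_max_frequent(max_freq, "ABCDE")
print(max_freq)      # ['ABCDE']: BCD is a subsequence and was removed

min_non_freq = ["ACE"]
insert_min_non_frequent(min_non_freq, "CE")
print(min_non_freq)  # ['CE']: ACE is a supersequence and was removed
```

Both lists stay small because each new entry replaces everything it makes redundant.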

Examining Candidate Sequences
• k-sequence S is intersected with all items Ij in FreqItems - S.Useless
• resulting sets SET(S+Ij) for all Ij
• for each candidate sequence:
  • check MinNonFreqList
  • check MaxFreqList
• scan the database for all unknown sequences (if any) in SET(S+Ij) for all Ij (1 pass)
• update MaxFreqList, MinNonFreqList
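The checks above amount to classifying each candidate from the two lists and counting only the ones that remain unknown. The sketch below assumes single-item events, assumes support is the number of database sequences containing the candidate, and uses hypothetical names `classify` and `scan_unknown`.

```python
from typing import Dict, List


def contains(big: str, small: str) -> bool:
    """True if small is an order-preserving subsequence of big (or equal)."""
    it = iter(big)
    return all(ch in it for ch in small)


def classify(candidate: str, max_freq: List[str], min_non_freq: List[str]) -> str:
    """Decide a candidate's status from the lists alone, when possible."""
    if any(contains(candidate, m) for m in min_non_freq):
        return "non-frequent"  # it contains a known non-frequent sequence
    if any(contains(m, candidate) for m in max_freq):
        return "frequent"      # it is part of a known frequent sequence
    return "unknown"


def scan_unknown(candidates: List[str], db: List[str],
                 max_freq: List[str], min_non_freq: List[str]) -> Dict[str, int]:
    """Count, in a single pass over db, only the candidates of unknown status."""
    unknown = [c for c in candidates
               if classify(c, max_freq, min_non_freq) == "unknown"]
    counts = {c: 0 for c in unknown}
    for row in db:  # one pass over the database
        for c in unknown:
            if contains(row, c):
                counts[c] += 1
    return counts


print(classify("BCD", ["ABCDE"], []))  # frequent
print(classify("BCD", [], ["CD"]))     # non-frequent
print(scan_unknown(["BCD", "BX"], ["ABCDE", "BXD"], ["ABCDE"], []))  # {'BX': 1}
```

The counts returned by the scan are what would then drive the updates to MaxFreqList and MinNonFreqList.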

D AAA D ADAA AAAD ADAAD ADAAD   D D Generating sequences • k-sequence S + Ijin FreqItems-S.Useless = candidate (k+1)-sequences ABCD + E 1. EABCD  2. AEBCD  3. AEBCD  4. ABECD  5. ABECD  6. ABCED  7. ABCED 8. ABCDE  9. ABCDE  ABCD + D 1. DABCD  2. ADBCD  3. ADBCD  4. ABDCD  5. ABDCD 6. ABCDD  7. ABCDD  8. ABCDD  9. ABCDD • insert item Ij in all possible positions that follow its rightmost occurrence is a k-sequence S. If the item does not occur at all in the sequence, then it is inserted in all positions.

Useless Set of a sequence S
• (Figure: two lattices rooted at Sequence S. First, with items A and B: SET(S+A), SET(S+B), SET(S+A,A), SET(S+A,B)=SET(S+B,A), SET(S+B,B). Second, with items D and E: SET(S+D), SET(S+E), SET(S+D,E).)
• after intersecting S with item Ij, Ij is inserted in S.Useless
• when intersecting S with item Ij, all items Ik (k<j) are in S.Useless
• S.Useless is ‘inherited’ by the (k+1)-sequences produced
• Scenarios 1 and 2 (figure):
  • DAB + E: EDAB DEAB DEAB DAEB DAEB DABE DABE (bound to be not frequent)
  • AB + D: DAB ADB ADB ABD ABD (not frequent)
  • AB + E: EAB AEB AEB ABE ABE
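The bookkeeping for S.Useless can be sketched as a depth-first enumeration with no support counting at all. This is only one reading of the bullets: it assumes that a child produced from S + Ij inherits S.Useless as it stood when Ij was tried (the items tried before Ij), and that events are single items. The names `insertions` and `enumerate_dfs` are hypothetical.

```python
from typing import List, Set


def insertions(s: str, item: str) -> List[str]:
    """Insert item at every position after its rightmost occurrence in s."""
    start = s.rfind(item) + 1
    return [s[:p] + item + s[p:] for p in range(start, len(s) + 1)]


def enumerate_dfs(freq_items: str, max_len: int) -> List[str]:
    """Visit sequences depth-first, skipping items in the Useless set."""
    visited: List[str] = []

    def expand(s: str, inherited: Set[str]) -> None:
        useless = set(inherited)  # private copy of the inherited set
        for item in freq_items:
            if item in useless:
                continue
            for child in insertions(s, item):
                visited.append(child)
                if len(child) < max_len:
                    expand(child, useless)  # child inherits items tried so far
            useless.add(item)  # item becomes useless for s once it was tried

    expand("", set())
    return visited


seqs = enumerate_dfs("AB", 3)
print(len(seqs), len(set(seqs)))  # 14 14
```

On this small input every sequence over {A, B} of length 1 to 3 is visited exactly once, which suggests why items already tried can be skipped: the sequences they would produce are reached through an earlier branch.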

Open Problems
• Output-subexponential algorithms for maximal sequential patterns
• Efficient algorithms for finding episodes (approximate sequential patterns, edit distance)

Spatiotemporal Datasets
• (Figure panels: Temperature Map US; Snow-ice-rain radar US; Snow-ice-rain radar NE; Precipitation radar Bay Area; Precipitation radar Lakes)

Mining Spatiotemporal Data
• CONQUEST [Stolorz et al, KDD 1995]
  • Patterns in global climate change
• SKICAT [Fayyad et al, 1996]
  • Image processing techniques and classification techniques to identify objects in satellite pictures
• GeoMiner [Han et al, 1997]
• MultiMediaMiner [Zaiane et al, 1998]
  • Data Cube structure. Mining of association and classification rules.
• DFS-Mine [Tsoukatos et al, 2001]
  • Discovery of spatiotemporal patterns

Open Problems
• Similarity models and indexing techniques for higher-dimensional time series
• Efficient trend detection/subsequence matching algorithms
• Algorithms to capture the data distribution when it changes over time
• New models for capturing the evolution of spatial phenomena over time
