
  1. 國立雲林科技大學 National Yunlin University of Science and Technology • Online Data Mining for Co-Evolving Time Sequences • Advisor: Dr. Hsu • Graduate: Chien-Shing Chen • Authors: Byoung-Kee Yi, N. D. Sidiropoulos, Theodore Johnson • Proceedings of the 16th International Conference on Data Engineering (ICDE), 29 Feb.–3 March 2000

  2. Outline • Motivation • Objective • Introduction • Related Work • MUSCLES • Experimental Results • Conclusions • Personal Opinion • Review

  3. Motivation • The data of interest comprise multiple sequences, each evolving over time. • These sequences are not independent; in fact, they frequently exhibit high correlations. • We want to study the entire set of sequences as a whole, where the number of sequences in the set can be very large.

  4. Objective • We develop a fast method to analyze such co-evolving time sequences jointly: • estimation/forecasting of missing/delayed/future values • discovering correlations • quantitative data mining • outlier detection

  5. Introduction • Much useful information is lost if each sequence is analyzed individually. • Our method predicts the "current" value of a sequence, given all the past information about this sequence and all the past and current information about the other sequences. • This supports missing-value estimation, correlation detection, and a selective (scalable) variant.

  6. [Figure slide; no text recovered]

  7. Related Work • The traditional, highly successful methodology for forecasting is the so-called Box-Jenkins methodology (ARIMA). • ARIMA models a linear dependency of the future value on the past values. • DeCoste: linear regression and neural networks for multivariate time sequences; limited to outlier detection, and does not scale well for large sets of dynamically growing time sequences. • Goldin and Kanellakis: DFT coefficients, for sub-pattern matching.

  8. MUSCLES • MUSCLES (Multi-SequenCe LEast Squares) • Our proposed solution is to set up the problem as multivariate linear regression.

  9. [Figure slide; no text recovered]

  10. MUSCLES • We estimate the value of a sequence as a linear combination of the values of the same and the other time sequences within a window of size w.

  11. MUSCLES — delayed sequences • For a time sequence s = (s[1], s[2], …), the delayed sequence s(d) delays it by d time steps: s(d)[t] = s[t − d]. • Treating such delayed copies as additional variables turns the problem into multi-variate regression.

  12. ARIMA models for time series • Random walk: Y_t = Y_{t−1} + ε_t • Sample series (t := Y_t): 1 := 15.00, 2 := 16.00, 3 := 20.00, 4 := 22.00, 5 := 25.00, 6 := 23.00, 7 := 21.00, 8 := 20.00, 9 := 18.00, 10 := 16.00, 11 := 15.00, 12 := 13.00, 13 := 16.00, 14 := 17.00
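
To make the example concrete, here is a minimal sketch (Python/NumPy; not part of the original slides) that applies the random-walk model to the sample series above: the one-step-ahead forecast of Y_t is simply Y_{t−1}, and the residuals are the shocks ε_t.

```python
import numpy as np

# The sample series from the slide (t = 1..14).
Y = np.array([15.0, 16.0, 20.0, 22.0, 25.0, 23.0, 21.0,
              20.0, 18.0, 16.0, 15.0, 13.0, 16.0, 17.0])

# Under Y_t = Y_{t-1} + eps_t, the best one-step-ahead forecast
# of Y_t is simply the previous value Y_{t-1}.
forecast = Y[:-1]            # predictions for t = 2..14
residual = Y[1:] - forecast  # the realized shocks eps_t

print("forecasts:", forecast)
print("residuals:", residual)
```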

  13. Each column of the matrix X consists of sample values of the corresponding independent variable, and each row contains the observations made at one time tick t. • y is the vector of desired (dependent) values.

  14. With this setup, the optimal regression coefficients are given by the least-squares solution a = (XᵀX)⁻¹ Xᵀ y.
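
As an illustration of slides 10–14, here is a sketch (my own Python/NumPy code, not the authors') that assembles the design matrix X from windows of all sequences and solves for the coefficients. The function name, the exact feature layout, and the use of np.linalg.lstsq (numerically preferable to forming (XᵀX)⁻¹ explicitly) are all assumptions.

```python
import numpy as np

def muscles_batch(seqs, target, w):
    """Batch multivariate regression in the spirit of MUSCLES.

    seqs   : (k, N) array of k co-evolving time sequences
    target : index of the dependent sequence
    w      : window size (how many past values to use)

    Estimates seqs[target][t] from the past w values of every
    sequence plus the current values of the other sequences.
    """
    k, N = seqs.shape
    rows, ys = [], []
    for t in range(w, N):
        feats = []
        for i in range(k):
            feats.extend(seqs[i, t - w:t])   # past w values of sequence i
            if i != target:
                feats.append(seqs[i, t])     # current value of the others
        rows.append(feats)
        ys.append(seqs[target, t])
    X, y = np.asarray(rows), np.asarray(ys)
    # Least-squares solution a = (X^T X)^{-1} X^T y, computed stably.
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a
```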

  15. Although this closed form gives the best regression coefficients, it is very inefficient in terms of both storage requirements and computation time: • O(N × v) storage for the matrix X. • The number of samples N is not fixed and can grow indefinitely. • With limited main memory, the computation may require a quadratic number of disk I/O operations.

  16. The inverse matrix (XᵀX)⁻¹ (and with it the coefficient vector a) can be incrementally computed from its previous value as each new sample arrives. This method is called Recursive Least Squares (RLS); its computation cost is reduced to O(v²) per time-tick, with O(v²) storage independent of N (sketched in code after the next slide).

  17. Forgetting factor • Correlations may change over time, so formulae derived from old observed values will no longer be correct. • It turns out that MUSCLES can be slightly modified to "forget" older samples gracefully. We call the method Exponentially Forgetting MUSCLES. • Let λ (0 < λ ≤ 1) be the forgetting factor, which determines how fast the effect of older samples fades away.
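
A sketch of the recursive update with exponential forgetting, covering slides 16–17. This is the standard textbook RLS recursion, under the assumption that it matches the paper's formulation; λ = 1.0 recovers plain RLS, and the class and variable names are mine.

```python
import numpy as np

class ForgettingRLS:
    """Recursive least squares with exponential forgetting.

    Keeps O(v^2) state (P, a scaled inverse covariance), so each
    update costs O(v^2) regardless of how many samples were seen.
    lam = 1.0 gives plain RLS; lam < 1 fades out old samples.
    """
    def __init__(self, v, lam=0.98, delta=1e3):
        self.a = np.zeros(v)        # regression coefficients
        self.P = delta * np.eye(v)  # large initial "covariance"
        self.lam = lam

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        gain = self.P @ x / (self.lam + x @ self.P @ x)
        err = y - x @ self.a        # a-priori prediction error
        self.a += gain * err
        self.P = (self.P - np.outer(gain, x @ self.P)) / self.lam
        return err
```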

  18. MUSCLES — missing values • Let one value, s_i[t], be missing: make the best guess for it, given all the information available. • Then, at time t, we can immediately reconstruct the missing or delayed value, irrespective of which sequence i it belongs to.
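
A self-contained toy demonstration of the idea (synthetic data and a single-coefficient regression, purely illustrative): a value missing from one sequence is reconstructed from the correlated sequence's current value.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two correlated sequences: s2 tracks s1 plus a little noise.
s1 = np.cumsum(rng.normal(size=200))
s2 = 0.7 * s1 + rng.normal(scale=0.1, size=200)

# Fit s1[t] ~ a * s2[t] on the data observed so far (t < 150).
A = s2[:150, None]
a, *_ = np.linalg.lstsq(A, s1[:150], rcond=None)

# At t = 150, s1[150] is missing/delayed: reconstruct it from
# the other sequence's current value.
guess = a[0] * s2[150]
print(f"guess={guess:.3f}  actual={s1[150]:.3f}")
```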

  19. Correlation detection • A high absolute value for a regression coefficient means that the corresponding variable is highly correlated with the dependent variable (the current status of a sequence), and hence valuable for estimating missing values.

  20. On-line outlier detection • Assume the estimation error follows a Gaussian distribution with standard deviation σ: then 95% of the probability mass lies within 2σ of the mean, so values whose error falls outside this band are flagged as outliers. • Corrupted data and back-casting: we can treat a corrupted value as "delayed" and forecast it.
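
A sketch of this 2σ rule. The function name and threshold parameter are assumptions, and tracking the error variance with an exponentially weighted average is my own choice, echoing the forgetting idea of slide 17.

```python
import numpy as np

def is_outlier(actual, predicted, err_var, lam=0.98, z=2.0):
    """Flag a value whose prediction error falls outside ±z·σ.

    err_var is the running (exponentially weighted) variance of
    past prediction errors; 95% of a Gaussian lies within ±2σ.
    Seed err_var with a sensible prior so early values are not
    all flagged. Returns (outlier?, updated error variance).
    """
    err = actual - predicted
    sigma = np.sqrt(err_var)
    outlier = abs(err) > z * sigma
    # Fold the new squared error into the running variance.
    err_var = lam * err_var + (1.0 - lam) * err * err
    return outlier, err_var
```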

  21. Experimental set-up • Currency: exchange rates of k = 6 currencies (HKD, JPY, USD, DEM, FRF, GBP, presumably quoted against CAD), with N = 2561 daily observations for each currency. • Modem: k = 14 modem-traffic sequences, N = 1500 time-ticks. • Internet: k = 4 data streams (e.g., connect time, traffic, and errors in packets); for each stream, N = 980 observations were made.

  22. [Figure slide; no text recovered]

  23. [Figure slide; no text recovered]

  24. Correlation detection — visualization • The regression coefficient picks the single best predictor for a given sequence; high absolute values show strong correlations. • We can turn it into a dissimilarity function and apply FastMap to obtain a low-dimensionality scatter plot of our sequences.

  25. [Figure slide, presumably the FastMap scatter plot from slide 24; no text recovered]

  26. Scaling up: Selective MUSCLES • We propose to preprocess a training set to find a promising subset of sequences, and to apply MUSCLES only to those promising ones. • Given v independent variables x1, x2, …, xv and a dependent variable y, with N samples each, find the best b (< v) independent variables to minimize the mean-square error.

  27. Scaling up: Selective MUSCLES • Formally: given independent variables x1, x2, …, xv and a dependent variable y with N samples each, find the subset S of b (< v) independent variables that minimizes the expected estimation error EEE(S) for the given samples.

  28. Scaling up: Selective MUSCLES • Given a dependent variable y and independent variables with unit variance, the best single variable to keep (to minimize EEE(S)) is the one with the highest absolute correlation coefficient with y. • We propose a greedy algorithm: at each step, select the independent variable that minimizes the EEE for the dependent variable y (a sketch follows below).
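
A straightforward sketch of the greedy forward selection. This is my own simplified version: it refits the regression from scratch for each candidate, whereas the paper computes EEE incrementally for efficiency (see slide 30).

```python
import numpy as np

def greedy_select(X, y, b):
    """Greedily pick b of X's columns to minimize squared error.

    X : (N, v) matrix of candidate independent variables
    y : (N,) dependent variable
    At each step, add the variable whose inclusion yields the
    smallest residual sum of squares (the EEE of the slides).
    """
    N, v = X.shape
    chosen = []
    for _ in range(b):
        best_j, best_err = None, np.inf
        for j in range(v):
            if j in chosen:
                continue
            cols = X[:, chosen + [j]]
            coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
            err = np.sum((y - cols @ coef) ** 2)
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)
    return chosen
```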

  29. [Figure slide; no text recovered]

  30. Scaling up: Selective MUSCLES • The bottleneck of the algorithm is clearly the computation of EEE. Since it computes EEE approximately O(v · b) times, and each computation of EEE requires O(N · b²) on average, the overall complexity mounts to O(N · v · b³).

  31. Experiments: speed • How much faster is Selective MUSCLES than MUSCLES, and at what cost in accuracy? • Figure 4 shows the speed-accuracy trade-off of Selective MUSCLES. • b = 3–5 best-picked variables suffice for accurate estimation, so Selective MUSCLES is very effective.

  32. [Figure slide, presumably the speed-accuracy plots referenced as Figure 4; no text recovered]

  33. [Figure slide; no text recovered]

  34. Conclusions • We build analytical models for co-evolving time sequences that support: • discovering correlations • forecasting of missing/delayed values • tracking changing correlations among time sequences • less storage and fewer I/O operations • As future work, a more robust method called Least Median of Squares is promising.

  35. Personal Opinion • This method can be used in our lab.

  36. Review • MUSCLES • Missing values • Correlation detection • Outlier detection • Corrupted data and back-casting • Experimental results: accuracy • Correlation detection: visualization • Scaling up: Selective MUSCLES • Speed-accuracy trade-off
