
Robust Space Transformations for Distance-based Operations in KDD

This seminar outlines the motivation, objective, and methodology of proposing a robust space transformation method called the Donoho-Stahel Estimator (DSE) for distance-based operations in Knowledge Discovery in Databases (KDD). The DSE aims to address issues related to scaling, variability, outliers, and correlation in data spaces. Key properties, algorithms, and the experimental evaluation of the DSE are discussed, highlighting its stability and efficiency. The paper also introduces the k-D Subsampling and k-D Randomized Algorithms for high-dimensional databases. Overall, the study aims to enhance the accuracy and reliability of distance-based operations through innovative space transformations.

Presentation Transcript


  1. Robust Space Transformations for Distance-based Operations • Advisor: Dr. Hsu • Graduate: Min-Hong Lin • IDSL seminar

  2. Outline • Motivation • Objective • Introduction • Donoho-Stahel Estimator • Key Properties of The DSE • k-D Subsampling Algorithm • k-D Randomized Algorithm • Experimental Evaluation • Conclusions • Personal Opinion

  3. Motivation • For many KDD operations, there is an underlying k-D data space • Each tuple is represented as a point in the space • We may get unintuitive results if an inappropriate space is used, particularly in the presence of differing scales, variability, correlation, and outliers

  4. Objective • Propose a robust space transformation • Develop algorithms and evaluate how well they perform empirically

  5. Introduction • Example 1 (Nearest Neighbor Search) • Systolic blood pressure (100-160 mm) • Body temperature (37 degrees Celsius) • Age (20-50 years) • The results are dominated by blood pressure • Consider the query point (120, 37, 35) • Using Euclidean distance, the point (120, 40, 35) is nearer to the query point than (130, 37, 35) is
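The distances in Example 1 can be checked directly. The sketch below uses only the three points from the slide; the variable names are illustrative. It shows that a 3-degree difference in temperature counts for less than a 10 mm difference in blood pressure simply because of the attribute scales.

```python
import math

def euclidean(p, q):
    """Plain Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Attribute order: (systolic blood pressure, body temperature, age)
query = (120, 37, 35)
hotter = (120, 40, 35)      # same pressure and age, temperature 3 degrees higher
higher_bp = (130, 37, 35)   # same temperature and age, pressure 10 mm higher

d_hotter = euclidean(query, hotter)
d_higher_bp = euclidean(query, higher_bp)
print(d_hotter, d_higher_bp)  # 3.0 10.0
```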

  6. Fixes to the Problem • Weighting • Normalization • Standardization • But these transformations do not take into account • Outliers • Correlation between attributes

  7. Contributions of This Paper • Propose a robust space transformation called the Donoho-Stahel estimator (DSE) • Propose a version of the DSE for high-dimensional databases (subsampling) • Develop a new algorithm (the Hybrid-random algorithm) for computing the DSE efficiently

  8. Related Work • Space transformations are from the class of distance-preserving transformations • Principal component analysis (PCA) is useful for data reduction • Many clustering algorithms are distance-based or density-based • Outlier detection has received considerable attention in recent years

  9. Donoho-Stahel Estimator • The DSE is a robust multivariate estimator of location and scatter • It is an "outlyingness-weighted" mean and covariance • It also possesses statistical properties such as affine equivariance • The DSE is easier to compute, scales better, and has much less bias

  10. The DSE Fixed-angle Algorithm for 2-D

  11. The DSE Fixed-angle Algorithm for 2-D • Robustness is achieved by first identifying outlying points, and then downweighting their influence • Step 1 computes the degree of outlyingness of the point with respect to the angle θ • Step 2 computes the maximum degree of outlyingness over all possible θ's • In step 3, if this maximum degree for a point is too high, the influence of this point is weakened • Finally, the location center and the covariance matrix are computed
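The steps on this slide can be sketched in plain Python. This is a minimal illustration, not the algorithm as printed on the slide (the algorithm figure is not part of this transcript). The outlyingness measure (absolute deviation from the median divided by the MAD), the weight function, the `cutoff` value, and all names are assumptions.

```python
import math
from statistics import median

def dse_fixed_angle_2d(points, a=180, cutoff=2.5):
    """Sketch of a fixed-angle outlyingness-weighted mean and covariance.

    a      -- number of equally spaced angles in [0, pi) (assumed parameter)
    cutoff -- outlyingness above which a point is downweighted (assumed value)
    """
    outlyingness = [0.0] * len(points)
    for j in range(a):
        theta = math.pi * j / a
        ux, uy = math.cos(theta), math.sin(theta)
        proj = [x * ux + y * uy for x, y in points]
        med = median(proj)
        scale = median([abs(p - med) for p in proj])  # MAD of the projections
        if scale == 0:
            continue  # degenerate direction, skipped in this sketch
        for i, p in enumerate(proj):
            # Step 1: outlyingness for this angle; step 2: running maximum
            outlyingness[i] = max(outlyingness[i], abs(p - med) / scale)

    # Step 3: full weight up to the cutoff, quadratic decay beyond it
    weights = [1.0 if r <= cutoff else (cutoff / r) ** 2 for r in outlyingness]

    # Final step: weighted location and weighted covariance
    wsum = sum(weights)
    cx = sum(w * x for w, (x, _) in zip(weights, points)) / wsum
    cy = sum(w * y for w, (_, y) in zip(weights, points)) / wsum
    sxx = sum(w * (x - cx) ** 2 for w, (x, _) in zip(weights, points)) / wsum
    syy = sum(w * (y - cy) ** 2 for w, (_, y) in zip(weights, points)) / wsum
    sxy = sum(w * (x - cx) * (y - cy) for w, (x, y) in zip(weights, points)) / wsum
    return (cx, cy), [[sxx, sxy], [sxy, syy]], weights

# Hypothetical data: a 5x5 grid around the origin plus one gross outlier
grid = [(float(x), float(y)) for x in range(-2, 3) for y in range(-2, 3)]
data = grid + [(100.0, 100.0)]
center, cov, weights = dse_fixed_angle_2d(data)
print(center, weights[-1])
```

On this toy data the outlier receives a weight close to zero, so the estimated center stays near the origin, whereas the ordinary mean would be pulled to roughly (3.8, 3.8).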

  12. Key Properties of The DSE • Euclidean property • Stability property

  13. Euclidean Property • It says that the Euclidean distance function, while inappropriate in the original space, becomes reasonable in the DSE-transformed space
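One way to read this property, assuming the transformation is a whitening step built from the DSE location $\mu$ and scatter matrix $\Sigma$ (both symbols are assumptions, since the slide does not give a formula): Euclidean distance between transformed points equals a scatter-adjusted distance between the original points, which accounts for both scale and correlation.

```latex
y = \Sigma^{-1/2}\,(x - \mu),
\qquad
\lVert y_i - y_j \rVert_2^{2} = (x_i - x_j)^{\top}\,\Sigma^{-1}\,(x_i - x_j)
```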

  14. Stability Property • It says that, in spite of frequent updates, the estimator does not: • Change much • Lose its usefulness • Require re-computation

  15. Stability Experiment • Outlier detection • Used the old estimator to transform the space for D_new and then found all the outliers in D_new • Used the updated estimator to transform the space for D_new and then found all the outliers in D_new • Used standard precision and recall to measure the difference between the two sets of detected outliers
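The comparison of the two outlier sets can be sketched as follows. Treating the outliers found with the updated estimator as the reference set is an assumption, and the tuple IDs are hypothetical.

```python
def precision_recall(reference, detected):
    """Precision and recall of `detected` measured against `reference`."""
    reference, detected = set(reference), set(detected)
    hits = len(reference & detected)
    precision = hits / len(detected) if detected else 1.0
    recall = hits / len(reference) if reference else 1.0
    return precision, recall

# Hypothetical tuple IDs flagged as outliers under each transformation
outliers_updated = {3, 17, 42, 58}   # updated estimator (assumed reference)
outliers_old = {3, 17, 42, 99}       # old estimator

precision, recall = precision_recall(outliers_updated, outliers_old)
print(precision, recall)  # 0.75 0.75
```

Values near 1.0 for both measures would indicate that the old estimator still finds almost the same outliers, which is what the stability property claims.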

  16. Results • The DSE transformation is stable

  17. Complexity of the Fixed-angle Algorithm • How can the DSE be computed efficiently for k > 2 dimensions? • Complexity of the Fixed-angle Algorithm • $O(a^{k-1}kN + k^2N) = O(a^{k-1}kN)$ • Impractical for large values of a and k • There is an exhaustive enumeration of θ's

  18. Projections of Points onto a Line

  19. Intuition Behind The Subsampling Algorithm • Using orthogonal lines increases the chance of detecting outliers • Because many outliers are likely to stand out after the orthogonal projection • In applying subsampling, our goal is to use lines orthogonal to the axes of an ellipse (or ellipsoid in k-D) • However, the parameters of the ellipsoid are unknown • There will be too many points and dimensions

  20. k-D Subsampling Algorithm

  21. Details of the Subsampling Algorithm • How to compute m? • A subsample is good if all k-1 points are within the ellipsoid • Let ε be the fraction of points outside the ellipsoid (typically 0.01 to 0.5) • m can be the smallest number of subsamples such that there is at least a 95% probability of getting at least one good subsample out of the m subsamples • We can determine a base value of m by solving the following inequality • $1 - \left(1 - (1-\varepsilon)^{k-1}\right)^{m} \ge 0.95$
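The inequality on this slide can be solved for m in closed form: a single subsample of k-1 points is good with probability (1 - eps)^(k-1), where eps stands for the fraction of points outside the ellipsoid, so m must be at least log(0.05) / log(1 - (1 - eps)^(k-1)). The sketch below computes that base value; the function and parameter names are illustrative.

```python
import math

def base_subsamples(k, eps, confidence=0.95):
    """Smallest m with 1 - (1 - (1 - eps)**(k - 1))**m >= confidence."""
    if not 0.0 <= eps < 1.0:
        raise ValueError("eps must be in [0, 1)")
    p_good = (1.0 - eps) ** (k - 1)   # chance that one subsample is good
    if p_good >= 1.0:
        return 1                      # every subsample is good when eps == 0
    # log1p(-p_good) equals log(1 - p_good), computed accurately for small p_good
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_good))

m = base_subsamples(k=5, eps=0.5)
print(m)  # 47
```

With k = 5 and eps = 0.5 a subsample is good with probability 0.5^4 = 0.0625, and 47 subsamples are needed to reach the 95% level. The required m grows quickly with k because the probability of a good subsample shrinks geometrically.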

  22. Complexity of the Subsampling Algorithm • $O(mk^3 + k^2N)$ • The $mk^3$ complexity factor is still costly when the number of subsamples m is large (i.e., for a high-quality estimator) • How can the k-D DSE be computed more efficiently?

  23. k-D Randomized Algorithms • Pure-random Algorithm • Hybrid-random Algorithm • Combines part of the Pure-random algorithm with part of the Subsampling algorithm

  24. Pure-random Algorithm • The Fixed-angle algorithm needs to examine $a^{k-1}$ projection unit vectors • It is possible to randomly select r projections to examine instead • $O(rkN + k^2N) = O(rkN)$
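A minimal sketch of the idea, assuming the r directions are drawn uniformly on the unit sphere and that outlyingness is measured as the absolute deviation from the median divided by the MAD (both are assumptions, as are all names). Note that this sketch finds medians by sorting, so it does not meet the slide's O(rkN) bound exactly.

```python
import math
import random
from statistics import median

def random_unit_vector(k, rng):
    """Uniform random direction in k-D via normalised Gaussian draws."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(k)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0:
            return [c / norm for c in v]

def pure_random_outlyingness(points, r, seed=0):
    """Maximum projected outlyingness of each point over r random directions."""
    rng = random.Random(seed)
    k = len(points[0])
    out = [0.0] * len(points)
    for _ in range(r):
        u = random_unit_vector(k, rng)
        proj = [sum(c * d for c, d in zip(p, u)) for p in points]
        med = median(proj)
        scale = median([abs(v - med) for v in proj])  # MAD of the projections
        if scale == 0:
            continue  # degenerate direction, skipped in this sketch
        for i, v in enumerate(proj):
            out[i] = max(out[i], abs(v - med) / scale)
    return out

# Hypothetical 3-D data: a 3x3x3 grid plus one gross outlier
cube = [(x, y, z) for x in (-1, 0, 1) for y in (-1, 0, 1) for z in (-1, 0, 1)]
scores = pure_random_outlyingness(cube + [(50, 50, 50)], r=20)
print(max(scores) == scores[-1])
```

The resulting outlyingness values would then be turned into weights and a weighted mean and covariance, in the same way as in the Fixed-angle algorithm.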

  25. Hybrid-random Algorithm • The Pure-random algorithm probes the k-D space blindly • The value of r may need to be high for acceptable quality • Can the random draws of projection vectors be done more intelligently?

  26. IDSL

  27. Hybrid-random Algorithm • Steps 1 to 3 use the Subsampling algorithm to find some initial projection vectors and keep them in S • In each iteration of step 4, a new random projection vector is generated in such a way that it stays clear of existing projection vectors • $O(mk^3 + rkN)$
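The slide does not say how a new vector "stays clear" of the vectors already in S. One plausible reading, sketched below purely as an assumption, is rejection sampling: draw random unit vectors and discard any whose angle to an existing vector is too small. The threshold `max_cos`, the retry limit, and the function names are hypothetical.

```python
import math
import random

def random_unit_vector(k, rng):
    """Uniform random direction in k-D via normalised Gaussian draws."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(k)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0:
            return [c / norm for c in v]

def draw_clear_vector(existing, k, rng, max_cos=0.9, max_tries=1000):
    """Draw a unit vector whose |cosine| to every vector in `existing` is <= max_cos.

    The absolute value is used because u and -u define the same projection line.
    """
    for _ in range(max_tries):
        u = random_unit_vector(k, rng)
        if all(abs(sum(a * b for a, b in zip(u, s))) <= max_cos for s in existing):
            return u
    raise RuntimeError("no sufficiently distinct direction found")

# S starts with two hypothetical vectors standing in for the subsampling output
rng = random.Random(0)
S = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
for _ in range(5):
    S.append(draw_clear_vector(S, k=3, rng=rng))
print(len(S))  # 7
```

Each accepted vector is added to S before the next draw, so later directions avoid both the subsampling vectors and the earlier random ones.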

  28. Experimental Evaluation • Experimental Setup • Distance-based outlier detection operation • Use precision and recall to compare the results • An 855-record dataset consisting of 1995-96 National Hockey League (NHL) player performance statistics • We created synthetic datasets containing up to 100,100 tuples whose distribution mirrored that of the base dataset • Attributes are goals, assists, penalty minutes, shots on goal, and games played

  29. Usefulness of Donoho-Stahel Transformation • The range for penalty-minutes was [0,335], and the range for goals-scored was [0,69]

  30. Internal Parameters of the Algorithms

  31. Comparison

  32. Scalability in Dimensionality and Dataset Size

  33. Conclusions • Many types of distance-based KDD operations tend to be less meaningful when no attention is paid to scale, variability, correlation, and outliers in the underlying data • An appropriate space is one that • Preserves the Euclidean property • Is stable under updates • The Hybrid-random algorithm is an effective and efficient algorithm for computing a robust estimator

  34. Personal Opinion • Data preprocessing can help improve the quality of the data and the mining results • Space transformation may improve the accuracy and efficiency of mining algorithms involving distance measurements
