
Data Mining over Hidden Data Sources

This article explores the concept of the Deep Web and focuses on data mining techniques applied to hidden data sources. It discusses several contributions, including stratified k-means clustering, two-phase sampling based outlier detection, and hierarchical clustering over Deep Web data sources.


Presentation Transcript


  1. Data Mining over Hidden Data Sources • Tantan Liu • Dept. of Computer Science & Engineering, Ohio State University • July 23, 2012

  2. Outline • Introduction • Deep Web • Data Mining on the deep web • Contributions • Stratified K-means Clustering Over A Deep Web Data Source (SIGKDD, 2012) • Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source (Submitted to ICDM 2012) • Stratification Based Hierarchical Clustering on a Deep Web Data Source (SDM, 2012) • An Active Learning Based Frequent Itemset Mining (ICDE, 2011) • Differential Rule Mining (ICDM Workshops, 2010) • Stratified Sampling for Deep Web Mining (ICDM, 2010) • Conclusion and Future work

  3. Deep Web • Data sources hidden behind online query interfaces and invisible to standard search engine crawlers • Online query interface vs. database • The database is accessible only through the online interface • Input attributes vs. output attributes • An example of the Deep Web
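
To make this access model concrete, here is a minimal sketch in Python of a hidden data source behind a query interface; the attribute names and records are hypothetical, chosen to match the real-estate example used later.

```python
# Sketch of the deep web access model: the full table is hidden, and records
# can only be retrieved by querying on the input attributes.
from typing import Dict, List

# Hidden database: not directly accessible to the miner.
_HIDDEN_DB: List[Dict] = [
    {"bedrooms": 3, "year": 1995, "price": 210_000, "sqft": 1_400},
    {"bedrooms": 4, "year": 2005, "price": 340_000, "sqft": 2_100},
    {"bedrooms": 3, "year": 2010, "price": 295_000, "sqft": 1_650},
]

def query(bedrooms: int) -> List[Dict]:
    """Online interface: an input attribute value is submitted, and matching
    records with output attributes (price, sqft) are returned. Each call has
    a real cost (network latency, rate limits)."""
    return [r for r in _HIDDEN_DB if r["bedrooms"] == bedrooms]

print(query(3))  # the only way to observe the hidden data
```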

  4. Data Mining over the Deep Web • Goal: a high-level summary of the data • Scenario 1: a user wants to relocate to a county • What is the summary of the residences of the county? • Age, price, square footage • The county property assessor's web-site only allows simple queries • Scenario 2: a user is thinking about his or her career path • High-level knowledge about the job posts in the market • Job type, salary, education, experience, skills, … • Job web-sites, e.g., LinkedIn and MSN Careers, provide millions of job posts

  5. Challenges • Databases cannot be accessed directly • A sampling method is needed for deep web mining • Obtaining data is time-consuming • An efficient sampling method is needed • High accuracy with low sampling cost

  6. Contributions • Stratified K-means Clustering Over A Deep Web Data Source (SIGKDD, 2012) • Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source (submitted to ICDM, 2012) • Stratification Based Hierarchical Clustering on a Deep Web Data Source (SDM, 2012) • An Active Learning Based Frequent Itemset Mining (ICDE, 2011) • Differential Rule Mining (ICDM Workshops, 2010) • Stratified Sampling for Deep Web Mining (ICDM, 2010)

  7. Roadmap • Introduction • Deep Web • Data Mining on the deep web • Contributions • Stratified K-means Clustering Over A Deep Web Data Source • Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source • Stratification Based Hierarchical Clustering on a Deep Web Data Source • An Active Learning Based Frequent Itemset Mining • Differential Rule Mining • Stratified Sampling for Deep Web Mining • Conclusion and Future work

  8. An Example of Deep Web for Real-Estate

  9. K-means Clustering over a Deep Web Data Source • Goal: estimating the k centers of the underlying clusters, so that the k centers estimated from the sample are close to the k true centers in the whole population

  10. Overview of Method • Stratification • Sample Allocation

  11. Stratification on the Deep Web • Partitioning the entire population into strata • Stratification is performed on the query space of the input attributes • Goal: homogeneous query subspaces • Radius of a query subspace: the average distance between the records in the subspace and its center on the output attributes • Rule: choose the input attribute that most decreases the radius of a node • For an input attribute, the decrease of radius is the node's radius minus the size-weighted average radius of the child subspaces created by splitting on that attribute
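
One plausible formalization of this splitting rule, assuming the radius of a subspace is the mean distance of its records to the subspace centroid (the slides omit the exact formula):

```latex
\[
  r(Q) \;=\; \frac{1}{|Q|} \sum_{x \in Q} \lVert x - \mu_Q \rVert ,
  \qquad
  \Delta r(A) \;=\; r(Q) \;-\; \sum_{v \in \mathrm{dom}(A)} \frac{|Q_v|}{|Q|}\, r(Q_v),
\]
where $Q_v$ is the child subspace obtained by fixing input attribute $A = v$;
the attribute with the largest $\Delta r(A)$ is chosen for splitting.
```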

  12. Partition on Space of Output Attributes

  13. Sampling Methods • We have created c*k partitions and c*k subspaces • A pilot sample is drawn first • c*k-means clustering generates the c*k partitions • Representative sampling • Goal: good estimates of the statistics of the c*k subspaces • Centers • Proportions
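
A minimal sketch of the pilot step in Python, assuming scikit-learn is available; `pilot_sample` stands in for records already retrieved from the deep web source, restricted to the output attributes.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pilot_sample = rng.normal(size=(200, 2))   # placeholder for real pilot data

c, k = 3, 4                                # c*k partitions, as on the slide
km = KMeans(n_clusters=c * k, n_init=10, random_state=0).fit(pilot_sample)

# Statistics of the c*k subspaces, estimated from the pilot sample:
subspace_centers = km.cluster_centers_
subspace_props = np.bincount(km.labels_, minlength=c * k) / len(pilot_sample)
```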

  14. Representative Sampling: Centers • Center of a subspace: the mean vector of all data points belonging to the subspace • Let the sample be S = {DR1, DR2, …, DRn} • For the i-th subspace, the center is estimated by the mean vector of the sampled records that fall in that subspace

  15. Distance Function • Measures the distance between the c*k estimated centers and the true centers • Using Euclidean distance • Integrated variance • Defined in terms of subspace, stratum, and output attribute • Computed based on the pilot sample • n_j: the number of samples drawn from the j-th stratum
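
A sketch of what an integrated variance of this shape typically looks like, written in standard stratified-sampling form; this is an assumption consistent with the slide's description, not the paper's exact notation:

```latex
\[
  V \;=\; \sum_{i=1}^{ck} \sum_{a} \sum_{j=1}^{m}
          \frac{W_j^2\, \sigma_{i,a,j}^2}{n_j},
\]
where $i$ ranges over subspaces, $a$ over output attributes, and $j$ over strata;
$W_j$ is the weight of the $j$-th stratum, $\sigma_{i,a,j}^2$ is the within-stratum
variance estimated from the pilot sample, and $n_j$ is the number of samples
drawn from the $j$-th stratum.
```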

  16. Optimized Sampling Allocation • Goal: minimize the integrated variance under a fixed total sample size • Solved using Lagrange multipliers • Strata with large variance are sampled more heavily • Their data are spread over a wide area, so more data are needed to represent the population
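
The slide's objective and Lagrange-multiplier step match the classical Neyman allocation; under the integrated-variance form sketched above, the derivation is:

```latex
\[
  \min_{n_1,\dots,n_m} \; \sum_{j=1}^{m} \frac{W_j^2 \sigma_j^2}{n_j}
  \qquad \text{s.t.} \qquad \sum_{j=1}^{m} n_j = n .
\]
Setting $\partial/\partial n_j$ of the Lagrangian
$\sum_j W_j^2\sigma_j^2/n_j + \lambda\,(\sum_j n_j - n)$ to zero gives
$n_j = W_j\sigma_j/\sqrt{\lambda}$, and normalizing yields
\[
  n_j \;=\; n \cdot \frac{W_j \sigma_j}{\sum_{l=1}^{m} W_l \sigma_l},
\]
so strata with larger weighted variance receive more samples, as stated on the slide.
```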

  17. Active Learning Based Sampling Method • In machine learning • Passive learning: data are chosen randomly • Active learning: specific data are selected to help build a better model • Useful when obtaining data is costly and/or time-consuming • For each stratum i, the estimated decrease of the distance function from sampling that stratum is computed • Iterative sampling process • At each iteration, the stratum with the largest decrease of the distance function is selected for sampling • The integrated variance is then updated
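
A minimal sketch of this iterative allocation loop in Python. The decrease criterion here is a stand-in (the drop in W_j^2 σ_j^2 / n_j from one extra sample); the paper's exact criterion may differ.

```python
import numpy as np

def active_allocation(weights, sigmas, budget, n_init=2):
    """Greedily assign `budget` samples to the strata, one at a time."""
    n = np.full(len(weights), n_init, dtype=int)   # pilot samples per stratum
    for _ in range(budget):
        cur = weights**2 * sigmas**2 / n
        nxt = weights**2 * sigmas**2 / (n + 1)
        j = int(np.argmax(cur - nxt))              # largest estimated decrease
        n[j] += 1
        # In the real method, sigmas[j] would be re-estimated from the newly
        # drawn sample here; this sketch keeps them fixed.
    return n

print(active_allocation(np.array([0.5, 0.3, 0.2]),
                        np.array([2.0, 1.0, 0.5]), budget=50))
```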

  18. Representative Sampling: Proportions • Proportion of a subspace: the fraction of data records belonging to the subspace • Depends on the proportion of the subspace within each stratum • Estimated in the j-th stratum from its sample • Risk function • The distance between the estimated parameters and their true values • Iterative sampling process • At each iteration, the stratum with the largest decrease of the risk function is chosen for sampling • The parameters are then updated
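
A sketch of the proportion estimator implied by the slide, in standard stratified form (the exact notation is an assumption):

```latex
\[
  \hat{P}_i \;=\; \sum_{j=1}^{m} W_j\, \hat{p}_{i,j},
\]
where $\hat{p}_{i,j}$ is the observed fraction of records sampled from the $j$-th
stratum that fall in the $i$-th subspace, and $W_j$ is the stratum weight.
```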

  19. Stratified K-means Clustering • Weight for the data records in the i-th stratum: W_i = N_i / n_i, where N_i is the size of the stratum's population and n_i is the size of its sample • Similar to k-means clustering, with weighted records • Center of the i-th cluster: the weighted mean of the sampled records assigned to it
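
A minimal sketch of one stratified k-means iteration in Python, assuming each sampled record carries the weight W_i = N_i / n_i of its stratum:

```python
import numpy as np

def weighted_kmeans_step(X, w, centers):
    """One assignment pass plus weighted-centroid update over sample X."""
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    new_centers = np.array([
        np.average(X[labels == c], axis=0, weights=w[labels == c])
        if np.any(labels == c) else centers[c]          # keep empty clusters
        for c in range(len(centers))
    ])
    return labels, new_centers

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
w = np.array([2.0, 2.0, 1.0, 1.0])                      # weights W_i = N_i / n_i
labels, centers = weighted_kmeans_step(X, w, np.array([[0.0, 0.0], [5.0, 5.0]]))
```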

  20. Contribution • Sampling methods for solving the problem of clustering over a deep web data source • Representative Sampling • Partition on the space of output attributes • Centers • Optimized Sampling method • Active learning based sampling method • Proportions • Active learning based sampling method

  21. Experiment Results • Data sets • Noisy synthetic data set: 4,000 data records with 4 input attributes and 2 output attributes, plus 400 added noise data points • Yahoo! data set: data on used cars, 8,000 data records • Metric: average distance (AvgDist) between the estimated and true centers

  22. Representative Sampling: Noisy Data Set • Benefit of stratification • Compared with rand, the decreases in AvgDist are 26.9%, 35.5%, 37.4%, and 38.6% • Benefit of representative sampling • Compared with rand_st, the decreases in AvgDist are 11.8%, 14.4%, and 16.1% • Center based sampling methods perform better • The optimized sampling method performs better in the long run

  23. Representative Sampling: Yahoo! Data Set • Benefit of stratification • Compared with rand, the decreases in AvgDist are 7.2%, 13.2%, 15.0%, and 16.8% • Benefit of representative sampling • Compared with rand_st, the decreases in AvgDist are 6.6%, 8.5%, and 10.5% • Center based sampling methods perform better • The optimized sampling method performs better in the long run

  24. Scalability • The execution time of each method is linear in the size of the data set

  25. Roadmap • Introduction • Deep Web • Data Mining on the deep web • Contributions • Stratified K-means Clustering Over A Deep Web Data Source • Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source • Stratification Based Hierarchical Clustering on a Deep Web Data Source • An Active Learning Based Frequent Itemset Mining • Differential Rule Mining • Stratified Sampling for Deep Web Mining • Conclusion and Future work

  26. Outlier Detection • Outlier: an observation that deviates greatly from the other observations • DB(p, D) outlier: at least a fraction p of the objects lie at a distance greater than D from it • Challenges for outlier detection over a deep web data source • Recall: finding as large a fraction of the outliers as possible • Precision: accurately identifying outliers among the sampled data
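
For reference, a naive O(n^2) check of the DB(p, D) definition above; it assumes full access to the data, which is exactly what is unavailable on the deep web, so it serves only to pin the definition down.

```python
import numpy as np

def db_outliers(X, p, D):
    """Return indices of objects with > p fraction of objects farther than D."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    frac_far = (dist > D).mean(axis=1)
    return np.flatnonzero(frac_far > p)

X = np.vstack([np.random.default_rng(0).normal(size=(100, 2)),
               [[8.0, 8.0]]])          # one obvious outlier appended
print(db_outliers(X, p=0.9, D=4.0))    # expected to include index 100
```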

  27. Two-phase Sampling Method • Neighborhood sampling • Aims at improving recall • Query subspaces with a high probability of containing outliers are explored • Uncertain driven sampling • Aims at improving precision

  28. Outliers in Stratified Sampling • Stratified sampling has better performance • Stratification • Similar to the stratification used for k-means clustering over a deep web data source • The number of strata is controlled • Outlier detection • For a data object, let f denote the fraction of data objects at distance greater than D from it • f is estimated from the stratified sample

  29. Neighbor Nodes • Similar data objects tend to come from the same or neighboring query subspaces • Neighbor nodes of a node: its left and right siblings under the same parent, and its left and right cousins

  30. Neighborhood Sampling • [Stratification tree: the root splits on year (Y = 1980, 1990, 2000, 2010); each year node splits on bedrooms (B = 1, 2, 3, 4); bedroom nodes split further on bathrooms (Ba = 1, 2)]

  31. Post-Stratification • The original strata are further stratified after the additional sampling • New stratum: leaf nodes with the same sample rate under the same original stratum • For each data record, the fraction of data objects at distance greater than D is estimated, along with its variance • These yield the probability of the record being an outlier

  32. Uncertain Driven Sampling • For a sampled data record • Outlier: the estimated fraction is greater than p with high confidence • Normal data object: the estimated fraction is less than p with high confidence • Otherwise: an uncertain data object • Task: obtain an additional sample to resolve the uncertain data objects
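
A minimal sketch of this three-way decision in Python, assuming a normal-approximation confidence interval around the estimated fraction f_hat; the paper's exact bound may differ.

```python
def classify(f_hat, std, p, z=1.96):
    """Label a record from its estimated fraction f_hat and its std. error."""
    if f_hat - z * std > p:
        return "outlier"          # even the lower bound exceeds p
    if f_hat + z * std < p:
        return "normal"           # even the upper bound is below p
    return "uncertain"            # needs more samples (phase two)

print(classify(0.95, 0.01, p=0.9))   # outlier
print(classify(0.91, 0.03, p=0.9))   # uncertain
```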

  33. Sample Allocation • For the uncertain data objects, with their estimated fractions • To obtain better estimates of the fractions, the total estimation variance is minimized • Solved using a Lagrange multiplier, as in the optimized allocation above

  34. Outliers in the Stratified Sample • The distance between each pair of sampled data objects is computed • For a sampled data record • Outlier: the estimated fraction of objects at distance greater than D exceeds p • Otherwise: a normal data object • The estimate is based on the fraction of the record's sampled neighbors within its D-neighborhood

  35. Efficient Outlier Detection • It can be shown that a sufficient condition, based on the estimated fraction, classifies a data record directly • If the condition holds, the record can be labeled a normal data object or an outlier without further computation • Otherwise, the full estimate is required

  36. Experiment Results • Data set • Yahoo! data set: data on used cars, 8,000 data records • Evaluation • Recall: the fraction of all outliers that are captured in the sample • Precision: the fraction of records identified as outliers that are true outliers

  37. Recall • Benefit of stratification • Increases over SRS: 108.2%, 116.7%, and 74.7% • Benefit of neighborhood sampling • Increases over SSTS: 19.1% and 28.1% • Uncertain sampling decreases recall by 3.7%

  38. Precision • All four methods perform well • The average precision is over 0.9 • Stratified sampling methods have slightly lower precision • Compared with SSTS, the decreases are 1.7%, 0.68%, and 4.3% • Benefit of uncertain sampling • Compared with NS, the increase is 2.7%

  39. Trade-off between Precision and Recall • Benefit of stratification • TPS, NS, and SSTS improve recall for precision in 0.75-0.975 • Benefit of neighborhood sampling • TPS and NS improve recall for precision in 0.75-0.975 • Benefit of uncertain sampling • TPS improves recall for precision in 0.92-1.0

  40. Roadmap • Introduction • Deep Web • Data Mining on the deep web • Contributions • Stratified K-means Clustering Over A Deep Web Data Source • Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source • Stratification Based Hierarchical Clustering on a Deep Web Data Source • An Active Learning Based Frequent Itemset Mining • Differential Rule Mining • Stratified Sampling for Deep Web Mining • Conclusion and Future work

  41. Stratification Based Hierarchical Clustering on a Deep Web Data Source • Hierarchical clustering based on stratified sampling • Stratification • Sample allocation • Representative sampling • The mean values of the output attributes are kept close to their true values • Uncertain sampling • Samples heavily on the boundaries between clusters

  42. An Active Learning Based Frequent Itemset Mining • Frequent itemset mining • Estimating the support of itemsets • The number of itemsets can be huge • We consider 1-itemsets • Bayesian network • Models the relationship between the input attributes and the output attributes • A risk function is defined on the estimated parameters • Active learning based sampling • Data records are selected step by step • Sample the query subspaces with the greatest decrease of the risk function
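
A sketch of the target quantity, written as a stratified estimate; the role assigned here to the Bayesian network is an assumption based on the slide's description:

```latex
\[
  \widehat{\mathrm{supp}}(A = v) \;=\; \sum_{j} W_j \, \hat{p}_j(A = v),
\]
where $\hat{p}_j(A = v)$ is the estimated fraction of records in query subspace $j$
taking value $v$ on attribute $A$; presumably the Bayesian network over input and
output attributes supplies these estimates for subspaces that have not yet been
sampled, and the risk function measures their expected error.
```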

  43. Differential Rule Mining • Different data sources report different values for the same data object • e.g., the prices of commodities • Goal: analyzing the differences between data sources • Differential rule • Left-hand side: a frequent itemset • Right-hand side: the behavior of the differential attribute • For illustration, a rule might read {make, model year} => the average price difference between two sources is statistically significant • Differential rule mining • Apriori algorithm • Statistical hypothesis testing

  44. Stratified Sampling for Association Rule Mining and Differential Rule Mining • Data mining tasks • Association rule mining & differential rule mining • Stratified sampling • Stratification • Combining estimation variance and sampling cost • A tree is recursively built on the query space • Sample allocation • An optimized method for minimizing the integrated cost of variance and sampling cost

  45. Conclusion • Data mining on the deep web is challenging • We proposed methods for data mining on the deep web • A stratified k-means clustering method • A two-phase sampling based outlier detection method • A stratification based hierarchical clustering method • An active learning based frequent itemset mining method • A stratified sampling method for data mining on the deep web • Differential rule mining • The experimental results demonstrate the efficiency of our methods

  46. Future Work • Outlier detection over a deep web data source • Consider the problem of statistical distribution based outlier detection • Mining multiple deep web data sources • Instance based schema matching • Efficiently sampling instances from the deep web to facilitate schema matching • Mining the data coverage of multiple deep web data sources • Efficient sampling methods for estimating the data coverage of multiple data sources

  47. Questions?
