CDS-Tree: An Effective Index for Clustering Arbitrary Shapes in Data Streams

CDS-Tree: An Effective Index for Clustering Arbitrary Shapes in Data Streams. Huanliang Sun, Ge Yu, Yubin Bao, Faxin Zhao, Daling Wang. RIDE-SDMA’05. Advisor: Jia-Ling Koh. Speaker: Tsui-Feng Yen.


Presentation Transcript


  1. CDS-Tree: An Effective Index for Clustering Arbitrary Shapes in Data Streams. Huanliang Sun, Ge Yu, Yubin Bao, Faxin Zhao, Daling Wang. RIDE-SDMA’05. Advisor: Jia-Ling Koh. Speaker: Tsui-Feng Yen

  2. Introduction
     • Partitioning
       - k-means and k-medians algorithms do not focus on finding arbitrary shapes in data streams
     • Density-based
       - DBSCAN can find arbitrary shapes, but needs to scan the database more than once
     • Cell-based (grid-based)
       - CLIQUE has three problems:
         - high complexity
         - high memory consumption
         - poor accuracy under limited memory for changing data streams

  3. Problem Definition
     • Domain: A = {A1, A2, …, Ak}
     • Let S = A1 × A2 × … × Ak be a k-dimensional numerical space, with A1, A2, …, Ak as the dimensions (attributes) of S
     • A k-dimensional data stream X = {x1, x2, …, xn} is a set of ordered objects at time point t, where xi = <xi1, xi2, …, xik> and xij, the j-th component of xi, is drawn from domain Aj

  4. Definition
     • Sliding window model on data stream X
       - The window consists of buckets B1, …, Bu, where B1 is the most recent bucket and Bu is the oldest
       - The window slides by creating a new bucket and discarding the oldest one
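The bucket mechanics above can be sketched in a few lines. This is only an illustration, not the paper's implementation: the class name `BucketWindow`, the count-based `bucket_size`, and the choice to return the expired bucket are all assumptions (the slide does not say whether buckets are closed by count or by time).

```python
from collections import deque


class BucketWindow:
    """Sliding window of at most `u` buckets (hypothetical helper).

    buckets[0] plays the role of B1 (most recent) and buckets[-1]
    the role of Bu (oldest). Closing a bucket after `bucket_size`
    points is an assumption made for this sketch.
    """

    def __init__(self, u, bucket_size):
        self.u = u
        self.bucket_size = bucket_size
        self.buckets = deque()

    def add(self, point):
        """Add a point; return the discarded oldest bucket, or None."""
        expired = None
        # Open a new bucket when there is none yet or the newest one is full.
        if not self.buckets or len(self.buckets[0]) == self.bucket_size:
            self.buckets.appendleft([])
            # The window slides: drop the oldest bucket once there are more than u.
            if len(self.buckets) > self.u:
                expired = self.buckets.pop()
        self.buckets[0].append(point)
        return expired
```

Returning the expired bucket lets a caller subtract its points from whatever per-cell counts it maintains, which is one plausible way to keep cell selectivities consistent with the window.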

  5. Definition cont.
     • Partition P of data stream X
       - P is a set of non-overlapping rectangular cells, obtained by partitioning every dimension of X into intervals of equal length
       - Each cell C is the intersection of one interval from each dimension. It is represented in the form {c1, c2, …, ck}
       - A cell can also be denoted as (cNO1, cNO2, …, cNOk), called the coordinate of the cell, where cNOi is the interval number of the cell on the i-th dimension
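As a rough illustration of the cell coordinate, the sketch below maps a point to its interval numbers. The function name `cell_coordinate`, the explicit `lows`/`highs` bounds, the 0-based interval numbering, and the clamping of out-of-range values are assumptions for this example and are not taken from the slides.

```python
def cell_coordinate(point, lows, highs, intervals):
    """Return the coordinate (cNO1, ..., cNOk) of the cell containing `point`.

    Every dimension [low, high] is split into `intervals` equal-length
    intervals, numbered from 0 (an assumption of this sketch).
    """
    coord = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / intervals
        idx = int((x - lo) / width)
        # Clamp so that x == hi (or an out-of-range value) lands in an edge cell.
        coord.append(min(max(idx, 0), intervals - 1))
    return tuple(coord)
```

With bounds [0, 8] on both dimensions and 4 intervals per dimension, the point (2, 3) falls in the cell with coordinate (1, 1).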

  6. Definition cont.
     • Selectivity pc of cell C
       - The number of points that belong to C defines the selectivity pc of cell C
     • Clustering based on the cells of data stream X in a sliding window
       - If the selectivity of a cell is larger than a threshold τ, the cell is called dense
       - A cluster is the largest set of cells that are adjacent and dense
       - Two cells C1 and C2 are connected when they are neighboring, or when there exists a cell C3 such that C1 and C3 are neighboring and C2 and C3 are neighboring
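The definitions above can be sketched as a count-then-flood-fill over cell coordinates. This is a generic grid-clustering sketch, not the CDS-Tree algorithm itself. Two assumptions are made: "neighboring" is taken to mean sharing a face (coordinates differ by 1 in exactly one dimension), and connectivity is followed through chains of dense neighbors of any length. The names `face_neighbors` and `cluster_dense_cells` are hypothetical.

```python
from collections import Counter, deque


def face_neighbors(cell):
    """Yield cells that differ from `cell` by 1 in exactly one dimension."""
    for i in range(len(cell)):
        for step in (-1, 1):
            yield cell[:i] + (cell[i] + step,) + cell[i + 1:]


def cluster_dense_cells(cell_coords, tau):
    """Group dense cells into clusters of connected cells.

    `cell_coords` holds one cell coordinate per point in the window.
    A cell is dense when its selectivity (point count) is larger than tau.
    """
    selectivity = Counter(cell_coords)
    dense = {c for c, n in selectivity.items() if n > tau}
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        # Breadth-first search over dense neighbors collects one cluster.
        seen.add(start)
        cluster, queue = {start}, deque([start])
        while queue:
            cell = queue.popleft()
            for nb in face_neighbors(cell):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    cluster.add(nb)
                    queue.append(nb)
        clusters.append(cluster)
    return clusters
```

Because clusters are unions of connected cells rather than spheres around centroids, this style of clustering can follow arbitrary shapes at the resolution of the grid.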

  7. CDS-Tree [Figure: example CDS-Tree for the incoming data stream (2,3), (5,4), (6,5); labels: root-node, mid, leaf, total-num-list]

  8. Related Algorithms of CDS-Tree • CDS-Tree building algorithm

  9. Related Algorithms of CDS-Tree • Clustering algorithm based on CDS-Tree.

  10. Granularity Adjustment
      - The finer the partition, the higher the accuracy, but the more cells are created
      - If the current memory cost Mp is far less than Mmax, a finer-granularity partition can be used for higher accuracy
      - If the current memory cost Mp is close to Mmax, a coarser partition should be used to avoid memory overflow

  11. Granularity Adjustment cont.
      • Safety factors (in case memory is exhausted)
        - λ: keeps the required memory from exceeding the memory limit Mmax when the granularity becomes finer; it is set larger than 1
        - η: decides the time point at which to adjust the granularity; η is less than 1. For example, if η is set to 0.1, then when the remaining memory is less than 10% of Mmax, the algorithm makes the granularity coarser to save memory
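One way to read the two safety factors is as a simple decision rule. The sketch below is an interpretation under stated assumptions, not the algorithm from the paper: the function name `granularity_action`, the caller-supplied estimate `est_finer_cost`, the default `lam=1.2`, and the exact way λ is applied (refine only if λ times the estimated cost still fits in Mmax) are all hypothetical. Only the η rule (coarsen when the remaining memory drops below η · Mmax) follows the slide directly.

```python
def granularity_action(m_p, m_max, est_finer_cost, lam=1.2, eta=0.1):
    """Decide how to adjust the partition granularity.

    m_p            -- current memory cost (Mp)
    m_max          -- memory limit (Mmax)
    est_finer_cost -- caller's estimate of memory needed after refining
                      (hypothetical input, not defined in the slides)
    lam            -- safety factor lambda, expected to be > 1
    eta            -- safety factor eta, expected to be < 1
    """
    # Remaining memory below eta * Mmax: coarsen to save memory.
    if m_max - m_p < eta * m_max:
        return "coarser"
    # Assumed use of lambda: refine only with a safety margin to spare.
    if lam * est_finer_cost <= m_max:
        return "finer"
    return "keep"
```

For example, with Mmax = 100 and η = 0.1, a current cost of 95 leaves only 5 units free, so the rule returns "coarser".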

  12. Granularity Adjustment Algorithm

  13. Experimental Results
      • OS: Microsoft Windows 2000
      • CPU: 2.5 GHz
      • RAM: 512 MB
      • Two datasets:
        - KDD-CUP-99 Network Intrusion Detection stream dataset
        - Image Fourier Coefficient dataset

  14. Experimental Results

  15. Experimental Results