
Cache-Conscious Performance Optimization for Similarity Search

36th ACM International Conference on Information Retrieval. Cache-Conscious Performance Optimization for Similarity Search. Maha Alabduljalil, Xun Tang, Tao Yang. Department of Computer Science, University of California at Santa Barbara.


Presentation Transcript


  1. 36th ACM International Conference on Information Retrieval. Cache-Conscious Performance Optimization for Similarity Search. Maha Alabduljalil, Xun Tang, Tao Yang. Department of Computer Science, University of California at Santa Barbara.

  2. All Pairs Similarity Search (APSS) • Definition: finding all pairs of objects whose similarity is above a given threshold, i.e., Sim(di, dj) = cos(di, dj) ≥ τ. • Application examples: • Collaborative filtering. • Spam and near-duplicate detection. • Image search. • Query suggestions. • Motivation: APSS is still time-consuming for large datasets.
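
The definition above can be sketched as a quadratic-time baseline. This is only an illustration of the problem statement, not the paper's algorithm; the sparse-vector representation and the names `cosine`/`apss` are mine:

```python
import math

def cosine(u, v):
    # Cosine similarity of two sparse vectors given as {term: weight} dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def apss(docs, tau):
    # All pairs (i, j), i < j, with cos(d_i, d_j) >= tau: the brute-force
    # baseline whose cost motivates the optimizations in this talk.
    return [(i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
            if cosine(docs[i], docs[j]) >= tau]
```

For example, `apss([{"a": 1.0}, {"a": 1.0, "b": 1.0}, {"c": 1.0}], 0.7)` reports only the first two documents as similar, since their cosine is 1/√2 ≈ 0.707.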

  3. Previous Work • Approaches to speed up APSS: • Exact APSS: • Dynamic computation filtering [Bayardo et al. WWW'07]. • Inverted indexing [Arasu et al. VLDB'06]. • Parallelization with MapReduce [Lin SIGIR'09]. • Partition-based similarity comparison [Alabduljalil et al. WSDM'13]. • Approximate APSS via LSH: trades off precision against recall and adds redundant computation. • Approaches that exploit the memory hierarchy: • General query processing [Manegold et al. VLDB'02]. • Other computing problems.

  4. Baseline: Partition-based Similarity Search (PSS) [WSDM'13]. Partitioning with dissimilarity detection; similarity comparison with parallel tasks.

  5. PSS Task Memory areas: S = vectors owned by the task, B = buffer for other vectors, C = temporary score accumulator. Task steps: • Read the assigned partition into area S. • Repeat: • Read some vectors vi from other partitions into B. • Compare vi with S. • Output similar vector pairs. • Until all other potentially similar vectors have been compared.
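
The task steps above can be sketched roughly as follows. This is a minimal, hedged illustration of the control flow only; `pss_task`, the (id, vector) tuples, and the pluggable `sim` function are my assumptions, not the paper's code:

```python
def pss_task(S, other_blocks, tau, sim):
    # S: the owned partition, kept resident for the whole task.
    # other_blocks: stream of vector blocks from other partitions (area B).
    # Scores are accumulated transiently (area C) and similar pairs emitted.
    results = []
    for B in other_blocks:           # read some vectors from other partitions
        for j, dj in B:
            for i, di in S:
                score = sim(di, dj)  # temporary score lives in area C
                if score >= tau:
                    results.append((i, j, score))
    return results
```

The key structural point, used by the rest of the talk, is that S is read once while B and C are revisited for every incoming block.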

  6. Focus and Contribution • Contribution: • Analyze memory hierarchy behavior in PSS tasks. • New data layout/traversal techniques for speedup: • Splitting data blocks to fit cache. • Coalescing: read a block of vectors from other partitions and process them together. • Algorithms: • Baseline: PSS [WSDM’13] • Cache-conscious designs: PSS1 & PSS2 6

  7. PROBLEM 1: PSS area S is too big to fit in cache. [Slide diagram: area S holds the inverted index of the owned vectors, C is the accumulator for S, and B holds other vectors; S is too long to fit in cache.]

  8. PSS1: Cache-conscious data splitting. [Slide diagram: after splitting, S becomes splits S1, S2, …, Sq, each compared against B with its own accumulator in C. Open question: what split size?]

  9. PSS1 Task • Read S and divide it into many splits. • Read other vectors into B. • For each split Sx: Compare(Sx, B); output similarity scores. Compare(Sx, B): for di in Sx: for dj in B: sim(di,dj) += wi,t * wj,t; if sim(di,dj) + maxwdi * sumdj < τ then prune dj (dynamic filtering).
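
A runnable sketch of the PSS1 loop and its filtering test, assuming sparse vectors as {term: weight} dicts (so sim is a dot product over shared terms); the names and the exact bookkeeping of the remaining-weight sum are my illustration of the slide's pseudocode, not the authors' implementation:

```python
def compare(split, B, tau):
    # Feature-wise accumulation with the slide's dynamic filter: abandon dj
    # once sim(di,dj) + maxw(di) * (unseen weight of dj) < tau, because the
    # partial score can then no longer reach the threshold.
    out = []
    for i, di in enumerate(split):
        maxw_i = max(di.values())
        for j, dj in enumerate(B):
            rem = sum(dj.values())           # sumdj: dj weight not yet seen
            sim = 0.0
            for t, wj in dj.items():
                rem -= wj
                sim += di.get(t, 0.0) * wj   # sim(di,dj) += wi,t * wj,t
                if sim + maxw_i * rem < tau:
                    break                    # prune: cannot reach tau
            if sim >= tau:
                out.append((i, j, sim))
    return out

def pss1_task(S, B, tau, split_size):
    # PSS1: divide the owned partition S into cache-sized splits and
    # compare each split against B.
    out = []
    for start in range(0, len(S), split_size):
        for i, j, s in compare(S[start:start + split_size], B, tau):
            out.append((start + i, j, s))
    return out
```

The split loop changes only the traversal order, not the output, which is why split size can be tuned purely for cache behavior.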

  10. Modeling Memory/Cache Access of PSS1 • Area Si: sim(di,dj) += wi,t * wj,t. • Area B. • Area C: if sim(di,dj) + maxwdi * sumdj < τ then … Total number of data accesses: D0 = D0(Si) + D0(B) + D0(C).

  11. Cache misses and data access time. Memory and cache access counts: D0 = total memory data accesses; D1 = missed accesses at L1; D2 = missed accesses at L2; D3 = missed accesses at L3. Access times: δi = access time at cache level i; δmem = access time in memory. Total data access time = (D0-D1)δ1 + (D1-D2)δ2 + (D2-D3)δ3 + D3δmem.
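
The model above can be evaluated directly. A small sketch, with latency defaults that are illustrative midpoints of the cycle ranges quoted on the next slides (~2, 6-10, 30-40, 100-300 cycles), not measured values:

```python
def data_access_time(D0, D1, D2, D3, d1=2, d2=8, d3=35, dmem=200):
    # Total data access time per the slide's model:
    # accesses satisfied at each level pay that level's latency,
    # and only the D3 misses go all the way to memory.
    return (D0 - D1) * d1 + (D1 - D2) * d2 + (D2 - D3) * d3 + D3 * dmem
```

For instance, with D0=100, D1=10, D2=5, D3=1 the model gives 90·2 + 5·8 + 4·35 + 1·200 = 560 cycles; the single memory miss alone accounts for over a third of the total, which is why the splitting in PSS1 targets memory misses first.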

  12. Total data access time: data found in L1 costs ~2 cycles (the (D0-D1)δ1 term).

  13. Data found in L2 costs 6-10 cycles (the (D1-D2)δ2 term).

  14. Data found in L3 costs 30-40 cycles (the (D2-D3)δ3 term).

  15. Data found in memory costs 100-300 cycles (the D3δmem term).

  16. Actual vs. Predicted. Avg. task time ≈ #features × (lookup + multiply + add) + access_mem.

  17. RECALL: Split size s. [Slide diagram: S split into S1, S2, …, Sq of size s, each compared against B with its accumulator in C.]

  18. Ratio of Data Access to Computation. Avg. task time ≈ #features × (lookup + add + multiply) + access_mem. [Slide chart: the data-access and computation portions of task time as a function of split size s.]

  19. PSS2: Vector coalescing • Issues: • PSS1 focuses on splitting S to fit into cache. • PSS1 does not exploit cache reuse to improve temporal locality in memory areas B and C. • Solution: coalesce multiple vectors in B.
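
The coalescing idea can be sketched as simple loop blocking over B: process a block of b vectors together against each split so the block and its accumulators stay cache-resident. An illustration under my own naming, not the authors' implementation (which coalesces via a size-controlled inverted index):

```python
def pss2_compare(split, B, tau, b):
    # Walk B in coalesced blocks of b vectors; each vector of the split is
    # compared against the whole block while it is still warm in cache.
    out = []
    for start in range(0, len(B), b):
        block = B[start:start + b]
        for i, di in enumerate(split):
            for k, dj in enumerate(block):
                s = sum(w * dj.get(t, 0.0) for t, w in di.items())
                if s >= tau:
                    out.append((i, start + k, s))
    return out
```

As with splitting, blocking only reorders the pair comparisons, so b can be tuned for locality without changing the result.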

  20. PSS2: Example of improved locality. [Slide diagram: striped areas of Si, B, and C stay in cache while a coalesced block of B is processed.]

  21. Evaluation • Implementation: Hadoop MapReduce. • Objectives: • Effectiveness of PSS1 and PSS2 over PSS. • Benefits of modeling. • Datasets: Twitter, ClueWeb, Enron emails, YahooMusic, Google news. • Preprocessing: • Stopword removal + df-cut. • Static partitioning for dissimilarity detection.

  22. Improvement Ratio of PSS1 and PSS2 over PSS. [Slide chart: up to 2.7x improvement.]

  23. RECALL: Coalescing size b. [Slide diagram: a coalesced block of b vectors from B compared against Si with accumulator C; avg. # of features shared = 2.]

  24. Average number of shared features.

  25. Overall performance

  26. Overall performance (ClueWeb).

  27. Impact of split size s in PSS1 (ClueWeb, Twitter, Emails).

  28. RECALL: Split size s & coalescing size b. [Slide diagram: split Si of size s compared with a coalesced block of size b from B, accumulators in C.]

  29. Effect of s & b on PSS2 performance (Twitter). [Slide chart: the fastest configuration is marked.]

  30. Conclusions • Splitting hosted partitions to fit into cache reduces slow memory data accesses (PSS1). • Coalescing vectors with size-controlled inverted indexing improves the temporal locality of visited data (PSS2). • Cost modeling of memory-hierarchy access guides parameter-setting optimization. • Experiments show the cache-conscious designs can be up to 2.74x as fast as the cache-oblivious baseline.
