
KLEE: A Framework for Distributed Top-k Query Algorithms

KLEE: A Framework for Distributed Top-k Query Algorithms. Sebastian Michel, Peter Triantafillou, Gerhard Weikum. VLDB 2005. Presented by Amrita Tamrakar. Overview: Problem Statement, KLEE, The Histogram Bloom Structure, Candidate Filtering, Conclusion.


Presentation Transcript


  1. KLEE: A Framework for Distributed Top-k Query Algorithms • Sebastian Michel, Peter Triantafillou, Gerhard Weikum • VLDB 2005 • Presented by Amrita Tamrakar

  2. Overview • Problem Statement • KLEE • The Histogram Bloom Structure • Candidate Filtering • Conclusion

  3. Problem Statement: A query with t terms, with index lists spread across m peers P1 ... Pm. Each peer Pj stores the inverted index list for one term. The top-k result is a sorted list of (docID, TotalScore) pairs, where the TotalScore of a docID is a monotonic aggregation of the scores of this document in all m index lists.

  4. Problem Definition: P0 is the peer where the query is initiated (figure shows peers P0, P1, P2, P3). Example index lists, as (docID score) entries in descending score order: t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, … • t2: d64 0.8, d23 0.6, d10 0.6, d10 0.2, d78 0.1, … • t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, … Problems to be considered: • network consumption • per-peer load • latency (query response time) • processing

  5. Naïve Solution • All m peers send their complete index lists to Pinit, which then executes a centralized TA-style method • Alternatively, execute TA at Pinit and access the remote index lists one entry at a time (more message rounds needed!)
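The first naïve variant can be sketched in a few lines of Python. This is only an illustration: the peer names, the per-peer lists (loosely modeled on the example lists of slide 4), the function name `naive_top_k`, and the choice of summation as the monotonic aggregation are all assumptions, not taken from the paper.

```python
from collections import defaultdict

# Hypothetical per-peer index lists (docID -> score), for illustration only.
index_lists = {
    "P1": {"d78": 0.9, "d23": 0.8, "d10": 0.8, "d1": 0.7, "d88": 0.2},
    "P2": {"d64": 0.8, "d23": 0.6, "d10": 0.6, "d78": 0.1},
    "P3": {"d10": 0.7, "d78": 0.5, "d64": 0.4, "d99": 0.2, "d34": 0.1},
}


def naive_top_k(index_lists, k):
    """Ship every complete list to the coordinator and aggregate there."""
    totals = defaultdict(float)
    pairs_transferred = 0
    for peer_list in index_lists.values():
        for doc_id, score in peer_list.items():
            totals[doc_id] += score   # summation: one possible monotonic aggregation
            pairs_transferred += 1    # every (docID, score) pair crosses the network
    # Sort by descending total score, breaking ties by docID.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k], pairs_transferred


top2, cost = naive_top_k(index_lists, k=2)
print([doc for doc, _ in top2], cost)
```

The transfer count grows with the total length of all index lists, which is the network cost KLEE tries to avoid.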

  6. KLEE: • Different philosophy: approximate answers! • Efficiency: • Reduces (docId, score)-pair transfers • no random accesses at each peer • Two pillars: • The HistogramBlooms structure • The Candidate List Filter structure

  7. KLEE Steps: • Exploration Step: get a better approximation of the min-k score threshold (topKScore) • Optimization Step: decide whether to run the algorithm in 3 or 4 steps • Candidate Filtering: a docID is a good candidate if it is high-scored at many peers • Candidate Retrieval: get all good docID candidates

  8. Histogram Bloom Structure • Each peer pre-computes, for each index list, an equi-width histogram with: • a Bloom filter for each cell • the average score per cell • the upper/lower score
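A minimal sketch of such a per-list structure, under stated assumptions: scores lie in [0, 1], a plain Python set stands in for each cell's Bloom filter, and the function name `build_histogram` and the dictionary layout are hypothetical.

```python
def build_histogram(index_list, num_cells, low=0.0, high=1.0):
    """Build an equi-width histogram over [low, high] for one index list.

    Each cell records its lower/upper score bound, the docIDs that fall
    into it (a set standing in for the per-cell Bloom filter), and the
    average score of those documents.
    """
    width = (high - low) / num_cells
    cells = [
        {"lower": low + i * width, "upper": low + (i + 1) * width,
         "docs": set(), "sum": 0.0}
        for i in range(num_cells)
    ]
    for doc_id, score in index_list.items():
        # Clamp so that score == high lands in the last cell.
        i = min(max(int((score - low) / width), 0), num_cells - 1)
        cells[i]["docs"].add(doc_id)
        cells[i]["sum"] += score
    for cell in cells:
        count = len(cell["docs"])
        cell["avg"] = cell["sum"] / count if count else None
    return cells


cells = build_histogram(
    {"d78": 0.9, "d23": 0.8, "d10": 0.8, "d1": 0.7, "d88": 0.2}, num_cells=4
)
for cell in cells:
    print(cell["lower"], cell["upper"], sorted(cell["docs"]), cell["avg"])
```

In a real deployment the set would be replaced by a fixed-size Bloom filter, so that the whole structure stays small enough to ship to the coordinator.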

  9. Bloom Filter • A space-efficient probabilistic data structure used to test whether an element is a member of a set • Vector V of m bits, initially all set to 0 • k hash functions, each with range 1…m • Insert n docs by hashing their IDs and setting the corresponding bits • Trade-off: accuracy vs. efficiency • Figure: a Bloom filter with 4 hash functions, a ∈ A • Given a query b, check the bits at positions h1(b), h2(b), ..., hk(b). If any of them is 0, then b is not in the set A; if all of them are 1, b is probably in A (false positives are possible)
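The insert and lookup operations described above can be sketched as follows. The class name, the use of salted SHA-256 digests as the k hash functions, and the 0-based bit positions (0…m-1 rather than 1…m) are assumptions made for this illustration.

```python
import hashlib


class BloomFilter:
    """Bit vector of num_bits bits with num_hashes hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # initially all set to 0

    def _positions(self, item):
        # Simulate k independent hash functions by salting SHA-256
        # with the index of the hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Any 0 bit proves absence; all 1 bits mean "probably present".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(num_bits=64, num_hashes=4)
for doc_id in ("d78", "d23", "d10"):
    bf.add(doc_id)
print("d78" in bf, "d23" in bf)
```

A lookup never reports an inserted docID as absent, but it may report a docID that was never inserted as present; a larger bit vector lowers that false-positive rate at the cost of more bandwidth.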

  10. Exploration Step (figure): Coordinator peer P0 holds the current top-k and the candidate set. Cohort peers Pi and Pj are each shown with their index list (score axis, topKScore / m marked), its top k entries, and a histogram of c cells, each cell carrying a Bloom filter of b bits.

  11. Exploration Step: To calculate topKScore, Pinit has to estimate the missing scores. If a document is not present in the index list entries received from some peer Pj, Pinit uses the Bloom filters of that peer to find the histogram cell in which the document may lie and takes the average score of that cell as the score of that document. After replacing all missing scores, Pinit computes the top-k list and identifies the score of the kth document in the list as topKScore.
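The estimation can be sketched as below. The data layout is an assumption: `received` maps each peer to the (docID, score) entries it sent, and `histograms` maps each peer to a list of (doc_set, avg_score) cells in which a plain set stands in for the cell's Bloom filter. The function name and the sample values are hypothetical, summation is assumed as the aggregation, and `received` is assumed to be non-empty.

```python
def estimate_top_k_score(received, histograms, k):
    """Fill in missing scores with cell averages, then rank.

    Returns the estimated top-k list and the score of its k-th entry.
    """
    all_docs = set().union(*(prefix.keys() for prefix in received.values()))
    totals = {}
    for doc in all_docs:
        total = 0.0
        for peer, prefix in received.items():
            if doc in prefix:
                total += prefix[doc]  # exact score was sent by this peer
                continue
            # Missing score: take the average of the first cell whose
            # filter reports the document; contribute 0 if none does.
            for doc_set, avg_score in histograms[peer]:
                if doc in doc_set:
                    total += avg_score
                    break
        totals[doc] = total
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:k]
    return ranked, ranked[-1][1]


received = {
    "P1": {"d78": 0.9, "d23": 0.8},
    "P2": {"d64": 0.8, "d23": 0.6},
}
histograms = {
    "P1": [({"d64"}, 0.25)],
    "P2": [({"d78"}, 0.125)],
}
top_k, top_k_score = estimate_top_k_score(received, histograms, k=2)
print(top_k, top_k_score)
```

With real Bloom filters a false positive can assign a cell average to a document that the peer does not hold at all, which is one source of the approximation error KLEE accepts.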

  12. Candidate List Filter Matrix • Goal: filter out unpromising candidate documents in step 2 • Estimate the maximum number of docs that are above the min-k / m threshold (Maximum_size_candidate_list) • Send this number and the threshold to the peers • Figure: number of documents plotted against score, with the topKScore / m threshold marked
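One way to compute the threshold and an upper estimate of the candidate list size is sketched below. It assumes that each histogram cell also exposes a document count alongside its upper score bound, which the slides do not state explicitly; both function names and the sample values are hypothetical.

```python
def candidate_threshold(top_k_score, num_peers):
    """Per-peer score threshold: topKScore / m."""
    return top_k_score / num_peers


def max_candidate_list_size(histogram, threshold):
    """Upper estimate of the number of docs scoring above the threshold.

    histogram: list of (upper_bound, doc_count) cells for one peer.
    A cell whose upper bound exceeds the threshold may contain
    qualifying documents, so its whole count is included.
    """
    return sum(count for upper, count in histogram if upper > threshold)


threshold = candidate_threshold(top_k_score=1.05, num_peers=3)
histogram = [(0.25, 1), (0.5, 0), (0.75, 1), (1.0, 3)]
size = max_candidate_list_size(histogram, threshold)
print(threshold, size)
```

Counting whole cells overestimates the list size when the threshold falls inside a cell, which errs on the side of not dropping a qualifying document.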

  13. Candidate List Filter Matrix (CLFM) • Each peer returns a Bloom filter that “contains” all docs above the topKScore / m threshold • For m peers, the m filters form the rows of the matrix, e.g. row 1: 010101001011110101001001010101001, row 2: 010010011001011111001001010111110, ..., row m: 101010101010100110010010011110000 • Select all columns with at least R bits set • Redefined CLF: 000000001000000100000000000100000
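The column selection can be sketched as follows, with each peer's Bloom filter represented as a bit string. The function name `combine_filters` and the small sample rows are assumptions made for illustration, not the bit vectors shown on the slide.

```python
def combine_filters(rows, r):
    """Keep a 1 in every column where at least r of the rows have a 1.

    rows: equal-length bit strings, one Bloom filter per peer.
    """
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all filters must be non-empty and of equal length")
    return "".join(
        "1" if sum(bit == "1" for bit in column) >= r else "0"
        for column in zip(*rows)
    )


rows = ["0101", "0110", "1100"]
print(combine_filters(rows, r=2))
print(combine_filters(rows, r=1))
```

A larger R keeps only documents that score above the threshold at many peers, which shrinks the candidate set at the risk of missing a true top-k document.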

  14. KLEE: Candidate Filter (figure): Coordinator peer P0 holds the current top-k, the candidate set, and the candidate filter matrix, shown with the candidates bit vector 0000100000100000001. Cohort peers Pi and Pj are each shown with their index list (score axis, top k entries, min-k / m threshold marked) and a bit vector (010010000100010001 and 100010100000010001).

  15. (Figure, same layout as slide 14) Coordinator peer P0 with the current top-k, the candidate set, and the candidate filter matrix; cohort peers Pi and Pj with their index lists. An early stopping point is marked on the index list in addition to the min-k / m threshold.

  16. Conclusion • KLEE: approximate top-k algorithms for wide-area networks • Significant performance benefits can be enjoyed, at only small penalties in result quality • Flexible framework for top-k algorithms, allowing for trading off: • efficiency versus result quality • bandwidth savings versus the number of communication phases • Various fine-tuning parameters
