Bloom Filters

Bloom Filters. Lookup questions: Does item “ x ” exist in a set or multiset? Data set may be very big or expensive to access. Filter lookup questions with negative results before accessing data. Allow false positive errors, as they only cost us an extra data access.

jael-glenn
Presentation Transcript


  1. Bloom Filters • Lookup questions: Does item “x” exist in a set or multiset? • Data set may be very big or expensive to access. Filter lookup questions with negative results before accessing data. • Allow false positive errors, as they only cost us an extra data access. • Don’t allow false negative errors, because they result in wrong answers.

  2. Bloom Filter [B70] • Encoding an attribute a ∈ U • Maintain a bit vector V of size m • Use k hash functions (h1..hk), hi: U → [1..m] • Encoding: For item x, “turn on” bits V[h1(x)]..V[hk(x)]. • Lookup: For item x, check bits V[h1(x)]..V[hk(x)]. If all equal 1, return “Probably Yes”. Else “Definitely No”.
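The encoding and lookup steps on this slide can be sketched in Python. This is a minimal illustration, not the construction from [B70]: the class name, the 0-based indices, and the way the k hash functions are derived (salting SHA-256 with the function index) are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Bit vector V of size m with k hash functions (indices are 0-based here)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, chosen for readability

    def _indexes(self, item: str):
        # Assumption: h_i(x) is simulated by salting SHA-256 with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Encoding: turn on bits V[h1(x)]..V[hk(x)].
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        # True means "probably yes"; False means "definitely no".
        return all(self.bits[idx] for idx in self._indexes(item))


bf = BloomFilter(m=1024, k=4)
for url in ("a.example/index", "b.example/logo"):
    bf.add(url)
```

Any item that was added always answers "probably yes", so there are no false negatives; an item that was never added can still answer "probably yes" if other items happen to have set all of its bits.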

  3. Bloom Filter [Figure: item x is hashed by h1(x), h2(x), h3(x), ..., hk(x) into the bit vector V0..Vm-1, and the bits at those positions are set to 1.]

  4. Bloom Errors [Figure: items a, b, c and d are encoded in the bit vector V0..Vm-1; the positions h1(x), h2(x), h3(x), ..., hk(x) are checked for x.] x didn’t appear, yet its bits are already set.

  5. Error Estimation • Assumption: Hash functions are perfectly random. • Probability of a bit being 0 after hashing all n elements: $(1 - 1/m)^{kn} \approx e^{-kn/m}$ • Let $p = e^{-kn/m}$; the probability of a false positive is: $(1 - p)^k = (1 - e^{-kn/m})^k$ • Assuming we are given m and n, the optimal k is: $k = (m/n) \ln 2$
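To make the estimate concrete, the following Python sketch evaluates the standard approximations: with p = e^(-kn/m), the false positive probability is taken as (1 - p)^k and the optimal k as (m/n) ln 2. The function names and the sample values of m and n are illustrative assumptions, not figures from the slides.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive probability for m bits, n items, k hashes."""
    p = math.exp(-k * n / m)  # probability that a given bit is still 0
    return (1.0 - p) ** k


def optimal_k(m: int, n: int) -> float:
    """Real-valued k that minimizes the false positive estimate."""
    return (m / n) * math.log(2)


m, n = 8000, 1000  # hypothetical sizes: 8 bits per stored item
best_k = optimal_k(m, n)
rates = {k: false_positive_rate(m, n, k) for k in range(1, 11)}
```

Since k must be an integer in practice, one would round `best_k` and compare the neighboring entries in `rates`.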

  6. Bloom Filter Tradeoffs • Three factors: m, k and n. • Normally, n and m are given, and we select k. • Small k • Fewer computations. • The number of bits accessed (nk) is smaller, so the chance of a “step over” is smaller too. • However, fewer bits need to be stepped over to generate an error. • For big k, the exact opposite holds. • Not surprisingly, when k is optimal, the “hit ratio” (ratio of bits flipped in the array) is exactly 0.5.
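The 0.5 "hit ratio" follows from substituting the optimal k into the estimate for the fraction of bits still at 0. A short derivation, assuming the standard approximation p = e^{-kn/m} and the optimum k = (m/n) ln 2:

```latex
k_{\mathrm{opt}} = \frac{m}{n}\ln 2
\;\Longrightarrow\;
p = e^{-k_{\mathrm{opt}}\, n/m} = e^{-\ln 2} = \frac{1}{2},
\qquad
(1-p)^{k_{\mathrm{opt}}} = \left(\frac{1}{2}\right)^{(m/n)\ln 2} \approx 0.6185^{\,m/n}
```

Half of the bits are 0 and half are 1 at the optimum, and under this approximation the false positive probability drops geometrically as the number of bits per item, m/n, grows.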

  7. Summary Cache [FCAB00] • Proxy servers maintain local cache to minimize expensive internet requests. • Proxy must maintain an efficient lookup method into the cache. • The lookup structure must be stored in DRAM for performance. • Structure must be compact, as DRAM is expensive and is used for “Hot Items” storage and more. • Pages are usually replaced in the cache using an LRU algorithm.

  8. ICP – Request Handling [Figure: a client, four proxies each with its own cache, and the Internet.]

  9. Internet Cache Protocol (ICP) • Allows for scaling out when using proxies. • Protocol that supports discovery and retrieval of documents from neighboring caches. • Establishes a hierarchy of proxy caches. • If a page is not found in the local proxy cache, the proxy searches for it in neighboring proxies. • If the page is not found anywhere, it is fetched from the Internet.

  10. ICP – Request Handling [Figure: a client, four proxies each with its own cache, and the Internet.]

  11. Summary Cache • Each proxy maintains a Bloom Filter representing its local cache. • It also holds Bloom Filters representing the caches of other proxies. • Updates to Bloom Filters are exchanged periodically, or after a certain percentage of the documents in the cache has been replaced. • An ICP request is sent only to a proxy whose Bloom Filter indicates that it holds the requested document.
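The peer-selection step can be sketched as follows. This is a simplified illustration under stated assumptions: the filter size `M`, hash count `K`, the SHA-256 based hashing, the peer names, and the URLs are all hypothetical, and the real protocol messages are not modeled.

```python
import hashlib

M, K = 512, 4  # hypothetical filter size and number of hash functions


def indexes(url: str):
    # Assumption: k hash values are derived by salting SHA-256 with an index.
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % M


def build_summary(cached_urls) -> bytearray:
    """Build the Bloom Filter summarizing one proxy's cache contents."""
    bits = bytearray(M)
    for url in cached_urls:
        for idx in indexes(url):
            bits[idx] = 1
    return bits


def peers_to_query(url: str, peer_summaries: dict) -> list:
    """Return the peers whose summary says the URL is probably cached."""
    return [
        peer
        for peer, bits in peer_summaries.items()
        if all(bits[idx] for idx in indexes(url))
    ]


peer_summaries = {
    "proxy-b": build_summary(["http://example.com/a", "http://example.com/b"]),
    "proxy-c": build_summary(["http://example.com/c"]),
}
targets = peers_to_query("http://example.com/c", peer_summaries)
```

If `targets` comes back empty, the local proxy would go straight to the Internet instead of querying every neighbor.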

  12. ICP – With Summary Cache [Figure: a client, four proxies each with its own cache, and the Internet.]

  13. Summary Cache – Bloom Filters • To support deletions and updates, the proxy maintains the Bloom Filter V and also an array of counters C, initially set to 0. • The Bloom Filter is filled with the contents of the cache. • Each bit in the Bloom Filter has a 4-bit counter. • On insertion of item i, all counters C[hj(i)] are increased (to a maximum of 15). • On deletion of item i, the same counters are decreased. • When a counter C[l] increases from 0 to 1, V[l] is turned on. • When a counter C[l] decreases from 1 to 0, V[l] is turned off.
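The counter scheme can be sketched as a small counting Bloom filter. This is an illustrative sketch rather than the Squid code: the class name and the SHA-256 based hash derivation are assumptions, and counters are stored as plain Python integers capped at 15 instead of packed 4-bit fields.

```python
import hashlib


class CountingBloomFilter:
    MAX_COUNT = 15  # a 4-bit counter saturates at 15

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.counters = [0] * m   # C, kept locally
        self.bits = bytearray(m)  # V, the part that would be shared

    def _indexes(self, item: str):
        # Assumption: hash functions are simulated by salting SHA-256.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for idx in self._indexes(item):
            if self.counters[idx] < self.MAX_COUNT:
                self.counters[idx] += 1
            self.bits[idx] = 1  # counter is now at least 1

    def delete(self, item: str) -> None:
        # Only meaningful for items that were inserted earlier. A counter
        # that saturated at 15 may undercount after many deletions.
        for idx in self._indexes(item):
            if self.counters[idx] > 0:
                self.counters[idx] -= 1
                if self.counters[idx] == 0:
                    self.bits[idx] = 0

    def might_contain(self, item: str) -> bool:
        return all(self.bits[idx] for idx in self._indexes(item))


cbf = CountingBloomFilter(m=256, k=4)
cbf.insert("http://example.com/page")
cbf.delete("http://example.com/page")
```

After the matching delete, every counter and bit is back at 0, which a plain bit vector cannot guarantee when several items share a bit.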

  14. Summary Cache – Bloom Filters • Hashing scheme • Generate 128 bits using MD5 on the URL. • Divide them into segments of M bits (usually 32). • Calculate the modulus of each segment by m, providing 128/M hash values (4, for 32-bit segments). • If 128 bits are not enough, calculate MD5 of the URL concatenated with itself. • Bloom Filter Exchange • Header contains MD5 properties and the size of the array. • If refresh rate is high, send only deltas. • Otherwise, send the entire Bloom Filter. • Counters are internal and are not exchanged.
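The hashing scheme can be sketched with Python's `hashlib`. The function name, the big-endian interpretation of each segment, and the reading of "concatenated with itself" as appending the URL once more per extra round are assumptions made for illustration; the slides do not specify these details.

```python
import hashlib


def md5_hash_values(url: str, m: int, k: int, segment_bits: int = 32) -> list:
    """Derive k indices in [0, m) from MD5 digests of the URL."""
    seg_bytes = segment_bits // 8
    url_bytes = url.encode("utf-8")
    data = url_bytes
    values = []
    while len(values) < k:
        digest = hashlib.md5(data).digest()  # 128 bits = 16 bytes
        for start in range(0, len(digest), seg_bytes):
            # Assumption: each segment is read as a big-endian integer.
            segment = int.from_bytes(digest[start:start + seg_bytes], "big")
            values.append(segment % m)
        # Assumption: when more bits are needed, hash url+url, then
        # url+url+url, and so on.
        data += url_bytes
    return values[:k]


four = md5_hash_values("http://www.squid-cache.org", m=1000, k=4)
six = md5_hash_values("http://www.squid-cache.org", m=1000, k=6)
```

With 32-bit segments a single digest yields four values, so `six` needs a second MD5 round while `four` does not.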

  15. Summary Cache - Errors • False Misses • The requested document is cached at some remote proxy, but the summary does not reflect that fact. • The hit ratio is reduced and a redundant Internet access is performed. • False Hits • The document is not at a remote proxy, but the summary suggests that it is. • An inter-proxy query message is wasted. • Remote Stale Hits • The document is cached at a remote proxy, but is stale. • Occurs in both ICP and Summary Cache. • Might not be a totally wasted effort, as delta compression can be used.

  16. Implementation - Squid • Squid – publicly available web proxy cache software. http://www.squid-cache.org • Summary Cache is implemented in Squid v1.1.14. • A variation called cache digest is implemented in Squid 1.2b20.
