
Using Bloom Filters to Refine Web Search Results



Presentation Transcript


  1. Using Bloom Filters to Refine Web Search Results
  Navendu Jain, Mike Dahlin (University of Texas at Austin); Renu Tewari (IBM Almaden Research Center)
  Department of Computer Sciences, UT Austin

  2. Motivation
  • Google, Yahoo, MSN: significant fraction of near-duplicates in top search results
  • Google "emacs manual" query: 7 of 20 results redundant
    • 3 identical pairs
    • 4 similar to one document
    • www.delorie.com/gnu/docs/emacs/emacs_toc.html
    • www.cs.utah.edu/dept/old/texinfo/emacs19/emacs_toc.html
    • www.dc.turkuamk.fi/docs/gnu/emacs/emacs_toc.html
    • www.linuxselfhelp.com/gnu/emacs/html chapter/emacs_toc.html
  • Similar results for Yahoo, MSN, A9 search engines
  • 29.2% of data common across 150 million pages (Fetterly'03, Broder'97)

  3. Problem Statement
  • Goal: filter near-duplicates in web search results
  • Given a query's search results, identify pages that are either
    • Highly similar in content (and link structure), or
    • Contained in another page (inclusions with small changes)
  • Key constraints
    • Low space overhead: use only a small amount of information per document
    • Low time overhead (latency unnoticeable to the end user): perform fast comparison and matching of documents

  4. Our Contributions
  • A novel similarity detection technique using content-defined chunking and Bloom filters to refine web search results
  • Satisfies the key requirements
    • Compact representation: incurs only about 0.4% extra bytes per document
    • Quick matching: 66 ms for the top-80 search results; document similarity computed by a bit-wise AND of their feature representations
    • Easy deployment: attached as a filter over any search engine's result set

  5. Talk Outline
  • Motivation
  • Our approach
    • System overview
    • Bloom filters for similarity testing
  • Experimental evaluation
  • Related work and conclusions

  6. System Overview
  • Applying similarity detection to search engines
  • Crawl time: the web crawler
    • Step 1: fetches a page and indexes it
    • Step 2: computes and stores per-page features
  • Search time: the search engine (or end user's browser)
    • Step 1: retrieves the top results' meta-data for a given query
    • Step 2: runs similarity testing to filter highly similar results
  [Diagram: documents → compact feature sets (small space, low complexity) → fast approximate comparison → flagged as similar when matched features exceed the similarity threshold]
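The search-time step above can be sketched as a greedy pass over the ranked results: keep a result only if it is not highly similar to (or contained in) a result already kept. This is a minimal illustration with plain Python sets standing in for the Bloom-filter comparison; the 0.75 threshold and the `sim` helper are assumptions, not the paper's exact procedure.

```python
def sim(fa: set, fb: set) -> float:
    # Fraction of fa's features also present in fb
    # (a set-based stand-in for the Bloom-filter bit-wise AND).
    return len(fa & fb) / max(len(fa), 1)

def filter_results(results, threshold=0.75):
    """Greedy search-time pass over ranked (url, feature_set) pairs:
    keep a result only if it is not a near-duplicate of one kept earlier.
    Checking sim in both directions also catches containment."""
    kept = []
    for url, feats in results:
        if all(max(sim(feats, kf), sim(kf, feats)) < threshold
               for _, kf in kept):
            kept.append((url, feats))
    return [url for url, _ in kept]
```

For example, a result whose feature set is a subset of an earlier result's is dropped as an inclusion, even though the two sets differ in size.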

  7. Feature Extraction and Similarity Testing (1): Content-defined Chunking
  • Divide a file into variable-sized blocks (called chunks)
  • Use Rabin fingerprints to compute block boundaries
  • Use the SHA-1 hash of each chunk as its feature representation
  [Diagram: an original document split into chunks 1–6; after data is inserted, only the affected chunk (2 → 2') changes, and the remaining chunk boundaries realign]
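The chunking step can be sketched as follows. For brevity, a SHA-1 over a sliding window stands in for the (much faster) Rabin fingerprint the slide names, and the `window`, `mask`, and `min_size` values are illustrative, not the paper's parameters; what matters is that a boundary depends only on a local window of bytes, so boundaries realign after an insertion.

```python
import hashlib

def chunk_features(data: bytes, window=16, mask=0x0F, min_size=32):
    """Split data at content-defined boundaries, then return each chunk's
    SHA-1 hex digest as its feature."""
    features, start = [], 0
    for i in range(window - 1, len(data)):
        # Hash of the last `window` bytes; declare a boundary when its
        # low bits are all zero, so boundaries depend only on local content.
        h = int.from_bytes(
            hashlib.sha1(data[i - window + 1:i + 1]).digest()[:4], "big")
        if i - start + 1 >= min_size and (h & mask) == 0:
            features.append(hashlib.sha1(data[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(data):
        features.append(hashlib.sha1(data[start:]).hexdigest())
    return features
```

Because boundaries are content-defined rather than fixed offsets, inserting data into the middle of a document changes only the features of the chunks it touches; chunks before (and, once boundaries realign, after) the edit keep their features.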

  8. Feature Extraction and Similarity Testing (2): Bloom Filter Generation
  • A Bloom filter is an approximate set representation
    • An array of m bits (initially all 0)
    • k independent hash functions
  • Supports
    • Insert(x, S)
    • Member(y, S)
  [Diagram: a document's SHA-1 chunk features x and y inserted into the bit array via Insert(x, S) and Insert(y, S)]
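A minimal sketch of the standard construction described above, with each chunk feature inserted by setting k bit positions; `m` and `k` here are illustrative values, not the paper's.

```python
import hashlib

class BloomFilter:
    """An m-bit array with k hash functions, stored as a Python int."""
    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item: str):
        # Derive k independent positions by salting SHA-1 with the index.
        for i in range(self.k):
            d = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:4], "big") % self.m

    def insert(self, item: str):
        for p in self._positions(item):
            self.bits |= 1 << p

    def member(self, item: str) -> bool:
        # May return a false positive, never a false negative.
        return all((self.bits >> p) & 1 for p in self._positions(item))
```

Membership tests can report a false positive with small probability, but never a false negative, which is why the representation is "approximate" yet safe for detecting shared chunks.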

  9. Feature Extraction and Similarity Testing (3): Similarity Testing
  • Compute the bit-wise AND of two documents' Bloom filters (A ∧ B)
  • Measure the fraction of A's set bits that survive; e.g. 75% of A's set bits matched
  [Diagram: Bloom filters of documents A and B combined by bit-wise AND; the fraction of A's set bits present in the result measures similarity]
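The comparison itself reduces to integer bit operations, which is what makes matching fast. A sketch, reusing the salted-SHA-1 construction (parameters illustrative):

```python
import hashlib

def bloom_bits(features, m=1024, k=4):
    """Pack a document's chunk features into an m-bit Bloom filter,
    represented as a Python int."""
    bits = 0
    for f in features:
        for i in range(k):
            d = hashlib.sha1(f"{i}:{f}".encode()).digest()
            bits |= 1 << (int.from_bytes(d[:4], "big") % m)
    return bits

def similarity(a: int, b: int) -> float:
    """Fraction of A's set bits that survive the bit-wise AND with B."""
    return bin(a & b).count("1") / max(bin(a).count("1"), 1)
```

Two documents sharing most chunks yield filters whose AND preserves most of A's set bits, while unrelated documents overlap only by chance collisions.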

  10. Proof-of-concept Examples: Differentiating between multiple similar documents
  • IBM site (http://www.ibm.com) dataset
    • 20 MB (590 documents)
    • /investor/corpgoverance/index.phtml compared with all pages
    • Similar pages (same base URL): cgcoi.phtml (53%), cgblaws.phtml (69%)
  • CVS repository dataset
    • Technical doc. file (17 KB)
    • Extracted 20 consecutive versions from the CVS
    • foo → original document; foo.1 → first modified version; … foo.19 → last version

  11. Talk Outline
  • Motivation
  • Our approach
    • System overview
    • Bloom filters for similarity testing
  • Experimental evaluation
  • Related work and conclusions

  12. Evaluation (1): Effect of Degree of Similarity
  • "emacs manual" query on Google; 493 results retrieved using the Google API
  • Fraction of duplicates: 88% (at a 50% similarity threshold), 31% (at 90%)
  • Larger aliasing of higher-ranked documents: the initial result set is repeated more frequently in later results
  • Similar results observed for other queries
  [Plot: percentage of duplicate documents vs. number of top search results retrieved (0–500), one curve per similarity threshold from 50% to 90%]

  13. Evaluation (2): Effect of Search Query Popularity
  [Plots: percentage of duplicate documents vs. number of top search results retrieved, for queries of varying popularity: "jon stewart crossfire", "republican national convention", "hawking black hole bet", "day of the dead", "national hurricane center", "electoral college", "Olympics 2004 doping", "x prize spaceship", "indian larry"]

  14. Evaluation (3): Analyzing Response Times
  • Top-80 search results for the "emacs manual" query
  • Offline computation time (pre-computed and stored)
    • CDC chunks: 80 × 0.3 ms = 24 ms
    • Bloom filter generation: 80 × 14 ms = 1120 ms
  • Online matching time
    • Bit-wise AND of two Bloom filters: 4 µs
    • Matching and clustering time: 66 ms
  • Total (offline + online): 1210 ms; online time alone: 66 ms

  15. Selected Related Work
  • Most prior work is based on shingling (many variants)
  • Basic idea (Broder'97):
    • Divide the document into k-shingles: all k consecutive words/tokens
    • Represent the document by its shingle set
    • Large shingle-set intersection → near-duplicate documents
    • Reduces similarity detection to set intersection
  • Differences from our technique:
    • Document similarity based on feature-set intersection
    • Higher feature-set computational overhead
    • Feature-set size dependent on sampling (Min-s, Mod-m, etc.)
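For contrast with the CDC-plus-Bloom-filter approach, the shingling baseline described above can be sketched in a few lines (k = 4 is an illustrative choice; production systems sample the shingle set rather than keep it whole):

```python
def shingles(text: str, k: int = 4):
    """All k-word shingles of a document (Broder-style)."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a: str, b: str, k: int = 4) -> float:
    """Set-intersection similarity of two documents' shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```

Even a one-word edit invalidates the k shingles that span it, which is why the full shingle set is large and sampling schemes (and their tuning) matter for shingling's overhead.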

  16. Conclusions
  • Problem: highly similar matches in search results
    • Popular search engines (Google, Yahoo, MSN) return a significant fraction of near-duplicates in their top results
    • This adversely affects query search performance
  • Our solution: a similarity detection technique using CDC and Bloom filters
    • Incurs small meta-data overhead: about 0.4% extra bytes per document
    • Performs fast similarity detection: bit-wise AND operations, on the order of milliseconds
    • Easily deployed as a filter over any search engine's results

  17. For more information: http://www.cs.utexas.edu/users/nav/
