
Using Bloom Filters to Refine Web Search Results



Presentation Transcript


  1. Using Bloom Filters to Refine Web Search Results
  Navendu Jain, Mike Dahlin (University of Texas at Austin); Renu Tewari (IBM Almaden Research Center)
  Department of Computer Sciences, UT Austin

  2. Motivation
  • Google, Yahoo, MSN: significant fraction of near-duplicates in top search results
  • Google "emacs manual" query: 7 of 20 results redundant
    • 3 identical pairs
    • 4 similar to one document
    • www.delorie.com/gnu/docs/emacs/emacs_toc.html
    • www.cs.utah.edu/dept/old/texinfo/emacs19/emacs_toc.html
    • www.dc.turkuamk.fi/docs/gnu/emacs/emacs_toc.html
    • www.linuxselfhelp.com/gnu/emacs/html chapter/emacs_toc.html
  • Similar results for Yahoo, MSN, A9 search engines
  • 29.2% of data common across 150 million pages (Fetterly'03, Broder'97)

  3. Problem Statement
  • Goal: filter near-duplicates in web search results
  • Given a query's search results, identify pages that are either
    • Highly similar in content (and link structure), or
    • Contained in another page (inclusions with small changes)
  • Key constraints
    • Low space overhead: use only a small amount of information per document
    • Low time overhead (latency unnoticeable to the end user): perform fast comparison and matching of documents

  4. Our Contributions
  • A novel similarity detection technique using content-defined chunking and Bloom filters to refine web search results
  • Satisfies the key requirements
    • Compact representation: incurs only about 0.4% extra bytes per document
    • Quick matching: 66 ms for the top-80 search results; document similarity computed by a bit-wise AND of their feature representations
    • Easy deployment: attached as a filter over any search engine's result set

  5. Talk Outline
  • Motivation
  • Our approach
    • System overview
    • Bloom filters for similarity testing
  • Experimental evaluation
  • Related work and conclusions

  6. System Overview
  • Applying similarity detection to search engines
  • Crawl time: the web crawler
    • Step 1: fetches a page and indexes it
    • Step 2: computes and stores per-page features
  • Search time: the search engine (or end user's browser)
    • Step 1: retrieves the top results' meta-data for a given query
    • Step 2: runs similarity testing to filter highly similar results
  [Diagram: documents → compact feature sets (small space, low complexity) → fast approximate comparison → flagged as similar when matched features exceed the similarity threshold]
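The search-time step above can be sketched as a greedy pass over the ranked results: keep a result only if it is not highly similar to (or contained in) a result already kept. This is a minimal illustration with plain Python sets standing in for the Bloom-filter comparison; the 0.75 threshold and the `sim` helper are assumptions, not the paper's exact procedure.

```python
def sim(fa: set, fb: set) -> float:
    # Fraction of fa's features also present in fb
    # (a set-based stand-in for the Bloom-filter bit-wise AND).
    return len(fa & fb) / max(len(fa), 1)

def filter_results(results, threshold=0.75):
    """Greedy search-time pass over ranked (url, feature_set) pairs:
    keep a result only if it is not a near-duplicate of one kept earlier.
    Checking sim in both directions also catches containment."""
    kept = []
    for url, feats in results:
        if all(max(sim(feats, kf), sim(kf, feats)) < threshold
               for _, kf in kept):
            kept.append((url, feats))
    return [url for url, _ in kept]
```

For example, a result whose feature set is a subset of an earlier result's is dropped as an inclusion, even though the two sets differ in size.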

  7. Feature Extraction and Similarity Testing (1): Content-defined Chunking
  • Divide a file into variable-sized blocks (called chunks)
  • Use Rabin fingerprints to compute block boundaries
  • Use the SHA-1 hash of each chunk as its feature representation
  [Diagram: an original document split into chunks 1–6; after data is inserted, only the affected chunk (2 → 2') changes, and the remaining chunk boundaries realign]
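The chunking step can be sketched as follows. For brevity, a SHA-1 over a sliding window stands in for the (much faster) Rabin fingerprint the slide names, and the `window`, `mask`, and `min_size` values are illustrative, not the paper's parameters; what matters is that a boundary depends only on a local window of bytes, so boundaries realign after an insertion.

```python
import hashlib

def chunk_features(data: bytes, window=16, mask=0x0F, min_size=32):
    """Split data at content-defined boundaries, then return each chunk's
    SHA-1 hex digest as its feature."""
    features, start = [], 0
    for i in range(window - 1, len(data)):
        # Hash of the last `window` bytes; declare a boundary when its
        # low bits are all zero, so boundaries depend only on local content.
        h = int.from_bytes(
            hashlib.sha1(data[i - window + 1:i + 1]).digest()[:4], "big")
        if i - start + 1 >= min_size and (h & mask) == 0:
            features.append(hashlib.sha1(data[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(data):
        features.append(hashlib.sha1(data[start:]).hexdigest())
    return features
```

Because boundaries are content-defined rather than fixed offsets, inserting data into the middle of a document changes only the features of the chunks it touches; chunks before (and, once boundaries realign, after) the edit keep their features.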

  8. Feature Extraction and Similarity Testing (2): Bloom Filter Generation
  • A Bloom filter is an approximate set representation
    • An array of m bits (initially all 0)
    • k independent hash functions
  • Supports
    • Insert(x, S)
    • Member(y, S)
  [Diagram: a document's SHA-1 chunk features x and y inserted into the bit array via Insert(x, S) and Insert(y, S)]
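A minimal sketch of the standard construction described above, with each chunk feature inserted by setting k bit positions; `m` and `k` here are illustrative values, not the paper's.

```python
import hashlib

class BloomFilter:
    """An m-bit array with k hash functions, stored as a Python int."""
    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item: str):
        # Derive k independent positions by salting SHA-1 with the index.
        for i in range(self.k):
            d = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:4], "big") % self.m

    def insert(self, item: str):
        for p in self._positions(item):
            self.bits |= 1 << p

    def member(self, item: str) -> bool:
        # May return a false positive, never a false negative.
        return all((self.bits >> p) & 1 for p in self._positions(item))
```

Membership tests can report a false positive with small probability, but never a false negative, which is why the representation is "approximate" yet safe for detecting shared chunks.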

  9. Feature Extraction and Similarity Testing (3): Similarity Testing
  • Compute the bit-wise AND of two documents' Bloom filters (A ∧ B)
  • Measure the fraction of A's set bits that survive; e.g. 75% of A's set bits matched
  [Diagram: Bloom filters of documents A and B combined by bit-wise AND; the fraction of A's set bits present in the result measures similarity]
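The comparison itself reduces to integer bit operations, which is what makes matching fast. A sketch, reusing the salted-SHA-1 construction (parameters illustrative):

```python
import hashlib

def bloom_bits(features, m=1024, k=4):
    """Pack a document's chunk features into an m-bit Bloom filter,
    represented as a Python int."""
    bits = 0
    for f in features:
        for i in range(k):
            d = hashlib.sha1(f"{i}:{f}".encode()).digest()
            bits |= 1 << (int.from_bytes(d[:4], "big") % m)
    return bits

def similarity(a: int, b: int) -> float:
    """Fraction of A's set bits that survive the bit-wise AND with B."""
    return bin(a & b).count("1") / max(bin(a).count("1"), 1)
```

Two documents sharing most chunks yield filters whose AND preserves most of A's set bits, while unrelated documents overlap only by chance collisions.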

  10. Proof-of-concept Examples: Differentiating between multiple similar documents
  • IBM site (http://www.ibm.com) dataset
    • 20 MB (590 documents)
    • /investor/corpgoverance/index.phtml compared with all pages
    • Similar pages (same base URL): cgcoi.phtml (53%), cgblaws.phtml (69%)
  • CVS repository dataset
    • Technical doc. file (17 KB)
    • Extracted 20 consecutive versions from the CVS
    • foo → original document; foo.1 → first modified version; … foo.19 → last version

  11. Talk Outline
  • Motivation
  • Our approach
    • System overview
    • Bloom filters for similarity testing
  • Experimental evaluation
  • Related work and conclusions

  12. Evaluation (1): Effect of Degree of Similarity
  • "emacs manual" query on Google; 493 results retrieved using the Google API
  • Fraction of duplicates: 88% (at a 50% similarity threshold), 31% (at 90%)
  • Larger aliasing of higher-ranked documents: the initial result set is repeated more frequently in later results
  • Similar results observed for other queries
  [Plot: percentage of duplicate documents vs. number of top search results retrieved (0–500), one curve per similarity threshold from 50% to 90%]

  13. Evaluation (2): Effect of Search Query Popularity
  [Plots: percentage of duplicate documents vs. number of top search results retrieved, for queries of varying popularity: "jon stewart crossfire", "republican national convention", "hawking black hole bet", "day of the dead", "national hurricane center", "electoral college", "Olympics 2004 doping", "x prize spaceship", "indian larry"]

  14. Evaluation (3): Analyzing Response Times
  • Top-80 search results for the "emacs manual" query
  • Offline computation time (pre-computed and stored)
    • CDC chunks: 80 × 0.3 ms = 24 ms
    • Bloom filter generation: 80 × 14 ms = 1120 ms
  • Online matching time
    • Bit-wise AND of two Bloom filters: 4 µs
    • Matching and clustering time: 66 ms
  • Total (offline + online): 1210 ms; online time alone: 66 ms

  15. Selected Related Work
  • Most prior work is based on shingling (many variants)
  • Basic idea (Broder'97):
    • Divide the document into k-shingles: all k consecutive words/tokens
    • Represent the document by its shingle set
    • Large shingle-set intersection → near-duplicate documents
    • Reduces similarity detection to set intersection
  • Differences from our technique:
    • Document similarity based on feature-set intersection
    • Higher feature-set computational overhead
    • Feature-set size dependent on sampling (Min-s, Mod-m, etc.)
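For contrast with the CDC-plus-Bloom-filter approach, the shingling baseline described above can be sketched in a few lines (k = 4 is an illustrative choice; production systems sample the shingle set rather than keep it whole):

```python
def shingles(text: str, k: int = 4):
    """All k-word shingles of a document (Broder-style)."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a: str, b: str, k: int = 4) -> float:
    """Set-intersection similarity of two documents' shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```

Even a one-word edit invalidates the k shingles that span it, which is why the full shingle set is large and sampling schemes (and their tuning) matter for shingling's overhead.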

  16. Conclusions
  • Problem: highly similar matches in search results
    • Popular search engines (Google, Yahoo, MSN) return a significant fraction of near-duplicates in their top results
    • This adversely affects query search performance
  • Our solution: a similarity detection technique using CDC and Bloom filters
    • Incurs small meta-data overhead: about 0.4% extra bytes per document
    • Performs fast similarity detection: bit-wise AND operations, on the order of milliseconds
    • Easily deployed as a filter over any search engine's results

  17. For more information: http://www.cs.utexas.edu/users/nav/
