Detecting Near-Duplicates for Web Crawling
Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma
Presented by Yen-Yi Hung
Overview
• Introduction
• Previous Related Work
• Algorithm
• Evaluation
• Future Work
• Pros & Cons
• Comment
• Reference
Introduction – The drawbacks of duplicate pages
• Waste network bandwidth
• Slow down crawl refresh times
• Strain politeness constraints
• Increase storage costs
• Degrade the quality of search indexes
• Increase the load on the remote hosts serving such pages
• Hurt customer satisfaction
Introduction – Challenge and contribution of this paper
Challenge:
• Dealing with the sheer scale of the web corpus
• Determining near-duplicates efficiently
Contribution:
• Showing that simhash can handle a huge volume of queries
• Developing a technique that solves the Hamming Distance Problem quickly (for both online single queries and batched multi-queries)
Previous Related Work
Related techniques differ in the corpora they target, their end goals, their feature sets, and their signature schemes.
Corpora:
• Web documents
• Files in a file system
• E-mails
• Domain-specific corpora
Previous Related Work (II)
End goals:
• Web mirrors
• Clustering for "related documents" queries
• Data extraction
• Plagiarism detection
• Spam detection
• Duplicates in domain-specific corpora
Feature sets:
• Shingles from page content
• Document vectors from page content
• Connectivity information
• Anchor text, anchor windows
• Phrases
Previous Related Work (III)
Signature schemes:
• Mod-p shingles
• Min-hash for Jaccard similarity of sets
• Signatures/fingerprints over IR-based document vectors
• Checksums
This paper focuses on web documents. Its goal is to improve web crawling using the simhash technique.
Algorithm – Simhash fingerprinting
• What does simhash do? It maps high-dimensional vectors to small-sized fingerprints.
• The unusual simhash property: similar documents yield similar hash values (traditional hash functions scatter similar inputs across the hash space).
• How is it applied? Convert each web page to a set of weighted features (computed using standard IR techniques) and fingerprint that set, as sketched below.
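A minimal Python sketch of the simhash construction, assuming features arrive as token strings with integer weights; md5 merely stands in for whatever f-bit base hash is used in practice.

```python
import hashlib

def simhash(features, f=64):
    """Compute an f-bit simhash from {feature: weight} pairs.

    Each feature is hashed to an f-bit value. For every bit position,
    the feature's weight is added if that bit is 1 and subtracted if
    it is 0. The sign of each accumulated component yields one bit of
    the final fingerprint.
    """
    mask = (1 << f) - 1
    v = [0] * f
    for feature, weight in features.items():
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16) & mask
        for i in range(f):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(f) if v[i] > 0)

# Pages sharing most weighted features agree on most bit votes, so
# their fingerprints differ in only a few positions.
a = simhash({"near": 2, "duplicate": 3, "detection": 1, "web": 1})
b = simhash({"near": 2, "duplicate": 3, "detection": 1, "crawl": 1})
print(bin(a ^ b).count("1"))  # small Hamming distance, far below f/2
```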
Algorithm – Hamming Distance Problem
Given a collection of f-bit fingerprints and a query fingerprint F, identify whether any existing fingerprint differs from F in at most k bits. Simply probing the whole fingerprint collection per query is impractical, so sorted tables are built in advance:
1. Build t tables T1, T2, …, Tt. Each table Ti is associated with an integer pi and a permutation Πi.
2. Apply permutation Πi to each existing fingerprint in table Ti, then sort Ti.
Algorithm – Hamming Distance Problem
3. Given a query fingerprint F and an integer k bounding the Hamming distance, each table is probed in two steps (see the sketch below):
Step 1: Find all permuted fingerprints in Ti whose top pi bit-positions match the top pi bit-positions of Πi(F).
Step 2: For each fingerprint found in step 1, check whether it differs from Πi(F) in at most k bit-positions.
Time complexity:
• Step 1 can be done in O(pi) steps using binary search; assuming uniformly random fingerprints, interpolation search shrinks it to O(log pi) expected steps.
• Step 2 is a cheap XOR-and-popcount check per surviving candidate.
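A simplified, single-table Python sketch of this probe, using a bit rotation as the permutation Πi; the paper's design uses t tables with carefully chosen pi and Πi, and all names here are illustrative.

```python
from bisect import bisect_left

def hamming(a, b):
    """Hamming distance between two equal-width fingerprints."""
    return bin(a ^ b).count("1")

def rotate(x, r, f=64):
    """Left-rotate an f-bit integer by r bits (one simple permutation)."""
    return ((x << r) | (x >> (f - r))) & ((1 << f) - 1) if r else x

class PermutedTable:
    def __init__(self, fingerprints, r, p, f=64):
        self.r, self.p, self.f = r, p, f
        # Permute every stored fingerprint, then sort (preprocessing).
        self.fps = sorted(rotate(fp, r, f) for fp in fingerprints)

    def query(self, fp, k):
        """True if some stored fingerprint is within k bits of fp."""
        pf = rotate(fp, self.r, self.f)
        prefix = pf >> (self.f - self.p)
        # Step 1: binary search for entries whose top p bits match.
        i = bisect_left(self.fps, prefix << (self.f - self.p))
        while i < len(self.fps) and self.fps[i] >> (self.f - self.p) == prefix:
            # Step 2: verify the full Hamming distance on each candidate.
            if hamming(self.fps[i], pf) <= k:
                return True
            i += 1
        return False

table = PermutedTable([0x1234_5678_9ABC_DEF0, 0xFFFF_0000_FFFF_0000], r=5, p=16)
print(table.query(0x1234_5678_9ABC_DEF1, k=3))  # True: differs in 1 bit
```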
Algorithm – Compression of fingerprints
• Step 1: Store the first fingerprint of the block in its entirety.
• Step 2: Take the XOR of each pair of successive fingerprints and find the position of its most significant 1-bit; call that position h.
• Step 3: Append the Huffman code of h to the block.
• Step 4: Append the bits of the new fingerprint that lie to the right of the most significant 1-bit to the block.
• Step 5: Repeat steps 2-4 until the block (1024 bytes) is full.
A sketch of this delta encoding follows below.
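A Python sketch of the delta encoding above, assuming the fingerprints in a block are distinct and sorted ascending; for brevity, h is emitted as a plain integer token rather than the Huffman code described in the paper.

```python
def encode_block(sorted_fps):
    """Delta-encode a block of sorted, distinct fingerprints.

    The first fingerprint is stored whole. For each successive pair,
    we record h (the position of the most significant 1-bit of their
    XOR) and the h low-order bits of the new fingerprint; the bits
    above position h are unchanged, so they need not be stored.
    """
    first, tokens, prev = sorted_fps[0], [], sorted_fps[0]
    for fp in sorted_fps[1:]:
        h = (prev ^ fp).bit_length() - 1   # most significant differing bit
        tokens.append((h, fp & ((1 << h) - 1)))
        prev = fp
    return first, tokens

def decode_block(first, tokens):
    fps, prev = [first], first
    for h, low_bits in tokens:
        # Ascending order means bit h flips 0 -> 1; lower bits replaced.
        prev = ((prev >> (h + 1)) << (h + 1)) | (1 << h) | low_bits
        fps.append(prev)
    return fps

fps = [0x1A2B3C4D, 0x1A2B3C5F, 0x1A2B4000]
first, tokens = encode_block(fps)
assert decode_block(first, tokens) == fps  # lossless round trip
```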
Algorithm – Batch query implementation
Both file F (the existing fingerprints) and file Q (the batch of query fingerprints) are stored in GFS, a shared-nothing distributed file system. The batch query is split into two phases (sketched schematically below):
• Phase 1: Solve the Hamming distance problem using chunks of F and the entire file Q as input; the outputs are the near-duplicate fingerprints found in each chunk.
• Phase 2: MapReduce removes duplicates from the phase-1 outputs and produces a single sorted file.
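The control flow might look like the schematic Python below; all function names are hypothetical, the MapReduce plumbing is elided, and a real mapper would probe its chunk with the permuted-table structure sketched earlier rather than a linear scan.

```python
def phase1_map(f_chunk, q_fingerprints, k=3):
    """Phase 1 (one mapper): probe every query in Q against one chunk of F.

    A linear popcount scan keeps the sketch short; the paper's mappers
    would consult sorted permuted tables instead.
    """
    return [q for q in q_fingerprints
            if any(bin(q ^ fp).count("1") <= k for fp in f_chunk)]

def phase2_reduce(mapper_outputs):
    """Phase 2: merge mapper outputs, drop duplicates, emit a sorted list."""
    return sorted({q for out in mapper_outputs for q in out})

# Q is broadcast to every mapper, while F is split into chunks.
f_chunks = [[0xAAAA, 0xBBBB], [0xCCCC, 0xAAAB]]
q = [0xAAAA, 0x1234]
print(phase2_reduce(phase1_map(c, q) for c in f_chunks))
# -> [43690], i.e. 0xAAAA: matched in both chunks but emitted once
```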
Evaluation
• Is simhash a reasonable technique for the de-duplication problem?
With k = 3, precision and recall are both ≈ 0.75.
*For comparison, "Finding near-duplicate web pages: a large-scale evaluation of algorithms" (M. R. Henzinger, 2006) reports precision and recall of around 0.8.
Evaluation
• Does the distribution of simhash values affect the results? If so, is the impact significant?
Fig 2(a): the right half follows the expected distribution but the left half does not, because some pages with similar content still show moderate differences in their simhash values.
Fig 2(b): the distribution has spikes caused by empty pages, "file not found" pages, and the near-identical login pages generated by bulletin-board software.
Evaluation
• With 32 GB of batch-query fingerprints and 200 mappers, the combined scan rate can exceed 1 GBps.
• For a fixed number of mappers, the time taken is roughly proportional to the size of file Q. [Compression therefore plays an important role.]
Future Work
Based on this paper:
• Document size
• De-duplication using category information
• Near-duplicate detection vs. clustering
Other research topics:
• A more cost-effective approach that uses only URL information for de-duplication
Pros
• Efficient and practical
• Uses compression and a purpose-built storage design (GFS) to make fingerprint-based de-duplication scale
• Gives a compact yet thorough overview of de-duplication related work
Cons
• Limited accuracy: matching rests on the likelihood of similarity rather than explicit content comparison
• The paper provides no evaluation results comparing against other algorithms
• Despite the compression techniques, the space cost remains a concern
• Content-based de-duplication can only run after pages have been downloaded, so it does not reduce the bandwidth wasted during crawling
Comment
• This technique is good: it offers an efficient way to apply simhash to de-duplication over very large volumes of data. Though not the first paper to address de-duplication at web scale, it does report query sizes from a real-world workload.
Reference
• Paolo Ferragina, Roberto Grossi, Ankur Gupta, Rahul Shah, and Jeffrey Scott Vitter. On searching compressed string collections cache-obliviously. In Proceedings of the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Vancouver, Canada, June 2008.
• Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov. IRLbot: Scaling to 6 billion pages and beyond. ACM Transactions on the Web (TWEB), 3(3):1-34, June 2009.
• Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang. iRobot: An intelligent crawler for web forums. In Proceedings of the 17th International Conference on World Wide Web, Beijing, China, April 2008.
• Anirban Dasgupta, Ravi Kumar, and Amit Sasturkar. De-duping URLs via rewrite rules. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 2008.
• Lian'en Huang, Lei Wang, and Xiaoming Li. Achieving both high precision and high recall in near-duplicate detection. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, Napa Valley, California, USA, October 2008.
• Edith Cohen and Haim Kaplan. Leveraging discarded samples for tighter estimation of multiple-set aggregates. In Proceedings of the 11th International Joint Conference on Measurement and Modeling of Computer Systems, Seattle, WA, USA, June 2009.
• Amit Agarwal, Hema Swetha Koppula, Krishna P. Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, and Amit Sasturkar. URL normalization for de-duplication of web pages. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, Hong Kong, China, November 2009.
• Hema Swetha Koppula, Krishna P. Leela, Amit Agarwal, Krishna Prasad Chitrapura, Sachin Garg, and Amit Sasturkar. Learning URL patterns for webpage de-duplication. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, New York, New York, USA, February 2010.
• M. R. Henzinger. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In SIGIR 2006, pages 284-291, 2006.