
Detecting Near-Duplicates for Web Crawling



Presentation Transcript


  1. Detecting Near-Duplicates for Web Crawling Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma Presented by Yen-Yi Hung

  2. Overview • Introduction • Previous Related Work • Algorithm • Evaluation • Future Work • Pros & Cons • Comment • Reference

  3. Introduction – The drawbacks of near-duplicate pages • Waste network bandwidth • Affect refresh times • Impact politeness constraints • Increase storage costs • Degrade the quality of search indexes • Increase the load on the remote hosts that serve such pages • Hurt customer satisfaction

  4. Introduction – Challenges and contributions of this paper Challenges: • Dealing with the scale of the web • Determining near-duplicates efficiently Contributions: • Showing that simhash can cope with a huge volume of queries • Developing a fast solution to the Hamming Distance Problem (for both online single queries and batch multi-queries)

  5. Previous Related Work Related techniques differ in the corpus they target, their end goals, their feature sets, and their signature schemes. Corpus: • Web documents • Files in a file system • E-mails • Domain-specific corpora

  6. Previous Related Work (II) End Goals: • Web mirrors • Clustering of related documents • Data extraction • Plagiarism detection • Spam detection • Duplicates in domain-specific corpora Feature Sets: • Shingles from page content • Document vectors from page content • Connectivity information • Anchor text and anchor windows • Phrases

  7. Previous Related Work (III) Signature Schemes: • Mod-p shingles • Min-hash for Jaccard similarity of sets • Signatures/fingerprints over IR-based document vectors • Checksums This paper focuses on web documents. Its goal is to improve web crawling using the simhash technique.

  8. Algorithm – Simhash fingerprinting • What does Simhash do? It maps high-dimensional vectors to small-sized fingerprints. • The unusual Simhash property: similar documents have similar hash values, unlike with a typical hash function. • How is it applied? Each web page is converted to a set of weighted features (computed using standard IR techniques) and hashed into an f-bit fingerprint.
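As a minimal sketch of the idea, assuming 64-bit fingerprints, MD5 truncated to 64 bits as a stand-in token hash, and made-up feature weights (the paper prescribes none of these specifics):

```python
import hashlib

def simhash(weighted_features, f=64):
    """Map weighted features to an f-bit fingerprint (sketch)."""
    v = [0.0] * f
    for feature, weight in weighted_features.items():
        # MD5 truncated to f bits stands in for the token hash; the paper
        # does not tie simhash to any particular hash function.
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            # Bit i of the feature hash pushes component i up or down by the weight.
            v[i] += weight if (h >> i) & 1 else -weight
    # The sign of each component gives one bit of the final fingerprint.
    return sum(1 << i for i in range(f) if v[i] > 0)

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

# Pages sharing most weighted features end up close in Hamming distance,
# which an ordinary hash function would not guarantee.
doc_a = {"near": 2.0, "duplicate": 3.0, "web": 1.0, "crawling": 1.0}
doc_b = {"near": 2.0, "duplicate": 3.0, "web": 1.0, "crawler": 1.0}
print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```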

  9. Algorithm – Hamming Distance Problem Hamming Distance Problem: given a collection of f-bit fingerprints and a query fingerprint F, identify whether any existing fingerprint differs from F in at most k bits. Simply probing the whole fingerprint collection for every query is impractical, so we precompute as follows: 1. Build t tables T1, T2, …, Tt. Each table Ti is associated with an integer pi and a permutation Πi. 2. Apply permutation Πi to every existing fingerprint, store the results in Ti, and sort each Ti.
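A rough sketch of this precomputation for one table, assuming 64-bit fingerprints and permutations that simply reorder 16-bit blocks (the block size and the choice of permutations are illustrative assumptions, not the paper's fixed parameters):

```python
F_BITS = 64      # fingerprint width f (assumed)
BLOCK_BITS = 16  # split each fingerprint into 4 blocks of 16 bits (assumed)

def apply_permutation(fp, block_order):
    """Permute a fingerprint by reordering its 16-bit blocks.

    block_order lists which original block lands in each position, most
    significant first; e.g. (2, 0, 1, 3) moves block 2 to the top.
    """
    permuted = 0
    for dst, src in enumerate(block_order):
        block = (fp >> (F_BITS - (src + 1) * BLOCK_BITS)) & ((1 << BLOCK_BITS) - 1)
        permuted |= block << (F_BITS - (dst + 1) * BLOCK_BITS)
    return permuted

def build_table(fingerprints, block_order):
    """Table T_i: every existing fingerprint permuted by pi_i, then sorted."""
    return sorted(apply_permutation(fp, block_order) for fp in fingerprints)
```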

  10. Algorithm – Hamming Distance Problem Given a query fingerprint F and the Hamming distance threshold k, each table is probed in two steps: Step 1: Find all permuted fingerprints in Ti whose top pi bit-positions match the top pi bit-positions of Πi(F). Step 2: For each fingerprint found in Step 1, check whether it differs from Πi(F) in at most k bit-positions. Time Complexity: • Step 1 can be done in O(pi) steps using binary search. • Assuming uniformly random fingerprints, interpolation search shrinks this search to O(log pi) expected steps.
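A sketch of both query steps over a single table, reusing F_BITS, apply_permutation, and build_table from the sketch above; the p_i = 16 prefix width and the sample fingerprints are illustrative values, not the paper's parameters:

```python
import bisect

def query_table(table, query_fp, block_order, p_i, k):
    """Probe one sorted table T_i for fingerprints within Hamming distance k."""
    pq = apply_permutation(query_fp, block_order)
    prefix = pq >> (F_BITS - p_i)
    # Step 1: binary search for the first entry whose top p_i bits match.
    start = bisect.bisect_left(table, prefix << (F_BITS - p_i))
    matches = []
    for fp in table[start:]:
        if fp >> (F_BITS - p_i) != prefix:   # left the matching-prefix range
            break
        if bin(fp ^ pq).count("1") <= k:     # Step 2: full Hamming-distance check
            matches.append(fp)
    return matches

# One table with p_i = 16: the top block must match exactly, so the k <= 3
# differing bits are confined to the remaining 48 bit-positions.
existing = [0x123456789ABCDEF0, 0x123456789ABCDEF3, 0x0FEDCBA987654321]
order = (0, 1, 2, 3)                         # identity permutation for this table
table = build_table(existing, order)
print(query_table(table, 0x123456789ABCDEF1, order, 16, 3))
```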

  11. Algorithm – Compression of Fingerprints • Step 1: The first fingerprint in the block is stored in its entirety. • Step 2: Find the position of the most significant 1-bit in the XOR of two successive fingerprints; denote it h. • Step 3: Append the Huffman code of h to the block. • Step 4: Append the bits to the right of the most significant 1-bit to the block. • Step 5: Repeat Steps 2–4 until a block (1024 bytes) is full.
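A sketch of this delta encoding as a bit string, assuming the fingerprints in a block are already sorted and distinct; a fixed 6-bit code for h stands in for the Huffman code, and the 1024-byte block boundary is left out:

```python
F_BITS = 64  # fingerprint width (assumed)

def encode_block(sorted_fps):
    """Delta-encode one block of sorted, distinct fingerprints (sketch)."""
    out = [format(sorted_fps[0], f"0{F_BITS}b")]        # Step 1: first fingerprint stored whole
    for prev, cur in zip(sorted_fps, sorted_fps[1:]):
        h = (prev ^ cur).bit_length() - 1               # Step 2: MSB position of the XOR
        out.append(format(h, "06b"))                    # Step 3: 6-bit stand-in for the Huffman code of h
        if h:                                           # Step 4: bits of cur to the right of position h
            out.append(format(cur & ((1 << h) - 1), f"0{h}b"))
    return "".join(out)

# A decoder rebuilds each fingerprint from the previous one: keep the bits
# above position h, set bit h to 1, and splice in the stored low-order bits.
print(encode_block([0x10, 0x13, 0x2F]))
```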

  12. Algorithm – Batch query implementation Both file F (the existing fingerprints) and file Q (the batch of query fingerprints) are stored in GFS, a shared-nothing distributed file system. The batch queries are split into two phases: • Phase 1: Solve the Hamming distance problem over chunks of F, with the entire file Q as input to each task; the output is the set of near-duplicate fingerprints. • Phase 2: MapReduce removes duplicates from the Phase 1 results and produces a single sorted file.
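A toy illustration of the two phases, with a brute-force Hamming check standing in for the table probing described above and small in-memory lists standing in for files F and Q in GFS; chunks_of_F and file_Q are made-up inputs:

```python
def hamming(a, b):
    return bin(a ^ b).count("1")

def phase1_map(chunk_of_F, all_of_Q, k=3):
    """Phase 1 (one mapper): probe one chunk of file F against all of file Q
    and emit the query fingerprints that have a near-duplicate in this chunk.
    A brute-force scan stands in for the table-probing algorithm."""
    return [q for q in all_of_Q if any(hamming(q, f) <= k for f in chunk_of_F)]

def phase2(mapper_outputs):
    """Phase 2 (reduce): merge mapper outputs, drop duplicate hits, and
    produce a single sorted result, as the MapReduce step does in the paper."""
    merged = set()
    for out in mapper_outputs:
        merged.update(out)
    return sorted(merged)

# Each mapper would scan a different chunk of F; duplicates arise when the
# same query matches fingerprints in more than one chunk.
chunks_of_F = [[0b10110000, 0b01010101], [0b10110011]]
file_Q = [0b10110001, 0b11111111]
print(phase2(phase1_map(c, file_Q, k=2) for c in chunks_of_F))
```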

  13. Evaluation • Is simhash a reasonable technique for de-duplication? Choosing k = 3 gives precision and recall of roughly 0.75. *For comparison, "Finding near-duplicate web pages: a large-scale evaluation of algorithms" (M. R. Henzinger, 2006) reports precision and recall of around 0.8.

  14. Evaluation • Does the nature of simhash itself affect the results, and if so, is the impact significant? Fig. 2(a): the right half of the distribution behaves as expected but the left half does not, because some pages with similar content still show a moderate difference in their simhash values. Fig. 2(b): the distribution has spikes caused by empty pages, file-not-found pages, and the near-identical login pages of some bulletin-board software.

  15. Evaluation • For 32 GB of batch query fingerprints processed with 200 mappers, the combined scan rate can exceed 1 GBps. • Given a fixed number of mappers, the time taken is roughly proportional to the size of file Q. [Compression plays an important role here.]

  16. Future Work Based on this paper: • Document size • De-duplication using category information • Near-duplicate detection vs. clustering Other research topics: • A more cost-effective approach that uses only URL information for de-duplication

  17. Pros Pros: • Efficient and practical • Uses compression and a specific distributed storage design (GFS) to make fingerprint-based de-duplication tractable • Gives a compact but thorough survey of de-duplication-related work

  18. Cons Cons: • Limited accuracy: the decision is based not on explicit content matching but on the likelihood of similarity • The paper does not provide evaluation results compared against other algorithms • Although compression techniques are provided, the space cost remains a concern • Content-based de-duplication can only run after the web pages have been downloaded, so it does not reduce the bandwidth wasted during crawling

  19. Comment • This technique is solid: it provides an efficient way of using simhash to solve de-duplication for a large volume of data. Although it is not the first paper to target a large collection of web pages, it does report query sizes drawn from a real-world production workload.

  20. Reference • Paolo Ferragina, Roberto Grossi, Ankur Gupta, Rahul Shah, Jeffrey Scott Vitter, On searching compressed string collections cache-obliviously, Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 09-12, 2008, Vancouver, Canada • Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, Dmitri Loguinov, IRLbot: Scaling to 6 billion pages and beyond, ACM Transactions on the Web (TWEB), v.3 n.3, p.1-34, June 2009 • Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, Lei Zhang, iRobot: an intelligent crawler for web forums, Proceedings of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China • Anirban Dasgupta, Ravi Kumar, Amit Sasturkar, De-duping URLs via rewrite rules, Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA • Lian'en Huang, Lei Wang, Xiaoming Li, Achieving both high precision and high recall in near-duplicate detection, Proceedings of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA • Edith Cohen, Haim Kaplan, Leveraging discarded samples for tighter estimation of multiple-set aggregates, Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems, June 15-19, 2009, Seattle, WA, USA • Amit Agarwal, Hema Swetha Koppula, Krishna P. Leela, Krishna Prasad Chitrapura, Sachin Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, Amit Sasturkar, URL normalization for de-duplication of web pages, Proceedings of the 18th ACM conference on Information and knowledge management, November 02-06, 2009, Hong Kong, China • Hema Swetha Koppula, Krishna P. Leela, Amit Agarwal, Krishna Prasad Chitrapura, Sachin Garg, Amit Sasturkar, Learning URL patterns for webpage de-duplication, Proceedings of the third ACM international conference on Web search and data mining, February 04-06, 2010, New York, New York, USA • M. R. Henzinger, Finding near-duplicate web pages: a large-scale evaluation of algorithms, SIGIR 2006, pages 284-291, 2006.
