IRLbot: Scaling to 6 Billion Pages and Beyond

  1. IRLbot: Scaling to 6 Billion Pages and Beyond Presented by Rohit Tummalapalli and Sashank Jupudi

  2. Agenda • Introduction • Challenges • Overcoming the challenges • Scalability • Reputation and spam • Politeness • Experimental results • Conclusion

  3. Introduction • Search engines consist of two fundamental components • web crawlers • data miners • What is a web crawler? • Challenges • scalability • spam avoidance • politeness 

  4. Challenges - Scalability • Crawling the Internet involves a trade-off between scalability (N), performance (S), and resource usage (E). • Previous work? • Our goal

  5. Challenges - spam avoidance • The crawler gets "bogged down" in the synchronized delay attacks of certain spammers. • Prior research? • Our goal

  6. Challenges - politeness • Web crawlers often get in trouble with webmasters for slowing down their servers. • Even with spam avoidance, the entire RAM eventually fills with URLs from a small set of hosts and the crawler simply chokes on its own politeness. • Previous algorithms do not address this. • Our goal

  7. Scalability • The bottleneck is the cost of verifying the uniqueness of URLs and their compliance with robots.txt. • Need an algorithm that can perform very fast checks for large N. • The authors introduce DRUM (Disk Repository with Update Management). • DRUM stores large volumes of arbitrary data on disk and implements very fast checks using bucket sort. • More efficient and faster than prior disk-based methods.

  8. Disk Repository with Update Management • The purpose of DRUM is to allow efficient storage of large collections of <key, value> pairs ━ key is a unique identifier of some data ━ value is arbitrary information attached to the key • Three supported operations: check, update, check+update
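
To make the three operations concrete, here is a minimal in-memory stand-in for their semantics (an illustration only: the real DRUM batches operations through disk buckets instead of keeping a Python dict in RAM, and results are reported asynchronously):

    # In-memory stand-in for DRUM's operation semantics (illustrative only).
    class DrumSemantics:
        def __init__(self):
            self.store = {}                              # plays the role of the disk cache Z

        def check(self, key, aux=None):
            # Has this key been seen before?
            return ("DUPLICATE" if key in self.store else "UNIQUE", aux)

        def update(self, key, value, aux=None):
            # Unconditionally merge <key, value> into the store.
            self.store[key] = value

        def check_update(self, key, value, aux=None):
            # Combined operation: classify the key and store its value.
            status, _ = self.check(key, aux)
            self.store[key] = value
            return (status, aux)

    # Example: a URL-seen test keyed by a hash of the URL, with the URL as aux data.
    url_seen = DrumSemantics()
    url_seen.check_update(hash("http://example.com/"), b"", aux="http://example.com/")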

  9. cont.. • Input: a continuous stream of <key, value, aux> tuples. • DRUM spreads the <key, value> pairs across K disk buckets according to their key. • All buckets are pre-allocated on disk before use.

  10. cont.. • Once the largest bucket reaches size r < R, the following process is repeated for all i = 1..K: 1) bucket QiH is read into the bucket buffer; 2) the disk cache Z is sequentially read and compared with the bucket buffer; 3) the disk cache Z is updated.
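
A minimal sketch of this merge pass, assuming a simplified layout (Python lists stand in for the pre-allocated bucket files and the disk cache Z; the real implementation keeps keys and aux data in separate bucket files and scans Z strictly sequentially):

    # Simplified DRUM bucket merge (lists stand in for on-disk structures).
    NUM_BUCKETS = 4                                      # K buckets, chosen by key bits

    def bucket_of(key):
        return key % NUM_BUCKETS                         # assumed bucket-selection rule

    def merge_bucket(bucket, cache_z):
        """bucket:  accumulated (key, value, op) tuples of one bucket buffer.
        cache_z: sorted list of (key, value) pairs, the persistent disk cache Z.
        Returns (results, new_cache_z)."""
        # 1) Read the bucket into memory and sort it by key so it can be
        #    compared against Z in a single sequential pass.
        buf = sorted(bucket, key=lambda t: t[0])
        existing = dict(cache_z)                         # stands in for the scan of Z
        results = []
        for key, value, op in buf:
            duplicate = key in existing
            if op in ("check", "check+update"):          # 2) compare with Z
                results.append((key, "DUPLICATE" if duplicate else "UNIQUE"))
            if op in ("update", "check+update"):         # 3) update Z
                existing[key] = value
        return results, sorted(existing.items())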

  11. DRUM used for storing crawler data DRUM objects • URLseen ━ check+update operation • RobotsCache ━ check and update operations ━ caches robots.txt files • RobotsRequested ━ stores hashes of sites for which robots.txt has been requested

  12. Organisation of IRLbot • URLseen discards duplicate URLs; unique ones are sent for the BUDGET and STAR checks. • URLs pass or fail based on matching against robots.txt. • URLs are sent back to Qr, and their host names are passed through RobotsRequested. • Site hashes not already present are fed into Qd, which performs DNS lookups.
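
A loose sketch of that routing, reusing the DrumSemantics stand-in from the earlier sketch (queue and helper names such as q_ready, q_dns, and passes_budget are assumptions made for illustration; IRLbot's actual queue handling is more involved):

    # Loose sketch of the URL routing described on this slide (names assumed).
    def passes_budget(url):
        return True                                      # stand-in for the BUDGET/STAR check

    def route_urls(parsed_urls, url_seen, robots_cache, robots_requested, q_ready, q_dns):
        for url in parsed_urls:
            # URLseen: check+update discards duplicates and records new URLs.
            status, _ = url_seen.check_update(hash(url), b"", aux=url)
            if status == "DUPLICATE" or not passes_budget(url):
                continue
            host = url.split("/", 3)[2]                  # crude host extraction for illustration
            robots_status, _ = robots_cache.check(hash(host), aux=url)
            if robots_status == "UNIQUE":
                # robots.txt not cached yet: remember the request; new hosts need DNS.
                requested, _ = robots_requested.check_update(hash(host), b"", aux=host)
                if requested == "UNIQUE":
                    q_dns.append(host)
            q_ready.append(url)                          # queued for download / robots re-check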

  13. Overhead metric • ω: the number of bytes written to/read from disk during uniqueness checks of the lN URLs parsed from N pages ━ α is the number of times these URLs are written to/read from disk ━ blN is the total number of bytes in all parsed URLs (b bytes per URL, l links per page) • This determines the maximum download rate (in pages/s) supported by the disk portion of URL uniqueness checking.
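
Combining the two definitions above (this is only the dimensional reasoning implied by the slide; the paper's exact expression may differ):

    ω = α · b·l·N   bytes of disk traffic to check the lN URLs parsed from N pages.

    If W denotes the average disk read/write speed in bytes/s (an assumed symbol,
    not defined on this slide), the disk portion of uniqueness checking can sustain
    at most roughly

        S_disk ≈ W·N / ω = W / (α·b·l)   pages/s,

    so keeping α (the number of disk passes per URL) small is what lets the crawl
    rate stay high as N grows.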

  14. spam avoidance • BFS is a poor technique in the presence of spam farms and infinite webs. • Simply restricting the branching factor or the maximum number of pages/hosts per domain is not a viable solution. • Computing traditional PageRank for each page would be prohibitively expensive in large crawls. • Spam can be effectively deterred by budgeting the number of allowed pages per pay-level domain (PLD).
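
For reference, a pay-level domain is the registrable domain one level below a public suffix (e.g. both www.example.com and news.example.com belong to the PLD example.com). A toy extraction sketch, assuming a tiny hard-coded suffix list rather than the full public-suffix data:

    # Toy pay-level-domain (PLD) extraction (tiny suffix list is an assumption;
    # a real crawler would use the complete public-suffix list).
    PUBLIC_SUFFIXES = {"com", "org", "net", "co.uk", "ac.uk"}

    def pld(host: str) -> str:
        labels = host.lower().strip(".").split(".")
        # Find the longest known public suffix, then keep exactly one more label.
        for i in range(1, len(labels)):
            if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
                return ".".join(labels[i - 1:])
        return host

    # pld("www.example.com")  -> "example.com"
    # pld("news.bbc.co.uk")   -> "bbc.co.uk"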

  15. cont.. • This solution is called Spam Tracking and Avoidance through Reputation (STAR). • Each PLD x has a budget Bx that represents the number of pages allowed to pass from x (including all of its subdomains) to the crawling threads every T time units. • Each PLD x starts with a default budget B0, which is then dynamically adjusted as its in-degree dx changes over time. • DRUM is used to store PLD budgets and to aggregate PLD-PLD link information.
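
A schematic sketch of the budget bookkeeping (the adjustment rule and the constants below are assumptions, not the paper's actual function, and at IRLbot's scale the counters live in DRUM rather than in Python dicts):

    # Schematic STAR-style budget bookkeeping (adjustment rule and constants assumed).
    B0, B_MIN, B_MAX = 32, 8, 10_000          # default / floor / ceiling page budgets

    pld_indegree = {}                          # PLD -> number of distinct in-linking PLDs
    pld_budget = {}                            # PLD -> pages allowed per T time units

    def record_pld_link(src_pld, dst_pld, seen_pairs):
        # Count each PLD->PLD edge once, then refresh the destination's budget.
        if src_pld != dst_pld and (src_pld, dst_pld) not in seen_pairs:
            seen_pairs.add((src_pld, dst_pld))
            pld_indegree[dst_pld] = pld_indegree.get(dst_pld, 0) + 1
            # Assumed monotone rule: more in-linking PLDs -> larger budget.
            pld_budget[dst_pld] = max(B_MIN, min(B_MAX, B0 * pld_indegree[dst_pld]))

    def within_budget(url_pld, pages_sent_this_interval):
        # A URL passes the budget check while its PLD still has budget left
        # in the current T-unit interval.
        return pages_sent_this_interval.get(url_pld, 0) < pld_budget.get(url_pld, B0)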

  16. Politeness • Prior work has only enforced a certain per-host access delay of τh seconds ━ easy to crash servers that co-locate thousands of virtual hosts. • To decide which URLs are admitted into RAM, IRLbot uses Budget Enforcement with Anti-Spam Tactics (BEAST). • BEAST does not discard URLs, but rather delays their download until more is known about the PLD they belong to.

  17. cont.. • A naive implementation maintains two queues ━ Q contains URLs that passed the budget check ━ QF contains those that failed • After Q is emptied, QF is read and again split into Q and QF; these re-checks happen at regularly increasing intervals. • For N → ∞, the required disk speed λ → 2Sb(2L - 1) = constant. • This is roughly four times the speed needed to write all unique URLs to disk as they are discovered during the crawl.
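
A toy sketch of that naive two-queue loop, assuming in-memory deques and a hypothetical budget_ok check (the paper's final BEAST design generalizes this to more queues to reach the constant disk speed quoted above):

    # Toy sketch of the naive BEAST re-checking loop (deques stand in for the
    # disk-resident queues Q and QF; budget_ok and dispatch are assumed helpers).
    from collections import deque

    def beast_pass(q_failed, budget_ok, dispatch):
        """Re-check every URL that failed earlier budget checks (QF).
        Passing URLs are dispatched for download (Q); failing URLs are kept,
        not discarded, for the next, later pass."""
        still_failed = deque()
        while q_failed:
            url = q_failed.popleft()
            if budget_ok(url):
                dispatch(url)                  # admitted into RAM / crawling threads
            else:
                still_failed.append(url)       # download delayed, URL retained
        return still_failed

    # The crawler calls beast_pass repeatedly, increasing the interval between
    # passes, so URLs from low-budget PLDs are re-examined less and less often.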

  18. Experiments - summary • Active crawling period of 41 days in summer 2007. • IRLbot attempted 7.6 billion connections and received 7.4 billion valid HTTP replies. • Average download rate of 319 Mbit/s (1,789 pages/s).

  19. Conclusion • This paper tackled the issue of scaling web crawlers to billions and even trillions of pages ━ on a single server with constant CPU, disk, and memory speed • We identified several bottlenecks in building an efficient large-scale crawler and presented solutions to these problems ━ low-overhead disk-based data structures ━ non-BFS crawling order ━ real-time reputation to guide the crawling rate

  20. Future work • Refining reputation algorithms and assessing their performance. • Mining the collected data.
