The Anatomy of a Large-Scale Hypertextual Web Search Engine

The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

The Original Google Paper Google is a common spelling of googol, or 10^100, which fits well with the authors’ goal of building very large-scale search engines.

Outline • Design goals • System features • System anatomy • Results and performance • Paper analysis

Design Goals

Design Goals Scale with the rapid growth of the web

Design Goals • Improved Search Quality • The number of documents on the web is increasing rapidly, but users’ ability to look at them lags behind. • Current search engines return lots of “junk” results with little relevance. (Note: We’re talking about the year 1998.) • Academic Search Engine Research • Push more development and understanding into the academic realm. • Build systems that a reasonable number of people can actually use. • Build an architecture that supports novel research activities on large-scale web data.

System Features

System Features • Makes use of the link structure of the Web to calculate a quality ranking for each page, called the PageRank. • A probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. • It considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value.

PageRank: Bringing Order to the Web • PR(A): PageRank of a webpage A • PR(Ti): PageRank of a webpage Ti pointing to A • C(Ti): number of outbound links on webpage Ti • L(A): set of webpages linking to A • d: damping factor, a value between 0 and 1; the probability that a random surfer keeps following links, so 1 − d is the probability that the surfer stops clicking and jumps to a random page • Note that PageRanks form a probability distribution over webpages, so the PageRanks of all webpages sum to 1.

PageRank: Bringing Order to the Web • Assume a universe of 4 webpages: A, B, C, and D • Taking into consideration that a random surfer will eventually stop clicking, we assume a damping factor, d, which is generally assumed to be 0.85
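The formula these symbols belong to did not survive in the slide text. The paper gives it as PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). The sketch below is a minimal illustration, not the authors' implementation: it uses the common normalized variant, with (1 - d)/N in place of (1 - d), so that the ranks sum to 1. The link graph between A, B, C, and D is made up for the example, and the fixed iteration count is an assumption.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank for a dict mapping page -> outbound links.

    Assumes every page has at least one outbound link (no dangling pages).
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Each page Ti linking here contributes PR(Ti) / C(Ti).
            incoming = sum(
                ranks[src] / len(links[src])
                for src in pages
                if page in links[src]
            )
            new_ranks[page] = (1 - d) / n + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical link graph for the 4-page universe (not from the slides).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}
ranks = pagerank(links)
for page, rank in sorted(ranks.items()):
    print(page, round(rank, 4))
```

In this made-up graph nothing links to D, so D keeps only the baseline share (1 - d)/N, while C collects votes from three pages.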

System Features • Makes use of the anchor text of links on webpages: • E.g. <a href=http://www.yahoo.com>Yahoo!</a> • The text of a link is associated not only with the webpage it appears on but also with the webpage it points to, about which it gives (sometimes more relevant) information. • Anchors may exist for documents that generally cannot be indexed by text-based search engines, such as images, programs, and databases.
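As a rough illustration of crediting anchor text to the target page, the sketch below collects link text per destination URL using Python's standard `html.parser`. The class name `AnchorCollector` and the sample page are hypothetical; the slides do not say how Google's parser does this.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect anchor text keyed by the URL the link points to."""

    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)  # target URL -> list of anchor texts
        self._href = None                 # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors[self._href].append("".join(self._text).strip())
            self._href = None


# Hypothetical page content for the example.
page = '<p>Try <a href="http://www.yahoo.com">Yahoo!</a> for a directory.</p>'
collector = AnchorCollector()
collector.feed(page)
collector.close()
anchors = dict(collector.anchors)
print(anchors)
```

The text "Yahoo!" ends up filed under the target URL rather than under the page that contains the link, which is the association the slide describes.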

System Features • Uses location information for all hits and thus makes extensive use of proximity in search. • Keeps track of the visual presentation of text on webpages, such as font size. Words in a larger or bolder font are given more weight. • Stores the complete raw HTML of webpages in a repository.

System Anatomy

Major Data Structures • BigFiles • Virtual files spanning multiple file systems and addressable by 64-bit integers. • Repository • Contains the full compressed HTML of all pages. • Documents are stored one after another, each prefixed with its docID, length, and URL. • Compressed with zlib, which favors speed, rather than bzip, which offers a higher compression ratio.
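The repository layout can be sketched as length-prefixed records packed back to back. This is only an illustration under assumed field widths: the header format `<QHI`, the helper names, and the sample pages are choices made for the example, not the paper's exact on-disk format.

```python
import struct
import zlib

# Assumed header: docID (8 bytes), URL length (2 bytes), compressed page length (4 bytes).
HEADER = struct.Struct("<QHI")


def pack_record(doc_id, url, html):
    """Build one repository record: header, then URL, then zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    compressed = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(compressed)) + url_bytes + compressed


def iter_records(blob):
    """Walk records stored one after another and yield (docID, URL, HTML)."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, page_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + page_len]).decode("utf-8")
        offset += page_len
        yield doc_id, url, html


# Hypothetical documents for the example.
repository = (
    pack_record(1, "http://example.com/a", "<html><body>first page</body></html>")
    + pack_record(2, "http://example.com/b", "<html><body>second page</body></html>")
)
records = list(iter_records(repository))
print(records)
```

Because every record carries its own lengths, the whole repository can be rebuilt or re-indexed with a single sequential pass.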

Major Data Structures • Document Index • Keeps information about each document. • It is a fixed-width index, ordered by docID. • Stores the document status, a pointer into the repository, and a checksum. • If the document has been indexed, the entry points to a variable-width file, docinfo, which contains its URL and title. Otherwise it points to the URLlist, which contains only the URL. • Lexicon • Contains a list of null-separated words (about 14 million) and a hash table of pointers.
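A fixed-width index ordered by docID allows an entry to be located by arithmetic alone, with no search. The sketch below assumes a 13-byte entry (status, repository pointer, checksum) and uses the docID as the entry's position; the field widths, the status codes, and the use of CRC32 as the checksum are assumptions for illustration.

```python
import struct
import zlib

# Assumed entry layout: status (1 byte), repository pointer (8 bytes), checksum (4 bytes).
ENTRY = struct.Struct("<BQI")


def build_index(entries):
    """Pack (status, repo_pointer, checksum) tuples; list position acts as the docID."""
    return b"".join(ENTRY.pack(*entry) for entry in entries)


def lookup(index, doc_id):
    """Jump straight to the entry for doc_id using its fixed width."""
    return ENTRY.unpack_from(index, doc_id * ENTRY.size)


# Hypothetical entries: status 1 = indexed, 0 = only the URL is known.
index = build_index([
    (1, 0, zlib.crc32(b"page zero")),
    (0, 4096, zlib.crc32(b"page one")),
])
entry = lookup(index, 1)
print(entry)
```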

Major Data Structures • Hit Lists • A list of occurrences of a particular word in a particular document including position, font, and capitalization information. • Hit lists account for most of the space used in both the forward and the inverted indices. • Forward Index • Stored in a number of barrels. • If a document contains words that fall into a particular barrel, the docID is recorded into the barrel followed by a list of wordIDs with their hitlists.
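Because hit lists dominate index size, hits are encoded compactly. The paper describes a plain hit as two bytes: 1 capitalization bit, 3 bits of font size, and 12 bits of word position. The sketch below packs and unpacks that layout; the function names and the choice to clamp positions at 4095 are assumptions for illustration.

```python
def pack_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 capitalization bit, 3 font bits, 12 position bits."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # assumption: clamp positions that overflow 12 bits
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_hit(hit):
    """Return (capitalized, font_size, position) from a packed 16-bit hit."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF


hit = pack_hit(True, 3, 17)
print(hit, unpack_hit(hit))
```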

Major Data Structures • Inverted Index • The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter.

Crawling the Web • Several distributed crawlers. • The URLserver serves lists of URLs to the crawlers. • Each crawler keeps ~300 connections open at once. • At peak speed, a system of 4 crawlers can crawl ~100 pages per second, or ~600 KB of data per second. • Each crawler maintains its own DNS cache for fast lookups. • The parser handles a huge array of possible errors, including HTML errors, non-ASCII characters, and HTML tags nested hundreds deep.

Indexing the Web • Indexing Documents into Barrels • After each document is parsed, it is encoded into a number of barrels. • Every word is converted into a wordID using an in-memory hash table – the lexicon. • Once words are converted into wordIDs, their occurrences in the current document are translated into hit lists and written into the forward barrels. • Sorting • The sorter takes each of the forward barrels and sorts it by wordID to produce two inverted barrels: one for title and anchor hits, and one for the full text.
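The sorting step can be pictured as regrouping a forward barrel (docID, then wordIDs with hit lists) by wordID. The sketch below is a simplified in-memory version with made-up IDs; the real sorter works on on-disk barrels, and the function name `invert` is hypothetical.

```python
from collections import defaultdict


def invert(forward_barrel):
    """Turn [(docID, [(wordID, hit_list), ...]), ...] into {wordID: [(docID, hit_list), ...]}.

    Word IDs come out in ascending order, and each doclist is sorted by docID.
    """
    inverted = defaultdict(list)
    for doc_id, entries in forward_barrel:
        for word_id, hits in entries:
            inverted[word_id].append((doc_id, hits))
    return {
        word_id: sorted(doclist, key=lambda item: item[0])
        for word_id, doclist in sorted(inverted.items())
    }


# Hypothetical forward barrel: hit lists are shown as plain word positions.
forward = [
    (2, [(10, [1, 7]), (11, [3])]),
    (1, [(10, [4])]),
]
inverted = invert(forward)
print(inverted)
```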

Searching 1. Parse the query. 2. Convert words into wordIDs. 3. Seek to the start of the doclist in the short barrel for every word. 4. Scan through the doclists until there is a document that matches all the search terms. 5. Compute the rank of that document for the query. 6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4. 7. If we are not at the end of any doclist, go to step 4. 8. Sort the documents that have matched by rank and return the top k.
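The query procedure above can be sketched as a merge over sorted doclists, first in the short barrels and then in the full barrels. Everything concrete here is an assumption made for the example: the barrel representation (wordID mapped to a docID-sorted list of (docID, hit count) pairs), the sample data, and especially the scoring, which simply weights short-barrel hits twice as much as full-barrel hits and stands in for Google's actual ranking function.

```python
import heapq


def scan(doclists):
    """Yield (docID, hit counts) for every docID present in all sorted doclists."""
    positions = [0] * len(doclists)
    while all(pos < len(dl) for pos, dl in zip(positions, doclists)):
        current = [dl[pos][0] for pos, dl in zip(positions, doclists)]
        target = max(current)
        if all(doc_id == target for doc_id in current):
            yield target, [dl[pos][1] for pos, dl in zip(positions, doclists)]
            positions = [pos + 1 for pos in positions]
        else:
            # Advance only the lists that are behind the largest current docID.
            positions = [
                pos + 1 if doc_id < target else pos
                for pos, doc_id in zip(positions, current)
            ]


def search(query, lexicon, short_barrel, full_barrel, k=10):
    # Parse the query and convert its words into wordIDs.
    word_ids = [lexicon.get(word) for word in query.lower().split()]
    if not word_ids or None in word_ids:
        return []
    scores = {}
    # Scan the short barrels first, then continue in the full barrels.
    for weight, barrel in ((2.0, short_barrel), (1.0, full_barrel)):
        doclists = [barrel.get(word_id, []) for word_id in word_ids]
        for doc_id, hits in scan(doclists):
            # Placeholder rank: weighted hit count (hypothetical scoring).
            scores[doc_id] = scores.get(doc_id, 0.0) + weight * sum(hits)
    # Sort the matched documents by rank and return the top k.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical lexicon and barrels.
lexicon = {"bill": 1, "clinton": 2}
short_barrel = {1: [(3, 1)], 2: [(3, 1), (5, 1)]}
full_barrel = {1: [(3, 2), (4, 1), (7, 1)], 2: [(3, 1), (7, 3), (9, 1)]}
results = search("bill clinton", lexicon, short_barrel, full_barrel)
print(results)
```

Keeping doclists sorted by docID is what makes the multi-word match a single forward pass: no list is ever rewound.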

Results and Performance

Results and Performance • A qualitative analysis of the search results by users has generally been positive. • The current version of Google answers most queries in between 1 and 10 seconds. • Because Google takes the proximity of word occurrences into account, its results are more relevant than those of search engines that simply return every document containing all of the query words. (E.g. a search for ‘bill clinton’ gives lower importance to results in which ‘bill’ and ‘clinton’ appear independently.)

Future Work • Search times in the current version of Google are dominated by disk I/O. • Introduce query caching, along with hardware, software, and algorithmic optimizations. • Improve search efficiency and quickly scale to ~100 million web pages. • Develop Google into a large-scale research tool and resource for searchers and researchers.

Analyses of the Research Paper • Pros • One of the first descriptions of the PageRank algorithm, which changed how search engines rank and index the web. • Using the citation graph and anchor text to rank pages closely mirrors how users themselves judge websites. • Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them. • The paper states that Google does not compromise PageRank for monetary gain, which lends more credibility to its search results. This holds true to date.

Analyses of the Research Paper • Cons • One of the first flaws found in the PageRank algorithm was the “Google Bomb”: • Because of PageRank, a page ranks higher if the sites that link to it use consistent anchor text. • A Google bomb is created when a large number of sites link to a page in this manner. • Ranking quality is insufficient when only PageRank and anchor text are used. (Google today uses more than 200 different parameters to judge the quality of a webpage.)

Thank You Presented by: Nilay Khandelwal
