Search Algorithms Winter Semester 2004/2005 22 Nov 2004 6th Lecture

Search AlgorithmsWinter Semester 2004/200522 Nov 20046th Lecture Christian Schindelhauer schindel@upb.de

Chapter III Chapter III Searching the Web 22 Nov 2004

Searching the Web • Introduction • The Anatomy of a Search Engine • Google’s Pagerank algorithm • The Simple Algorithm • Periodicity and convergence • Kleinberg’s HITS algorithm • The algorithm • Convergence • The Structure of the Web • Pareto distributions • Search in Pareto-distributed graphs

Overview Search Engineshttp://www.searchengineshowdown.com/(March 2002) • Number of documents

Overview Search Engineshttp://www.searchengineshowdown.com/(Dez. 2002) • Number of documents

Problems of Searching the Web • Currently (Nov 2004) more than 8 billion = 8.000 millions web-pages • 10.000 words cover more than 95% of each text • much more web-pages than words • Users hardly ever look through more than 40 results • The problem is not to find a pattern, but to find the most important pages • Problems: • Important pages do not contain the search pattern • www.porsche.com does not contain sports car or even car • www.google.com does not contain web search engine • www.airbus.com does not contain airplane • Certain pages have nearly every word (dictionary) • Names are misleading • http://www.whitehouse.org/ is not the web-site of the white house • www.theonion.com is not about vegetables • Certain pattern can be found everywhere, e.g. page, web, windows, ...

How to rank Web-pages • The main problem about searching the web is to rank the importance • Links are very helpful: • Humans are usually introduced on purpose • The context of the links gives some clues about the meaning of the web-page • Pages where many people point to are of probably very important • Most search rely on links • Other approach: Ontology of words • Compare the combination of words with the search word • Good for comparing text • Difficult if single word patterns are given

The Anatomy of a Web Search Engine • “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page, Computer Networks and ISDN Systems, Vol. 30, 1-6, p. 107-117, 1998 • Design of the prototype • Stanford University 1998 • Key components: • Web Crawler • Indexer • Pagerank • Searcher • Main difference between Google and other search engines (in 1998) • The Pagerank mechanism

Simplified PageRank-Algorithmus • Simplified PageRank-Algorithmus • Rank of a wep-page R(u)  [0,1] • Important pages hand their rank down to the pages they link to. • c is a normalisation factor such that ||R(u)||1= 1, i.e. • the sum of all page ranks add to 1 • Predecessor nodes Bu • sucessor nodes Fu

The Simplifed Pagerank Algorithm and an example

Matrix representaion R  c M R , where R is a vector (R(1),R(2),… R(n)) and M denotes the following n  n – Matrix

The Simplified Pagerank Algorithm • Does it converge? • If it converges, does it converge to a single result? • Is the result reasonable?

The Eigenvector and Eigenvalue of the Matrix • For vector x and n  n-matrix and a number λ: • If M x = λ x then x is called the eigenvector and λ the eigen-value • Every n  n-matrix M has at most n eigenvalues • Compute the eigenvalues by eigen-decomposition M x = λ x (M - I λ) x = 0, where I is the identity matrix • This equality has only non-trivial solutions if Det(M - I λ) = 0 • This leads to a polynomial equation of degree n, which has always n solutions λ1, λ2, ..., λn • (Fundamental theorem of algebra) • Solving the linear equations (M - I λi) x = 0 lead to the eigenvectors • The eigenvektor of the matrix is a fix point of the recursion of the simplified pagerank algorithm

Stochastic Matrices • Consider n discrete states and a sequence of random variable X1, X2, ... over this set of states • The sequence X1, X2, ... is a Markov chain if • A stochastic matrix M is the transition matrix for a finite Markov chain, also called a Markov matrix: • Elements of the matrix M must be real numbers of [0, 1]. • The sum of all column in M is 1 • Observation for the matrix M of the simpl. pagerank algorithm • M is stochastic if all nodes have at least one outgoing link

The Random Surfer • Consider the following algorithm • Start in a random web-page according to a probability distribution • Repeat the following for t rounds • If no link is on this page, exit and produce no output • Uniformly and randomly choose a link of the web-page • Follow that link and go to this web-page • Output the web-page Lemma The probability that a web-page i is output by the random surfer after t rounds started with probability distribution x1, .., xn is described by the i-th entry of the output of the simplified Pagerank-algorithm iterated for t rounds without normalization. Proof follows applying the definition of Markov chains

Eigenvalues of Stochastic Matrices • Notations • Die L1-Norm of a vector x is defined as • x0, if for all i: xi  0 • x0, if for all i: xi  0 • Lemma For every stochastic matrix M and every vector x we have • || M x ||1 || x ||1 • || M x ||1= || x ||1, if x0 or x0  Eigenvalues of M |i|  1 • Theorem For every stochastic matrix M there is an eigenvector x with eigenvalue 1 such that x  0 and ||x||1 = 1

The problem of periodicity - Example

Periodicity - Example 2

Periodic Matrices • Definition • A square matrix M such that the matrix power Mk=M for k a positive integer is called a periodic matrix. • If k is the least such integer, then the matrix is said to have period k. • If k = 1, then M2 = M and M is called idempotent. • Fact • For non-periodic matrices there are vectors x, such that limk Mk x does not converge. • Definition • The directed graph G=(V,E) of a n x n-matrix consistis of the node set V={1,..., n} and has edges • E = {(i,j) | Mij 0} • A path is a sequence of edges (u1,u2),(u2,u3),(u3,u4),..,(ut,ut+1) of a graph • A graph cycle is a path where the start node is the end node • A stronglyconnected subgraph S is a maximum sub-graph such that every graph cycle starting and ending in a node of S is contained in S.

Necessary and Sufficient Conditions for Periodicity • Theorem (necessary condition) • If the stochastic matrix M is periodic with period t2, then for the graph G of M there exists a strongly connected subgraph S of at least two nodes such that every directed graph cycle within S has a length of the form i t for natural number i. • Theorem (sufficient condition) • Let the graph consist of one strongly connected subgraph and • let L1,L2, ..., Lm be the lengths all directed graph cycles of maximal length n • Then M is non-periodic if and only if gcd(L1,L2, ..., Lm) = 1 • Notation: • gcd(L1,L2, ..., Lm) = greatest common divisor of numbers L1,L2, ..., Lm • Corollary • If the graph is strongly connected and there exists a graph cycly of length 1 (i.e. a loop), then M is non-periodic.

Disadvantages of the Simplified Pagerank-Algorithm • The Web-graph has sinks, i.e. pages without links  M is not a stochastic matrix • The Web-graph is periodic  Convergence is uncertain • The Web-graph is not strongly connected Several convergence vectors possible • Rank-sinks • Strongly connected subgraphs absorb all weight of the predecessors • All predecessors pointing to a web-page loose their weight.

The (non-simplified) Pagerank-Algorithm • Add to a sink links to all web-pages • Uniformly and randomly choose a web-page • With some probability q < 1 perform a step of the simplified Pagerank algorithm • With probability 1-q start with the first step (and choose a random web-page) • Note M ist stochastic

Properties of the Pagerank-Algorithm • Graph der Matrix is strongly connected • There are graph cycles of length 1 Theorem In non-periodic matrices of strongly connected graphs the Markov-chain converges to a unique eigenvector with eigenvalue 1.  PageRank converges to this unique eigenvector

Thanks for your attentionEnd of 6th lectureNext lecture: Mo 29 Nov 2004, 11.15 am, FU 116Next exercise class: Mo 22 Nov 2004, 1.15 pm, F0.530 or We 24 Nov 2004, 1.00 pm, E2.316

Search Algorithms Winter Semester 2004/2005 22 Nov 2004 6th Lecture