Mining Web’s Link Structure

Mining Web’s Link Structure Sushanth Rai University of Texas at Arlington http://cseweb.uta.edu/~rai

Structure of WWW • Highly Decentralized • Unstructured • Hyperlink Based • Disorganized Presentation

Searching the WWW • Searching: the process of discovering high-quality, relevant pages in response to a specific information need

Challenges in Search Engines • Index-based search engines may return anywhere from one result to a million • Ranking heuristics rely on how frequently the query words occur in a page • Spamming can mislead index-based search engines • Human language exhibits synonymy and polysemy • Web pages are not self-descriptive

Searching with Hyperlinks • Features • Hyperlinks represent latent human judgment • Hyperlinks provide an opportunity to find potential authorities • Pitfalls • Many links are created for purposes other than conferring authority • Popularity must be balanced against relevance

Focused Subgraph of WWW • Authority: a page that is pointed to by many good hubs • Hub: a page that points to many good authorities • Authorities and hubs are extracted from a focused subgraph, a set of pages that • Is relatively small • Is rich in content related to the query • Contains the strongest authorities

[Figure: root set and base set]

Construction of Subgraph • Subgraph(σ, ℰ, t, d) • σ: a query string • ℰ: a text-based search engine • t, d: natural numbers • Let R_σ denote the top t results of ℰ on σ • Set S_σ = R_σ • For each page p ∈ R_σ • Let Γ+(p) denote the set of all pages p points to • Let Γ-(p) denote the set of all pages pointing to p • Add all pages in Γ+(p) to S_σ • If |Γ-(p)| ≤ d then • Add all pages in Γ-(p) to S_σ • Else • Add an arbitrary set of d pages from Γ-(p) to S_σ • End • Return S_σ
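The Subgraph procedure above can be sketched in Python as follows. This is only an illustration: the `search`, `out_links`, and `in_links` callables are hypothetical stand-ins for a text-based search engine and a link index, and the toy link table is invented for the example. The "arbitrary set of d pages" is taken here as the first d pages in sorted order, which is one possible choice among many.

```python
def build_subgraph(query, search, out_links, in_links, t, d):
    """Grow the top-t search results (root set) into a base set of pages."""
    root = list(search(query))[:t]           # top t results for the query
    base = set(root)
    for page in root:
        base.update(out_links(page))         # every page that `page` points to
        parents = sorted(in_links(page))     # pages pointing to `page`
        base.update(parents[:d])             # all of them if <= d, else d of them
    return base


# Toy link table (hypothetical data): page -> pages it links to.
LINKS = {"a": {"x"}, "b": {"y"}, "h1": {"a"}, "h2": {"a"}, "h3": {"a", "b"}}

def toy_search(query):
    return ["a", "b", "c"]                   # pretend ranking for any query

def toy_out_links(page):
    return LINKS.get(page, set())

def toy_in_links(page):
    return {src for src, outs in LINKS.items() if page in outs}

base_set = build_subgraph("java", toy_search, toy_out_links, toy_in_links, t=2, d=2)
```

With t = 2 the root set is {a, b}; page a has three in-linking pages, so only d = 2 of them are added.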

Pruning the Subgraph • In the graph G[S] induced by the set S • Classify each link as transverse (between pages with different domain names) or intrinsic (between pages with the same domain name) • Delete all intrinsic links and retain only the transverse links
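A minimal sketch of the pruning step, assuming that a page's "domain name" can be approximated by the hostname of its URL (so `www.example.org` and `example.org` would count as different domains here). The function name and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def keep_transverse(edges):
    """Drop intrinsic links (same hostname), keep transverse ones."""
    return [
        (src, dst)
        for src, dst in edges
        if urlsplit(src).hostname != urlsplit(dst).hostname
    ]


# Hypothetical links between pages of the induced subgraph.
sample_edges = [
    ("http://a.example/index.html", "http://a.example/about.html"),  # intrinsic
    ("http://a.example/index.html", "http://b.example/"),            # transverse
    ("http://b.example/faq.html", "http://a.example/index.html"),    # transverse
]
pruned = keep_transverse(sample_edges)
```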

Computing Hubs and Authorities • Associate a non-negative authority weight x<p> and a non-negative hub weight y<p> with each page p • Weights of each type are normalized so that their squares sum to 1 • Use the I and O operations iteratively to update the weights • I : x<p> ← Σ_{q : (q,p) ∈ E} y<q> • O : y<p> ← Σ_{q : (p,q) ∈ E} x<q>
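The two operations can also be written in matrix form. The adjacency matrix A below is notation introduced here for illustration, not taken from the slides; x and y are the vectors of authority and hub weights.

```latex
A_{pq} =
\begin{cases}
1 & \text{if } (p,q) \in E \\
0 & \text{otherwise}
\end{cases}
\qquad
\mathcal{I}:\; x \leftarrow A^{T} y
\qquad
\mathcal{O}:\; y \leftarrow A x
```

Row p of A lists the pages that p points to, so (A x) sums the authority weights of p's targets, and (A^T y) sums the hub weights of the pages pointing to p.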

[Figure: hubs, authorities, and an unrelated page of large in-degree]

Iterative Algorithm • Iterate(G, k) • G: a collection of n linked pages • k: a natural number • Let z denote the vector (1, 1, 1, …, 1) ∈ R^n • Set x_0 = z • Set y_0 = z • For j = 1, 2, …, k • Apply the I operation to (x_{j-1}, y_{j-1}), obtaining new x-weights x'_j • Apply the O operation to (x'_j, y_{j-1}), obtaining new y-weights y'_j • Normalize x'_j, obtaining x_j • Normalize y'_j, obtaining y_j • End • Return (x_k, y_k)
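The Iterate procedure above can be sketched in plain Python. The dictionary-based representation, the function names, and the four-page toy graph are assumptions made for this example; x holds authority weights and y holds hub weights, as in the slides.

```python
import math


def normalize(weights):
    """Scale weights so that their squares sum to 1 (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {p: v / norm for p, v in weights.items()} if norm else dict(weights)


def iterate(pages, edges, k):
    """Run k rounds of the I and O operations; return (authority, hub) weights."""
    x = {p: 1.0 for p in pages}              # x_0 = z
    y = {p: 1.0 for p in pages}              # y_0 = z
    for _ in range(k):
        new_x = {p: 0.0 for p in pages}
        for q, p in edges:                   # I: x<p> = sum of y<q> over links q -> p
            new_x[p] += y[q]
        new_y = {p: 0.0 for p in pages}
        for p, q in edges:                   # O: y<p> = sum of new x<q> over links p -> q
            new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


# Toy graph (hypothetical): two hub-like pages pointing at two targets.
pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
x, y = iterate(pages, edges, k=20)
best_authority = max(x, key=x.get)
best_hub = max(y, key=y.get)
```

In this toy graph `a1` is pointed to by both hub-like pages and `h1` points to both targets, so they end up with the largest authority and hub weights respectively.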

Results • (java) Authorities • .328 http://www.gamelan.com • .251 http://java.sun.com • .190 http://www.digitalfocus.com/digitalfocus/faq/howdoi.html • .183 http://sunsite.unc.edu/javafaq/javafaq.html • (Gates) Authorities • .643 http://www.roadahead.com • .458 http://www.microsoft.com • .440 http://www.microsoft.com/corpinfo/bill-g.htm

Results (Contd…) • Comparison of AltaVista, Yahoo, and Clever on 26 broad search topics, with results rated as “bad”, “fair”, “good”, or “fantastic” • For 31% of the topics, Yahoo and Clever received equivalent evaluations • For 50%, Clever received a higher evaluation • For 19%, Yahoo received a higher evaluation • AltaVista did not receive a higher evaluation on any of the 26 topics

Applications • Constructing Taxonomies semiautomatically • Trawling the web for Emerging Cybercommunities • Mining structured information that succumbs to database techniques

Web Resources • Clever - http://www.almaden.ibm.com/cs/k53/clever.html • Google - http://www.google.com • WebL - http://www.research.compaq.com/SRC/WebL

Questions ??
