
Mining Web’s Link Structure

Sushanth Rai, University of Texas at Arlington, http://cseweb.uta.edu/~rai


Presentation Transcript


  1. Mining Web’s Link Structure Sushanth Rai University of Texas at Arlington http://cseweb.uta.edu/~rai

  2. Structure of WWW • Highly Decentralized • Unstructured • Hyperlink Based • Disorganized Presentation

  3. Searching the WWW • Searching: the process of discovering high-quality, relevant pages in response to a specific information need

  4. Challenges in Search Engines • Index-based search engines return anywhere from one result to a million • Ranking heuristics rely on how frequently words occur • Spamming can mislead index-based search engines • Human language exhibits synonymy and polysemy • Web pages are not self-descriptive

  5. Searching with Hyperlinks
     • Features
        • Hyperlinks represent latent human judgment
        • Hyperlinks provide an opportunity to find potential authorities
     • Pitfalls
        • Links are created for purposes other than conferring authority
        • Popularity must be balanced against relevance

  6. Focused Subgraph of WWW
     • Authority: a page that is pointed to by many good hubs
     • Hub: a page that points to many good authorities
     • Authorities and hubs are extracted from a focused subgraph, a set of pages that
        • is relatively small
        • is rich in content related to the query
        • contains the strongest authorities

  7. [Figure: the root set expanded into the base set]

  8. Construction of Subgraph
     • Subgraph($\sigma$, $\mathcal{E}$, $t$, $d$)
        • $\sigma$: a query string
        • $\mathcal{E}$: a text-based search engine
        • $t$, $d$: natural numbers
     • Let $R_\sigma$ denote the top $t$ results of $\mathcal{E}$ on $\sigma$
     • Set $S_\sigma := R_\sigma$
     • For each page $p \in R_\sigma$
        • Let $\Gamma^+(p)$ denote the set of all pages $p$ points to
        • Let $\Gamma^-(p)$ denote the set of all pages pointing to $p$
        • Add all pages in $\Gamma^+(p)$ to $S_\sigma$
        • If $|\Gamma^-(p)| \le d$ then
           • Add all pages in $\Gamma^-(p)$ to $S_\sigma$
        • Else
           • Add an arbitrary set of $d$ pages from $\Gamma^-(p)$ to $S_\sigma$
     • End
     • Return $S_\sigma$
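The expansion step above can be sketched in Python. This is a minimal illustration, not the author's code: it assumes the search-engine call has already produced the root set, and the names `build_base_set`, `out_links`, `in_links`, and the toy `LINKS` graph are hypothetical stand-ins for a real crawler or link index.

```python
import random
from typing import Callable, Dict, List, Set


def build_base_set(
    root: List[str],
    out_links: Callable[[str], Set[str]],
    in_links: Callable[[str], Set[str]],
    d: int,
    seed: int = 0,
) -> Set[str]:
    """Expand a root set R (top-t search results) into the base set S."""
    rng = random.Random(seed)
    base = set(root)                      # Set S = R
    for p in root:                        # for each page p in R
        base |= out_links(p)              # add every page p points to
        preds = in_links(p)               # pages pointing to p
        if len(preds) <= d:
            base |= preds                 # few in-links: keep them all
        else:
            # "Arbitrary" d pages; sorting first makes the seeded sample reproducible.
            base |= set(rng.sample(sorted(preds), d))
    return base


# Hypothetical toy link graph: page -> pages it links to.
LINKS: Dict[str, Set[str]] = {
    "a": {"b", "c"},
    "b": {"c"},
    "c": set(),
    "h1": {"a"},
    "h2": {"a"},
    "h3": {"a"},
}


def toy_out(p: str) -> Set[str]:
    return set(LINKS.get(p, set()))


def toy_in(p: str) -> Set[str]:
    return {q for q, targets in LINKS.items() if p in targets}


# Root set {"a"} with d = 2: only two of the three pages linking to "a" are kept.
base = build_base_set(["a"], toy_out, toy_in, d=2)
print(sorted(base))
```

Capping the in-links at d keeps a single very popular root page from flooding the base set with thousands of pages.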

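  9. Pruning the Subgraph • In the graph G[S] induced by the set S, classify each link as transverse (between pages with different domain names) or intrinsic (between pages with the same domain name) • Delete all intrinsic links, which mostly serve navigation within a site, and retain only the transverse links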
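A short sketch of the pruning step, assuming that a link is intrinsic when both endpoints share a domain name and transverse otherwise. Comparing full hostnames is a simplifying assumption here, and the edges in the example are made up for illustration; only the URLs themselves appear elsewhere in the slides.

```python
from typing import List, Tuple
from urllib.parse import urlsplit

Edge = Tuple[str, str]  # (source URL, target URL)


def is_transverse(src: str, dst: str) -> bool:
    """True when the two URLs sit on different hosts (assumed domain test)."""
    return urlsplit(src).hostname != urlsplit(dst).hostname


def prune_intrinsic(edges: List[Edge]) -> List[Edge]:
    """Keep only transverse links; drop links within a single site."""
    return [(s, t) for s, t in edges if is_transverse(s, t)]


# Illustrative edges only, not data from the presentation.
edges = [
    ("http://www.microsoft.com", "http://www.microsoft.com/corpinfo/bill-g.htm"),
    ("http://www.roadahead.com", "http://www.microsoft.com"),
]
kept = prune_intrinsic(edges)
print(kept)
```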

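  10. Computing Hubs and Authorities
      • Associate a non-negative authority weight $x^{\langle p \rangle}$ and a non-negative hub weight $y^{\langle p \rangle}$ with each page $p$
      • Weights of each type are normalized so that their squares sum to 1
      • Apply the I and O operations iteratively to update the weights
      • I: $x^{\langle p \rangle} \leftarrow \sum_{q:(q,p) \in E} y^{\langle q \rangle}$
      • O: $y^{\langle p \rangle} \leftarrow \sum_{q:(p,q) \in E} x^{\langle q \rangle}$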
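The two update rules can also be written in matrix form. The adjacency matrix $A$ is notation introduced here, not taken from the slides: assume $A_{pq} = 1$ when $(p,q) \in E$ and $0$ otherwise, with $x$ the vector of authority weights and $y$ the vector of hub weights.

```latex
\[
\text{I:}\quad x \leftarrow A^{\mathsf{T}} y,
\qquad
\text{O:}\quad y \leftarrow A x
\]
% Two-page example with the single link (1,2):
\[
A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix},\quad
y = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
\;\Rightarrow\;
x = A^{\mathsf{T}} y = \begin{pmatrix} 0 \\ 1 \end{pmatrix},\quad
y = A x = \begin{pmatrix} 1 \\ 0 \end{pmatrix}
\]
```

In the example, page 2 ends up with all the authority weight and page 1 with all the hub weight, which matches the intuition that page 1 points and page 2 is pointed to.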

  11. [Figure: hubs pointing to authorities, contrasted with an unrelated page of large in-degree]

  12. Iterative Algorithm
      • Iterate($G$, $k$)
         • $G$: a collection of $n$ linked pages
         • $k$: a natural number
      • Let $z$ denote the vector $(1, 1, \ldots, 1) \in \mathbb{R}^n$
      • Set $x_0 = z$
      • Set $y_0 = z$
      • For $j = 1, 2, \ldots, k$
         • Apply the I operation to $(x_{j-1}, y_{j-1})$, obtaining new x-weights $x'_j$
         • Apply the O operation to $(x'_j, y_{j-1})$, obtaining new y-weights $y'_j$
         • Normalize $x'_j$, obtaining $x_j$
         • Normalize $y'_j$, obtaining $y_j$
      • End
      • Return $(x_k, y_k)$
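A minimal Python sketch of the Iterate procedure, using plain dictionaries instead of vectors. The function names and the four-page toy graph are assumptions made for illustration; the update order follows the slide, with the O operation reading the freshly computed, not yet normalized, x-weights.

```python
import math
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (p, q) means page p links to page q


def normalize(w: Dict[str, float]) -> Dict[str, float]:
    """Scale weights so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {p: v / norm for p, v in w.items()} if norm > 0 else dict(w)


def iterate(
    pages: List[str], edges: List[Edge], k: int
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (authority weights x, hub weights y) after k rounds."""
    x = {p: 1.0 for p in pages}          # x_0 = z
    y = {p: 1.0 for p in pages}          # y_0 = z
    for _ in range(k):
        # I operation: authority of p = sum of hub weights of pages linking to p.
        new_x = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_x[dst] += y[src]
        # O operation: hub weight of p = sum of new authority weights p links to.
        new_y = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_y[src] += new_x[dst]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


# Hypothetical toy graph: two hub-like pages and two authority-like pages.
pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
x, y = iterate(pages, edges, k=20)
print(max(x, key=x.get), max(y, key=y.get))
```

In this toy graph the page with the most in-links from hubs gets the largest authority weight, and the page linking to both authorities gets the largest hub weight.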

  13. Results
      (java) Authorities
         .328  http://www.gamelan.com
         .251  http://java.sun.com
         .190  http://www.digitalfocus.com/digitalfocus/faq/howdoi.html
         .183  http://sunsite.unc.edu/javafaq/javafaq.html
      (Gates) Authorities
         .643  http://www.roadahead.com
         .458  http://www.microsoft.com
         .440  http://www.microsoft.com/corpinfo/bill-g.htm

  14. Results (Contd…)
      Comparison of AltaVista, Yahoo, and Clever on 26 broad search topics, with results rated “bad”, “fair”, “good”, or “fantastic”
      • For 31% of the topics, Yahoo and Clever received equivalent evaluations
      • For 50%, Clever received the higher evaluation
      • For 19%, Yahoo received the higher evaluation
      • AltaVista did not receive the highest evaluation on any of the 26 topics

  15. Applications • Constructing Taxonomies semiautomatically • Trawling the web for Emerging Cybercommunities • Mining structured information that succumbs to database techniques

  16. Web Resources • Clever - http://www.almaden.ibm.com/cs/k53/clever.html • Google - http://www.google.com • WebL - http://www.research.compaq.com/SRC/WebL

  17. Questions ??
