
Detecting Sequences and Cycles of Web Pages

Narayan L. Bhamidipati and Sankar K. Pal, Indian Statistical Institute, Kolkata


Presentation Transcript


  1. Detecting Sequences and Cycles of Web Pages Narayan L. Bhamidipati and Sankar K. Pal Indian Statistical Institute Kolkata

  2. Contents • Introduction • Objective • Significance • Procedure • Experiments • Future directions

  3. The Web: A Directed Graph • (V, A) • Vertices ↔ Web pages • V = {v1, v2, …, vN} • Arcs ↔ Hyperlinks • A = {eij : vj → vi} • Path: p1.p2. … .pn with arcs from pi to pi+1 • Cycle: a path with pn = p1
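The path and cycle definitions on this slide can be checked mechanically. The following is a minimal sketch, assuming the graph is stored as a mapping from each page to the set of pages it links to; the page names and the helpers `is_path` and `is_cycle` are hypothetical and not taken from the presentation.

```python
# Hypothetical adjacency mapping: page -> set of pages it links to.
web = {
    "p1": {"p2"},
    "p2": {"p3"},
    "p3": {"p1"},
}

def is_path(graph, pages):
    """True if every consecutive pair (p_i, p_{i+1}) is joined by an arc."""
    return all(b in graph.get(a, set()) for a, b in zip(pages, pages[1:]))

def is_cycle(graph, pages):
    """A cycle is a path with at least one arc whose last page equals its first."""
    return len(pages) > 1 and pages[0] == pages[-1] and is_path(graph, pages)
```

For example, `["p1", "p2", "p3", "p1"]` is both a path and a cycle in `web`, while `["p1", "p3"]` is not a path because there is no arc from `p1` to `p3`.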

  4. Sequences of Web Pages • Paths consisting of adjacent web pages • Order sensitive • A surfer may follow one such sequence when browsing pages

  5. Cycles of Web Pages • http://www.stanford.edu/ • http://www.stanford.edu/home/atoz/letterw.html • http://www.stanford.edu/group/wellspring/ • http://www.stanford.edu/group/wellspring/yahoo_spotlight.html • http://www.yahoo.com/ • http://dir.yahoo.com/Education/ • http://dir.yahoo.com/Education/Higher_Education/ • http://dir.yahoo.com/Education/Higher_Education/Colleges_and_Universities/ • http://dir.yahoo.com/Education/Higher_Education/Colleges_and_Universities/United_States/ • http://www.stanford.edu

  6. What are we looking for? • A particular kind of sequence and cycle • Regular • Consisting of similar units • Units having similar relationships • Reasonably sized

  7. Why are these Sequences and Cycles Interesting? • Individual units form a single object • These were intended to be together • They collectively include the complete information • Despite being part of a collection, individuality is maintained

  8. Significance of Detecting Such Sequences and Cycles • Compression • Merge groups of pages • Fewer pages → fewer links • Pre-fetching • Know where the surfer wants to be next • Fetch the page(s) before they are requested • Saves time • Errors: pre-fetching wrong pages

  9. Significance of Detecting Such Sequences and Cycles (Contd.) • Fair comparison • Comparison independent of how content is presented • Content split across multiple pages should be treated as equivalent to the same content on a single page • Better retrieval • Retrieval independent of the presentation • Output a set of pages instead of a single one as a match

  10. Fair Comparison

  11. Fair Comparison

  12. Fair Comparison

  13. Improved Retrieval • Retrieve only portions of interest • Instead of whole (huge) documents • Avoid rewarding more content

  14. How to Detect Sequences and Cycles of Web Pages? • Find navigational links • Find consecutive pages • Define the condition that the elements of a sequence must satisfy • Identify subsequences (or units) • Concatenate • Check for cycles

  15. Finding Navigational Links: Background • The purpose of a link may be • Navigation • Reference • Advertisement • Links between pages on the same server are treated as navigational • Have also been treated as noise

  16. Finding Navigational Links: Our Method • Avoid assuming that links on the same server are navigational links • Navigational links appear mostly either at the top or at the bottom of a page • Navigational links are generally huddled together • Less text and fewer images appear around such links
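The heuristics on this slide could be approximated as follows. This is only a sketch under stated assumptions: a page is modelled as an ordered list of `("text", ...)` and `("link", url)` items, "huddled together" is read as a run of consecutive links, and the thresholds `min_run` and `edge_fraction` are invented for illustration. The presentation does not give the authors' actual rules or parameters.

```python
def navigational_links(items, min_run=2, edge_fraction=0.25):
    """Return URLs of links that sit in a run of links near the page's top or bottom.

    items: ordered list of ("text", str) or ("link", url) tuples (assumed model).
    min_run, edge_fraction: illustrative thresholds, not taken from the slides.
    """
    n = len(items)
    found = []
    i = 0
    while i < n:
        if items[i][0] != "link":
            i += 1
            continue
        # Extend j to the end of the current run of consecutive links.
        j = i
        while j < n and items[j][0] == "link":
            j += 1
        near_top = j <= n * edge_fraction          # run ends within the top part
        near_bottom = i >= n * (1 - edge_fraction)  # run starts within the bottom part
        if j - i >= min_run and (near_top or near_bottom):
            found.extend(url for _, url in items[i:j])
        i = j
    return found

# Hypothetical page: a link bar at the top, one reference link in the body,
# and a link bar at the bottom.
page = (
    [("link", "index.html"), ("link", "prev.html"), ("link", "next.html")]
    + [("text", "body")] * 4
    + [("link", "ref.html")]
    + [("text", "body")] * 2
    + [("link", "prev.html"), ("link", "next.html")]
)
nav = navigational_links(page)
```

The isolated `ref.html` link in the middle of the page is not reported, because it is neither part of a run nor near an edge. Note that the rule never looks at the server name, so a link bar pointing to another server would be picked up as well.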

  17. Advantages and Limitations • Simple and fast • Navigational links across servers are also identified • Heuristics need not always work – fall back on sophisticated methods

  18. Units of the Sequences • ABC is a unit if C is “related” to B in the same way as B is “related” to A • “related” is defined in terms of how they are linked • Relation is stored as “position” of the link • Several ways of defining “position”
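One possible reading of this definition is sketched below. It assumes that "position" means the index of a link among the outgoing links of its page, which is only one of the "several ways" the slide mentions; the function `find_units` and the sample pages are hypothetical.

```python
def find_units(out_links):
    """Find triples (A, B, C) where A links to B and B links to C at the same position.

    out_links: page -> ordered list of link targets (assumed representation).
    "Position" is taken to be the index in that list, which is an assumption.
    """
    units = []
    for a, a_links in out_links.items():
        for pos, b in enumerate(a_links):
            b_links = out_links.get(b, [])
            if pos < len(b_links):
                c = b_links[pos]
                if len({a, b, c}) == 3:  # require three distinct pages
                    units.append((a, b, c))
    return units

# Hypothetical tutorial pages: link 0 goes to an index page, link 1 to the next page.
pages = {
    "A": ["index", "B"],
    "B": ["index", "C"],
    "C": ["index", "D"],
    "D": ["index"],
}
units = find_units(pages)
```

Here the "next" link always occupies position 1, so `A, B, C` and `B, C, D` are reported as units. The index page is not in `pages`, so nothing is inferred through it.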

  19. Combining the units into sequences • Units: DEF, BCD, ABC, CDE • Combined sequence: ABCDEF
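The slide's example can be reproduced by chaining units that overlap in two pages. The sketch below assumes that each pair of adjacent pages has at most one successor and that a sequence starts at a unit whose first two pages do not continue any other unit; `combine_units` is a hypothetical helper, not the authors' procedure.

```python
def combine_units(units):
    """Chain overlapping triples such as ABC + BCD into longer sequences.

    Assumes each adjacent pair (X, Y) is followed by at most one page.
    Pure cycles have no starting unit and are left to a separate cycle check.
    """
    succ = {(a, b): c for a, b, c in units}     # pair -> page that follows it
    continued = {(b, c) for _, b, c in units}   # pairs that extend an earlier unit
    sequences = []
    for a, b, _ in units:
        if (a, b) in continued:
            continue  # this unit lies inside a longer sequence
        seq = [a, b]
        seen = {(a, b)}
        while (seq[-2], seq[-1]) in succ:
            seq.append(succ[(seq[-2], seq[-1])])
            pair = (seq[-2], seq[-1])
            if pair in seen:  # guard against looping forever on a cycle
                break
            seen.add(pair)
        sequences.append(seq)
    return sequences

example = [tuple("DEF"), tuple("BCD"), tuple("ABC"), tuple("CDE")]
combined = combine_units(example)
```

With the four units from the slide, given in any order, the only starting unit is ABC, and following the overlaps yields the single sequence A, B, C, D, E, F.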

  20. Cycle detection • Existing cycle detection algorithms • Cycle detection in number theory • Special case of cycle detection in graph theory • Stack based algorithm
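The slide does not say which stack-based algorithm is meant. As an illustration only, the sketch below uses a standard iterative depth-first search that keeps the current path on an explicit stack and reports a cycle when it meets a page that is still on that path; `find_cycle` and the sample graph are hypothetical.

```python
def find_cycle(graph):
    """Return one cycle as a list of pages (first == last), or None if there is none.

    graph: page -> iterable of pages it links to (assumed representation).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {v: WHITE for v in graph}
    for root in graph:
        if color[root] != WHITE:
            continue
        path = [root]                 # pages on the current DFS path
        stack = [iter(graph[root])]   # one link iterator per page on the path
        color[root] = GREY
        while stack:
            for nxt in stack[-1]:
                state = color.get(nxt, WHITE)
                if state == GREY:
                    # nxt is already on the path: the path from nxt back to nxt is a cycle.
                    return path[path.index(nxt):] + [nxt]
                if state == WHITE:
                    color[nxt] = GREY
                    path.append(nxt)
                    stack.append(iter(graph.get(nxt, ())))
                    break
            else:
                # All links of the top page are explored; retire it.
                color[path.pop()] = BLACK
                stack.pop()
    return None

cyclic = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
acyclic = {"A": ["B"], "B": []}
```

Calling `find_cycle(cyclic)` follows A, B, C and then finds the link back to A, while `find_cycle(acyclic)` exhausts every link without revisiting the current path.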

  21. Improvements and Speedups • Trust the “rel” information provided by the (author of the) pages • Use keywords like “next” and “previous” to perceive the relationships • Utilize the naming convention of the pages
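The first two shortcuts can be sketched with Python's standard `html.parser` module. The class below is a hypothetical illustration: it records a link as "next" or "prev" when its `rel` attribute says so, or when its anchor text is exactly one of a few keywords. The keyword table and the sample HTML are assumptions, and the presentation does not describe how the authors implemented this step.

```python
from html.parser import HTMLParser

# Illustrative keyword table; "previous" is normalised to "prev".
KEYWORDS = {"next": "next", "prev": "prev", "previous": "prev"}

class NextPrevParser(HTMLParser):
    """Collect 'next'/'prev' targets from rel attributes and anchor text."""

    def __init__(self):
        super().__init__()
        self.relations = {}      # "next" or "prev" -> href (first match wins)
        self._open_href = None   # href of the <a> element currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if tag in ("a", "link") and href:
            for rel in (attrs.get("rel") or "").lower().split():
                if rel in KEYWORDS:
                    self.relations.setdefault(KEYWORDS[rel], href)
            if tag == "a":
                self._open_href, self._text = href, []

    def handle_data(self, data):
        if self._open_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open_href is not None:
            label = "".join(self._text).strip().lower()
            if label in KEYWORDS:
                self.relations.setdefault(KEYWORDS[label], self._open_href)
            self._open_href = None

sample = (
    '<head><link rel="next" href="ch2.html"></head>'
    '<body><a href="intro.html">Previous</a> <a href="other.html">See also</a></body>'
)
parser = NextPrevParser()
parser.feed(sample)
relations = parser.relations
```

In the sample, the `rel="next"` hint supplies the next page and the anchor text "Previous" supplies the previous one; the "See also" link is ignored.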

  22. Experimental Results • Data • Toy data: Python tutorial in HTML • Tutorial split into several chapters and sections • Several cycles • Mutilated data • Certain pages deleted (missing links) • 100% detection in all cases

  23. Other experiments planned • Real test: unorganized web pages • Difficulties: • Finding navigational links • Noise (advertisements, etc.) • Dynamically generated pages • Will the relationships hold?

  24. Leads us to … • Concatenate detected sequences for analysis • Modify retrieval mechanism • Return sets of pages as results • Improve mirror/duplicate detection

  25. Future Work • Consider other relations • Unifying framework? • Improve identification of navigational links

  26. Thank You
