
Evaluating Methods to Rediscover Missing Web Pages from Web Infrastructure

Martin Klein & Michael L. Nelson, Old Dominion University



Presentation Transcript


  1. Evaluating Methods to Rediscover Missing Web Pages from Web Infrastructure -- Martin Klein & Michael L. Nelson Old Dominion University

  2. Looks familiar?

  3. “Moved” but not lost • Reasons for “404” • Change in website structure • Original webpage relocated in the same website • Server/domain name issues • Original webpage captured by other websites

  4. “Moved” but not lost

  5. Rediscovering Missing Webpages • Search-based solutions • URL • Lexical Signature (LS) • Title • Social bookmarking tags • Link Neighbourhood Lexical Signature (LNLS)

  6. Evaluation • Corpus • 500 random samples from Open Directory Project • “Pretend” to be missing • Search Engines: • Google/Yahoo/MSN • Metric • Percentage of webpages rediscovered from the top-N search results (N=1, 2-10, 11-100)

  7. Results • LS • Majority either rediscovered in top-10 or undiscovered • Yahoo!: 67.6% top-1, 7.5% top-2-10, 22% undiscovered • Title • Similar distribution but with more webpages rediscovered • Google: 69.3% top-1, 8.1% top-2-10, 19.7% undiscovered • Unquoted better than quoted • Tags and LNLS • Poor performance from both

  8. Results • Combining LS and Title • Better performance than any single method • Yahoo! uniformly outperforms the rest • 76.4% top-1, 7.8% top-2-10, 13.6% undiscovered • Title analysis • Titles of 3 to 6 words are the most frequent and perform well • Further improvement by removing stopwords

  9. Research Insights • Common but non-trivial problem • Simple methodology • Detailed, multi-step evaluation
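
The metric from slide 6 and the title query from slides 7 and 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names, the tiny stopword list, and the convention that a rank of `None` (or beyond 100) means "undiscovered" are all hypothetical choices made for this example.

```python
from collections import Counter
from typing import Iterable, Optional

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on", "from"}

# Bucket names follow the top-N ranges named on the Evaluation slide.
BUCKETS = ("top-1", "top-2-10", "top-11-100", "undiscovered")


def title_query(title: str, quoted: bool = False) -> str:
    """Build a search query from a page title with stopwords removed."""
    words = [w for w in title.split() if w.lower() not in STOPWORDS]
    query = " ".join(words)
    return f'"{query}"' if quoted else query


def bucket(rank: Optional[int]) -> str:
    """Map a 1-based result rank to a bucket; None means the page was not found."""
    if rank is None or rank > 100:
        return "undiscovered"
    if rank < 1:
        raise ValueError("rank must be a positive integer or None")
    if rank == 1:
        return "top-1"
    if rank <= 10:
        return "top-2-10"
    return "top-11-100"


def rediscovery_rates(ranks: Iterable[Optional[int]]) -> dict[str, float]:
    """Return the percentage of pages that fall into each bucket."""
    counts = Counter(bucket(r) for r in ranks)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one rank is required")
    return {name: 100.0 * counts[name] / total for name in BUCKETS}


if __name__ == "__main__":
    # Made-up ranks for five pages, purely to show the output shape.
    print(rediscovery_rates([1, 1, 7, 42, None]))
    print(title_query("Evaluating Methods to Rediscover Missing Web Pages"))
```

In an evaluation like the one described, each "missing" page would contribute one rank per search engine and method (the position at which its URL appears in the results), and the resulting percentages would be compared across methods.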
