Robust Hyperlinks Thomas A. Phelps Robert Wilensky
Problem: Broken Links • Links dangle because the resource has been deleted, renamed, moved, or otherwise changed. • Proposed solutions: • additional naming schemes (URNs, handles, Persistent Uniform Resource Locators (PURLs), or Common Names) • monitoring and notification to ensure referential integrity (e.g., Ingham et al. (1996), Mind-it, Macskassy and Shklar (1997), Francis et al. (1995)) • None have been widely adopted, probably because they depend on administrative buy-in. • Archiving the web (Alexa) would be a solution • if you believe a persistent complete archive scales • and are comfortable with all your errors archived forever • and have no data behind a firewall.
Proposed Solution: “Robust Hyperlinks” • Design links so that, even when the target moves (and mutates), one can still locate it with high probability. • Note that burden is on link creator, so that • no administrative buy-in is required • links can be made robust on a piecemeal basis.
Requirements for Robust Hyperlinks • Provide high likelihood of successful dereferencing when an item is moved, but largely unchanged. • Performance should degrade gracefully as document content changes from its state at the time the hyperlink was created. • Should not impose a performance penalty when not used. • Storage required for a robust hyperlink must be trivially small • so that it is practical to make all URLs robust. • Implementation in clients or via proxies should be straightforward so as to encourage widespread adoption. • Should be largely non-interfering with clients and services that do not support them. • Making a hyperlink robust should be cheap and fully automated. • An author should be able to point to a hyperlink or site, and have it automatically become robust.
General Idea: Enhance URLs with “signatures” • Add to a URL a “signature”, a small piece of document content. • When “traditional” (i.e., address-based) dereferencing fails, do “signature-based” (i.e., content-based) dereferencing: • Pass the signature to some search service, and hope that the target will be prominent among a very small result set. • Two issues: • Computing small, yet effective signatures • Adding them innocuously to hyperlinks
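The two-stage lookup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `fetch` and `search` are hypothetical callables supplied by the caller (an HTTP client that returns `None` on failure, and a wrapper around some search service), so the sketch runs without network access.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces, assumed for illustration only:
#   fetch(url)        -> page content, or None if address-based lookup fails
#   search(signature) -> candidate URLs, best match first
Fetch = Callable[[str], Optional[str]]
Search = Callable[[List[str]], List[str]]


def dereference(url: str, signature: List[str],
                fetch: Fetch, search: Search) -> List[str]:
    """Try the address first; fall back to the content signature."""
    if fetch(url) is not None:
        # Traditional dereferencing succeeded, so the signature is unused.
        return [url]
    # Address-based lookup failed: ask a search service for documents
    # matching the signature terms and return its candidates.
    return search(signature)
```

Because the fallback only runs after a failed fetch, a working link costs nothing extra, which matches the requirement that robustness impose no penalty when it is not used.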
Computing Small, Effective Signatures • “Lexical” signatures: The top n words of a document according to TF-IDF measure. • Almost all the desiderata are obviously met. • Question is, how big a signature is needed to locate a document more or less uniquely on the Web? • Inktomi says there are approximately 1 billion web pages now.
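As a rough sketch of a lexical signature, the function below ranks a document's words by a TF-IDF score against a small reference corpus. The tokenizer, the smoothed IDF formula, and the alphabetical tie-breaking are assumptions made for this example; the slides only say "top n words according to TF-IDF" and do not fix a weighting.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(doc: str, corpus: List[str], n: int = 5) -> List[str]:
    """Return the n terms of `doc` with the highest TF-IDF score."""
    term_freq = Counter(tokenize(doc))
    doc_term_sets = [set(tokenize(d)) for d in corpus]
    num_docs = len(doc_term_sets)

    def idf(term: str) -> float:
        # Smoothed inverse document frequency; terms found in every
        # corpus document score zero.
        doc_freq = sum(1 for terms in doc_term_sets if term in terms)
        return math.log((1 + num_docs) / (1 + doc_freq))

    scores = {t: count * idf(t) for t, count in term_freq.items()}
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:n]
```

In practice the document frequencies would come from a web-scale index rather than a three-document list; the corpus argument here only stands in for that source.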
Answer: 5 words! • I.e., a signature of 5 words will, in most cases, cause search engines to return the target document within the top few hits. • Actually, a smaller signature would probably suffice to locate exact matches, but the extra length provides robustness and leaves room for growth.
Some Examples • Signature for Randy Katz’s home page is • “Californa ISRG Culler rimmed gaunt” • Here is what happens when we feed this signature to HotBot:
Another Example • Signature for Endeavour home page is • “amplifies Endeavour leverages Charting Expedition” • Here is what happens when we feed this signature to Google:
Examples (cont’d) In most cases, only one or two documents are returned by the more “stringent” searches.
Content-based Dereferencing Strategies • Query a set of engines, and return the top few results from each one. • In our sample set, each hyperlink is successfully dereferenced by this strategy. • Make "stringent" queries to one or more engines; if fails, make progressively less stringent queries. • Performing the Google query, and then performing the Alta Vista query if Google fails, locates the desired reference in all but one case.
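The second strategy can be sketched as a loop over progressively relaxed queries. Two points here are assumptions, since the slides do not define them: a "stringent" query is taken to mean one requiring all signature terms, and relaxing it means dropping the lowest-ranked term. Each engine is a hypothetical callable that takes the required terms and returns matching URLs.

```python
from typing import Callable, Iterator, List

# Hypothetical engine interface: required terms in, matching URLs out.
Engine = Callable[[List[str]], List[str]]


def relaxed_queries(signature: List[str],
                    min_terms: int = 2) -> Iterator[List[str]]:
    """Yield term lists from most to least stringent."""
    for k in range(len(signature), min_terms - 1, -1):
        # Assumes the signature is ordered best term first, so the
        # weakest term is dropped at each step.
        yield signature[:k]


def locate(signature: List[str], engines: List[Engine],
           max_hits: int = 3) -> List[str]:
    """Return the first non-empty result, trying stringent queries first."""
    for terms in relaxed_queries(signature):
        for engine in engines:  # e.g. one engine first, then a backup
            hits = engine(terms)
            if hits:
                return hits[:max_hits]
    return []
```

Trying every engine at one stringency level before relaxing the query is one possible ordering; exhausting one engine's relaxations before moving to the next would be an equally plausible reading of the slide.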
Why Does This Work? • If there are 500,000 distinct terms in the Web, then the number of distinct combinations of 5 terms is greater than 3x10^28. • If the web is populated by documents whose most characteristic terms are uniformly drawn at random, the probability that more than one document matches a set of 5 characteristic terms is very small. • There is lots of room for randomness to be off. • Many documents contain very infrequently used words.
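The arithmetic behind this argument can be checked directly. The sketch below assumes the slide's figure counts ordered selections of 5 distinct terms, and it also computes the smaller unordered count for comparison. The one-billion page count is the Inktomi estimate quoted earlier in the deck; the uniform-draw model is the slide's own simplifying assumption.

```python
import math

VOCABULARY = 500_000       # distinct terms, from the slide
SIGNATURE_LEN = 5          # terms per signature
WEB_PAGES = 1_000_000_000  # Inktomi estimate quoted in the deck

# Ordered selections of 5 distinct terms: about 3.1 x 10^28.
ordered = math.perm(VOCABULARY, SIGNATURE_LEN)

# Unordered sets of 5 distinct terms: 5! = 120 times fewer.
unordered = math.comb(VOCABULARY, SIGNATURE_LEN)

# Under the uniform model, the expected number of other pages sharing
# one page's 5-term set is roughly (pages - 1) / (number of sets).
expected_collisions = (WEB_PAGES - 1) / unordered

print(f"ordered:   {ordered:.3e}")
print(f"unordered: {unordered:.3e}")
print(f"expected collisions per page: {expected_collisions:.1e}")
```

Even with the more conservative unordered count, the expected number of accidental matches per page is many orders of magnitude below one, which is why there is "lots of room for randomness to be off".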
Augmenting URLs with Signatures • Robust URLs (incompatible) • Make a new syntax, a la XPointer. • Robust URLs (mostly compatible) http://www.something.dom/a/b/c?lexical-signature="w1+w2+w3+w4+w5" • Aware agents (client or proxy) strip the signature before traditional dereferencing. • It turns out that widely used servers mostly ignore the extra parameter • True for all of our examples • and for Apache (50% of the market), Microsoft Internet Information Server (24%), and Netscape Enterprise (7%) • and for many cgi-bin scripts. • Robust Link Elements <a href="http://www.something.dom/a/b/c" lexical-signature="w1+w2+w3+w4+w5">click here</a> • Completely transparent to unaware clients. • Not amenable to a proxy agent; can’t easily be passed around.
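For the "mostly compatible" form, an aware agent needs to add a signature to a URL and strip it again before traditional dereferencing. The sketch below uses Python's standard `urllib.parse`; the parameter name `lexical-signature` comes from the slide, while the helper names and the choice to emit the value without surrounding quotes are assumptions of this example.

```python
from typing import List, Tuple
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PARAM = "lexical-signature"  # parameter name shown on the slide


def make_robust(url: str, signature: List[str]) -> str:
    """Append the signature as a query parameter, replacing any old one."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != PARAM]
    # urlencode turns the spaces between terms into '+', as on the slide.
    query.append((PARAM, " ".join(signature)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def strip_signature(url: str) -> Tuple[str, List[str]]:
    """Return (plain URL, signature terms) for a possibly robust URL."""
    parts = urlsplit(url)
    signature: List[str] = []
    kept = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == PARAM:
            # Tolerate the quoted form that appears on the slide.
            signature = value.strip('"').split()
        else:
            kept.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(kept))), signature
```

A usage example: `make_robust("http://www.something.dom/a/b/c", ["w1", "w2", "w3", "w4", "w5"])` yields the slide's URL without the quotes, and `strip_signature` recovers the original address plus the five terms. Note that other query parameters are re-encoded on the way through, which may normalize their escaping.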
Support for Robust Hyperlinks • Ideally, support in browser. • Current version of MVD supports robust URLs. • Generate signatures and “robust reports”. • All bookmarks are automatically made robust. • Does signature-based dereferencing when traditional fails. • Robust Proxy Module • Strips signatures from robust URLs. • If the server returns error code 404, engages in signature-based dereferencing. • Can also sign unsigned URLs, redirect to client, so robust URL can be bookmarked, etc. • Robust Proxy Service http://www.myproxyserver.dom/cgi-bin?url="http://www.something.dom/a/b/c"&lex-signature=w1+w2+w3+w4+w5 • All agents can maintain URL remapping list. • Would be nice if search engines supported robustness in various ways.
Limitations • Non-indexed documents • behind firewalls, cgi-bin scripts • non-indexed data types • Non-textual resources • Resources with highly variable content • Duplicates • Variation in search engine performance
Extensions • Signature Creation • Examine signatures upon creation, improve. • Incrementally create signature to deal with changing content. • Signature Variation • To improve robustness, preclude • all signature terms from occurring in the same sentence • many words that appear to be misspellings (hard) • many words that occur only once in a document. • Signature Dereferencing Strategies • Compute signatures of documents in result set and match those. • Robust Hyperlink Agents • Perform signature-based dereferencing under some circumstances, even if traditional dereferencing succeeds. • Apply to other tasks, e.g., plagiarism detection.
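The "compute signatures of documents in result set and match those" strategy might look like the sketch below. It assumes the candidates' signatures have already been computed by some signature function, and it uses Jaccard overlap as the similarity measure; that measure is this example's choice, since the slides do not name one.

```python
from typing import Dict, List, Tuple


def jaccard(a: List[str], b: List[str]) -> float:
    """Size of the intersection over size of the union of two term sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rank_candidates(link_signature: List[str],
                    candidate_signatures: Dict[str, List[str]]
                    ) -> List[Tuple[str, float]]:
    """Order candidate URLs by how closely their signature matches the link's."""
    scored = [(jaccard(link_signature, sig), url)
              for url, sig in candidate_signatures.items()]
    # Best overlap first; ties broken by URL so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(url, score) for score, url in scored]
```

A document that has moved and changed slightly should still share most of its signature terms with the link, so it ranks above pages that merely happen to contain one or two of the words.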
Other Robust Reference Ideas • Robust Intra-document References • uses content of local context plus tree location • Robust hyperlinks work by building on top of other Internet resources (e.g., search engines). What other types of services might there be? (Your idea here.)