Web Storage: Permanence of Information

Web Storage: Permanence of Information By Laura Milodin IST 497

Overview • Introduction • Problems • Solutions • Some alternatives • Conclusion

Introduction • Through publication we preserve and transmit our knowledge and culture. • Electronic media offer a clear advantage in transmitting knowledge. • However, preserving that knowledge poses some new challenges.

The Problem • At some time in the future, we may want some information that we have today. • We want that information to be efficiently available to us in the future.

Web permanence • Narrow sense -- we have control of a particular piece of web content which we want to remain accessible to Web users. • Broader sense -- we want to save everything of value on the Web in order to preserve our culture. • Today's presentation looks at the problem in the narrow sense of Web permanence.

Actions we might have to take 1. We have to protect the content from unexpected disaster. 2. We have to protect it from known, slower-acting deterioration. 3. We have to keep the content accessible to users. 4. We have to maintain, not distort, the importance of the content stored.

“Importance” of the content • Whatever value society places on something should not be affected by how we store it. • We might, however, store something differently because of its importance. • The storage process itself should do nothing that favors recovery of less important content over more important content.

Distortion of “importance” • Hardware needed for access may become less available over time, thus favoring content that can be read by newer equipment. • Some content might be easier to access than other, more important content. • Some content might be much more likely to be discovered by accident than other, more important content.

Solutions • Intermemory is a noncommercial, decentralized concept using a scheme similar to barter. • An alternative to commercial archives or archival services offered by large libraries.

What is Intermemory -- barter? • There is no central organization; rather, each subscriber donates a certain amount of storage space for a limited time and in return receives the right to archive a much smaller amount. • Hard numbers are difficult to calculate because many parameters are involved.

Intermemory basics • The Intermemory is a very large, distributed, self-organizing memory consisting of the combined memory of all subscribers, addressed by a single addressing scheme. • The addresses correspond to blocks of N words of w bits each.

How It Works • Redundancy and dispersal provide the protection from unexpected disaster. • When a new subscriber rebuilds the data of a discontinued subscriber, the equipment of the whole system is gradually updated. If one processor fails, its data is reconstructed automatically by a new or other existing subscriber.

Redundancy and Dispersal • A particular data block exists in its entirety only at a single processor. • The address of this block is used to retrieve the data under normal circumstances. • Portions of the data in this block are also dispersed among many other processors; these portions are used to rebuild the original data in case of failure.

Space-optimal dispersal • The mathematics behind the dispersal algorithm involves polynomial evaluation and interpolation. It is based on the idea of associating every block of N words with a polynomial of degree N-1. The values assumed by this polynomial at N distinct points uniquely identify it.

Retrieval • We calculate more points than we need, say 2*N points. • We disperse these 2*N points among many processors. • If we lose the original word block, we only need to recover any N of the 2*N points to recover the polynomial and find the original block that corresponds to it.
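
The evaluation and interpolation steps can be illustrated with a minimal sketch. It is not the Intermemory implementation: it assumes 8-bit words, treats the N words as the coefficients of the polynomial, and does all arithmetic modulo the prime 257 (an assumed choice, so that every word value fits in the field). The names disperse and recover are hypothetical.

```python
P = 257  # assumed prime modulus, larger than any 8-bit word value


def disperse(block, n_points):
    """Evaluate the block's polynomial at x = 1..n_points (mod P).

    block[k] is taken as the coefficient of x**k, so a block of N words
    gives a polynomial of degree N-1.
    """
    points = []
    for x in range(1, n_points + 1):
        y = 0
        for coeff in reversed(block):  # Horner's rule, highest power first
            y = (y * x + coeff) % P
        points.append((x, y))
    return points


def poly_mul_linear(poly, root):
    """Multiply poly (lowest power first) by (x - root), mod P."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i] = (out[i] - root * c) % P
        out[i + 1] = (out[i + 1] + c) % P
    return out


def recover(points):
    """Lagrange-interpolate N surviving points back to the N words."""
    n = len(points)
    coeffs = [0] * n
    for i, (xi, yi) in enumerate(points):
        basis = [1]   # numerator of the i-th Lagrange basis polynomial
        denom = 1     # its denominator, product of (xi - xj)
        for j, (xj, _) in enumerate(points):
            if j != i:
                basis = poly_mul_linear(basis, xj)
                denom = (denom * (xi - xj)) % P
        scale = yi * pow(denom, -1, P) % P  # modular inverse, Python 3.8+
        for k in range(n):
            coeffs[k] = (coeffs[k] + scale * basis[k]) % P
    return coeffs


# Usage: N = 4 words, 2*N = 8 dispersed points, any 4 of them suffice.
block = [72, 101, 108, 108]
points = disperse(block, 2 * len(block))
survivors = [points[1], points[4], points[6], points[7]]
assert recover(survivors) == block
```

In this sketch a dispersed value can be as large as 256, so it does not always fit back into an 8-bit word; a real system would need a field that matches the word size.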

Space requirement • At the first level of replication, dispersal takes twice the original block size. • At the second level of replication, it takes four times the original block size. • The total space requirement is 9 times the original.

Degree of dispersal • The degree of maximum dispersal is a key variable here. • Dispersal on the scale proposed in the previous slide would not be needed for a model where processor failures were independent. • This model assumes the possibility of software bugs, viruses, & overt adversarial action.

Uniform Resource Locators • Another alternative is the URL. • The goal here is not to physically maintain hardware, or a readable copy of content on that hardware, but rather to maintain a link. • We are assuming that a readable copy exists & we want to be able to link to it even when its physical location or file structure changes.

Uniform Resource Name, URN • The general solution to this problem is the development of Uniform Resource Names or URNs. • A parallel situation exists with books in a library. We could attempt to describe a book as on the fourth floor, third aisle from west end, top shelf, second from end. Or we could give the book a name and maintain some way of resolving the name into its location.

Persistent URL (PURL) • PURLs are one possible solution to the problem of developing URNs. • PURLs look and function much like URLs, but instead of pointing directly to the location of an Internet resource, a PURL points to an intermediate resolution service. • Ex: PURL: http://purl.oclc.org/NET/frankdill/index.html URL: http://members.aol.com/aiaio/index.html

PURL Server • The process of resolution now involves one extra step in which a PURL server associates the PURL with a unique URL which is returned to the client. • The extra step is an HTTP redirect. • The key to a PURL server is indirection, not redirection: naming something in order to separate identification from location.

PURL database • PURLs must be maintained. • A change to the PURL database is required if the owner of a file moves that file. • If the file is completely removed, of course, the PURL is as useless as the URL.
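
The resolution step and the database maintenance it depends on can be sketched as a simple lookup table. This is only an illustration under stated assumptions, not the OCLC PURL server: the class PurlResolver and its methods are hypothetical names, the 302 and 404 status codes are an assumed choice of redirect and not-found responses, and the "moved" URL on example.org is invented for the example.

```python
class PurlResolver:
    """Toy PURL database: maps a persistent name to its current URL."""

    def __init__(self):
        self._table = {}

    def register(self, purl, url):
        """Add a PURL, or update it after the owner moves the file."""
        self._table[purl] = url

    def remove(self, purl):
        """Drop a PURL whose target no longer exists anywhere."""
        self._table.pop(purl, None)

    def resolve(self, purl):
        """Return (status, location) as an HTTP redirect would.

        302 stands in for the redirect response; 404 means the PURL
        is not in the database.
        """
        url = self._table.get(purl)
        if url is None:
            return 404, None
        return 302, url


# Usage with the PURL/URL pair from the earlier slide.
resolver = PurlResolver()
purl = "http://purl.oclc.org/NET/frankdill/index.html"
resolver.register(purl, "http://members.aol.com/aiaio/index.html")
status, location = resolver.resolve(purl)

# The owner moves the file (hypothetical new location): only the
# database entry changes, while the PURL handed out to readers stays the same.
resolver.register(purl, "http://example.org/aiaio/index.html")
moved_status, moved_location = resolver.resolve(purl)
```

The point of the sketch is the indirection: clients only ever hold the PURL, so a move costs one table update instead of a broken link.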

Some alternatives • Meanwhile, there is a website that archives web pages posted on the World Wide Web. • Its name is the Internet Archive. • Its URL is http://www.archive.org.

Conclusion • Introduction of web storage • Problems • Solutions • Some alternatives for right now

References: • Towards an Archival Intermemory by Goldberg & Yianilos http://citeseer.nj.nec.com/goldberg98towards.html • Introduction to Persistent Uniform Resource Locators by Shafer, Weibel, Jul & Fausey http://purl.oclc.org/docs/inet96.html • Web Storage: The Permanence of Information by F. Dill http://external.nj.nec.com/homepages/krovetz/timetable.html

Any Questions?
