NSDL Persistent Archive Service
Charles Cowart
SDSC Summer Institute, August 25th, 2004
What is the Archive Service? • A system of OAI harvesting, web crawling, and data management tools used to copy or ‘back up’ data hosted on the web. • An interface through which NSDL and other communities can access that data over the web, as though it were the original.
Why is it needed? • Many research projects have made valuable data available to the scientific and educational community on the World Wide Web. • Educators may base parts of their curricula on data that could change or be lost should a website be taken offline. • PAS offers a history of snapshots that can be relied upon should the original change or disappear.
How does it work? • The NSDL makes available a ‘card catalog’ of valuable websites via the web and an XML-based protocol called OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting). • Each website is represented as an OAI record, which looks something like this:
Anatomy of an OAI Record, part 1
<record>
  <header>
    <identifier>oai:arXiv.org:cs/0112017</identifier>
    <datestamp>2002-02-28</datestamp>
    <setSpec>math</setSpec>
  </header>
  <metadata>
    <oai_dc:dc ...misc schema references...>
Anatomy of an OAI Record, part 2
      <dc:title>Using Structural Metadata</dc:title>
      <dc:creator>Dushay, Naomi</dc:creator>
      <dc:subject>Digital Libraries</dc:subject>
      <dc:description>blahblah</dc:description>
      <dc:identifier>http://arXiv.org/abs/cs/0112017</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>
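Records like this one are served in batches by a standard OAI-PMH ListRecords request. The verb and metadataPrefix parameters below are part of the OAI-PMH protocol; the base URL is a placeholder, not the actual NSDL endpoint:

http://nsdl.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc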
How does it work? • A program called a Harvester collects the records from the NSDL by connecting to a web server in much the same way a web browser does. • The Harvester extracts the URLs from those records, filtering out malformed or otherwise bad ones, as well as URLs for sites we are not allowed to crawl, as sketched below.
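A minimal sketch of that harvesting step, in Python. The Dublin Core namespace and the dc:identifier element come from the record shown earlier; the placeholder endpoint and the crawler name passed to robots.txt are assumptions:

import urllib.request
import urllib.robotparser
from urllib.parse import urlparse
from xml.etree import ElementTree as ET

DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"  # Dublin Core namespace

def harvest(endpoint):
    """Fetch one batch of OAI records and yield their crawlable URLs."""
    request = endpoint + "?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(request) as response:
        tree = ET.parse(response)
    for element in tree.iter(DC_IDENTIFIER):
        url = (element.text or "").strip()
        if is_crawlable(url):
            yield url

def is_crawlable(url):
    """Drop malformed URLs and sites whose robots.txt excludes us."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
    try:
        robots.read()
    except OSError:
        return False
    return robots.can_fetch("nsdl-pas-crawler", url)  # crawler name is hypothetical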
The Crawlmaster • The Harvester passes the extracted URLs to another program called the Crawlmaster. • The Crawlmaster’s job is to oversee the execution of some number n of web crawlers. • The Crawlmaster ensures a constant flow of crawling activity; as each job completes, it issues a new one (see the sketch below).
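The dispatch pattern might look like the following sketch, where a process pool stands in for the n crawlers; the pool size and the crawl function are illustrative, not the actual implementation:

from concurrent.futures import ProcessPoolExecutor, FIRST_COMPLETED, wait

N_CRAWLERS = 8  # illustrative pool size

def run_crawlmaster(urls, crawl):
    """Keep N_CRAWLERS crawl jobs in flight; as each completes, issue another."""
    with ProcessPoolExecutor(max_workers=N_CRAWLERS) as pool:
        in_flight = set()
        for url in urls:
            in_flight.add(pool.submit(crawl, url))
            if len(in_flight) < N_CRAWLERS:
                continue
            # Block until at least one crawler finishes, then refill the pool.
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for job in done:
                job.result()  # re-raise any crawler error
        wait(in_flight)  # drain the remaining jobs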
The Crawlers • Crawlers are processes responsible for collecting and transforming the data made available through an initial URL. • Each crawler captures the data referenced by the initial URL, and if that data is HTML, collects the material it references as well as subsequent HTML pages. • The Crawler uses a set of heuristics to gauge how much to collect and how much to leave alone (a plausible scoping rule is sketched below). • Once the crawling is complete, each HTML page’s links are rewritten to point to our copied material, rather than the original.
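The talk does not spell out the heuristics; one plausible sketch is a breadth-first crawl that stays on the starting host and bounds the link depth. The fetch, extract_links, and store helpers are hypothetical stand-ins:

from collections import deque
from urllib.parse import urlparse

MAX_DEPTH = 3  # illustrative bound; the real heuristics are more involved

def crawl(start_url, fetch, extract_links, store):
    """Breadth-first capture of a site, staying on the starting host."""
    host = urlparse(start_url).netloc
    queue, seen = deque([(start_url, 0)]), {start_url}
    while queue:
        url, depth = queue.popleft()
        content_type, body = fetch(url)          # hypothetical HTTP helper
        store(url, body)                         # hypothetical archive write
        if content_type != "text/html" or depth >= MAX_DEPTH:
            continue
        for link in extract_links(url, body):    # hypothetical HTML link parser
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))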
The Web Page, Recoded
Inside a collected web page, this:
<A HREF="http://www.priweb.org/ed/earthtrips/Edisto/edfossil.html">Look at additional fossils from Edisto Beach</A>
Becomes this:
<A HREF="http://srb.npaci.edu/cgi-bin/nsdl.cgi?uid=/2004-01-07T01:59:16Z/24B150FF509302235AEAAF4894557D1D/edfossil.html">Look at additional fossils from Edisto Beach</A>
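That rewriting step might be implemented along these lines; the portal prefix is taken from the example above, while the uid_for helper that maps an original URL to its archive uid is hypothetical:

import re

PORTAL = "http://srb.npaci.edu/cgi-bin/nsdl.cgi?uid="
HREF_RE = re.compile(r'(<a\b[^>]*\bhref=")([^"]+)(")', re.IGNORECASE)

def rewrite_links(html, uid_for):
    """Point every href at the archived copy instead of the live site.

    uid_for(url) is a hypothetical helper mapping an original URL to its
    archive uid, like the '/2004-01-07T01:59:16Z/.../edfossil.html' above.
    """
    def replace(match):
        return match.group(1) + PORTAL + uid_for(match.group(2)) + match.group(3)
    return HREF_RE.sub(replace, html)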
The Storage Resource Broker • Once a crawl is complete, the material is stored in SDSC’s Storage Resource Broker. • The Storage Resource Broker allows us to store, manage, and easily retrieve very large collections of files. • Each collection in the SRB is associated with the original URL and OAI record ID issued by the NSDL and obtained by the Harvester.
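The uid in the rewritten link above appears to combine a crawl timestamp, a 32-character hex component, and the original file name. A sketch under that reading, with the MD5-of-URL derivation being purely an assumption:

import hashlib

def archive_uid(crawl_time, site_url, filename):
    """Build a uid like '/2004-01-07T01:59:16Z/<32 hex chars>/edfossil.html'.

    Deriving the middle component from an MD5 of the site URL is an
    assumption based only on its length; the actual scheme may differ.
    """
    digest = hashlib.md5(site_url.encode("utf-8")).hexdigest().upper()
    return "/" + crawl_time + "/" + digest + "/" + filename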
Accessing the Archive Service • The archived collections are served up through an HTTP portal to the SRB. • The HTTP portal is implemented as a web CGI interface which takes as a parameter the OAI record ID issued by the NSDL. • The portal then returns either the initial page of the latest archived snapshot directly, or XML data describing the entire archive history of that record.
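A minimal sketch of such a CGI handler; the fetch_from_srb helper is a hypothetical stand-in for the actual SRB retrieval call, and the parameter name uid is taken from the rewritten links shown earlier:

import os
import sys
from urllib.parse import parse_qs

def fetch_from_srb(uid):
    """Hypothetical stand-in for the SRB retrieval call."""
    raise NotImplementedError  # the real portal reads the archived file from SRB

def main():
    query = parse_qs(os.environ.get("QUERY_STRING", ""))
    uid = query.get("uid", [""])[0]
    if not uid:
        sys.stdout.write("Status: 400 Bad Request\r\n\r\nmissing uid\n")
        return
    content_type, body = fetch_from_srb(uid)
    sys.stdout.write("Content-Type: " + content_type + "\r\n\r\n")
    sys.stdout.flush()
    sys.stdout.buffer.write(body)  # body is raw bytes from the archive

if __name__ == "__main__":
    main()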
Some Statistics • As of July 15th, 2004, the NSDL Persistent Archive contained: • 3,008 Gigabytes of data • In 21,420,181 files • Representing ~91,000 distinct URLs • Spanning eight months, from December 2003 to July 2004 • The archive is complete over that entire span