1 / 45

Collection-based Persistent Archives

Collection-based Persistent Archives. Reagan W. Moore Associate Director, Data Intensive Computing San Diego Supercomputer Center moore@sdsc.edu http://www.npaci.edu/DICE. Topics. Experiences learned building a prototype Persistent Archive Information model

jayden
Download Presentation

Collection-based Persistent Archives

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Collection-based Persistent Archives Reagan W. Moore Associate Director, Data Intensive Computing San Diego Supercomputer Center moore@sdsc.edu http://www.npaci.edu/DICE

  2. Topics • Experiences learned building a prototype Persistent Archive • Information model • Hierarchical levels of information • Interoperability mechanisms • Application to workshop topics • Ingestion methodology • Data set identification • Certification of archives

  3. Persistent Archive Goals • Provide collection based archive • Data set relevance is organized by the collection • Provide information model to describe the context for the data collection • Enough information is needed to be able to dynamically create the collection from archived information • Decouple collection creation from digital object archiving • Provide accessioning system to turn data sets into digital objects • Accessioning is independent of the final collection

  4. NARA Persistent Archive Prototype • Demonstrate ability to ingest, archive, recreate, query, and present a digital object from a 1 million record E-mail collection (RFC1036) • 2.5 GB of data • 6 required fields • 13 optional fields • User defined fields (over 1000) • Determine information model needed for persistent archive

  5. Key Concepts Learned • Information model • Semi-structured representation of information - XML • Infrastructure independent representation of information context - XML DTD • Differentiation between information context for digital objects,collection and presentation • DTD for objects • DTD for collection • XSL style sheets for presentation • Instantiation software for creating the collection from the information model • XML databases now appearing

  6. Hierarchy of Information Contexts • Digital object context • Meta-data to define the structure of the object • When publishing a digital object, must also publish the context of the object • Use collections to organize objects • Meta-data to define the structure of the collection • When publishing a collection, must also publish the information needed to organize the collection. • Use presentation context to control access • Meta-data to define structure of presentation

  7. XML DTD for E-mail

  8. Formatted Message Using XML DTD

  9. Key Concepts Learned • Digital object encapsulation • Minimize the number of times a digital object must be touched • Once archived, a digital object should only be retrieved when requested by a user • Implies meta-data stored with digital objects should only describe the objects • Collection and presentation meta-data should be stored separately

  10. Persistent Archive Requirements • Distributed environment to ensure separable components • Accession workbench • Archive • Presentation platform • Data handling mechanisms for interoperability as basis for system evolution • No tightly coupled systems • Unique names are only used by the data handling system • Use of containers to aggregate digital objects for storage • Implies a hierarchical naming scheme • Collection / container / digital object

  11. FTP FTP Electronic Records Archive (ERA) TRANSFER ACCESSION ARCHIVES REFERENCE Accessioning Work Bench (snapin) Reference Workbench (snapin) Retrieve Records Media Handlers Catalog METADATA REPOSITORY RECORDS REPOSITORY Internet Intranet Text Image Photo Video Audio Geographical Information System Compound Records WEB Database Arrangement A R C Query & Reference Tools TAPE TAPE CD U N W R A P P E R CD W R A P P E R DISK DISK record Presentation Metadata wrapper Order Fulfillment

  12. CEED / ESA NASA Catalog DPOSS Sky Survey REINAS 2MASS Sky Survey U Md Archive ADL NS Dig Lib Elib - Flora Wash. Brain Image UCLA Brain Image MSU Brain Image UCSD Neuroscience ESS Dig Lib MS Dig Lib Wash U Genome U H Mol Trajectory Protein Data Bank Federation of Data Collections into Digital Libraries UC Calif Finding Aids NARA Persistent Archive AMICO Image Library UMDL Social Science U Wisc. Video Lib. Pacific Rim DL

  13. Conclusions • Ingestion • Infrastructure independent representation for digital objects • Infrastructure independent representation for information model • Turn data sets into digital objects by adding attribute tags • Aggregate digital objects in containers for storage

  14. Conclusions • Data set identification • Unique names only required by data handling system • Attribute based access through collection • Hierarchical naming • Collection / Container / Digital object • Finding Aid for collection / Data handling system ID for container / Unique ID for object

  15. Conclusion • Certification of persistent archive • Demonstrate that can provide infrastructure independent representation for • Finding aids for locating collections • Information model for building collection • Data handling system container Ids for storage access • Digital object attribute tags • Demonstrate that can use information models to create finding aids, collections, and access interfaces on new technology • Demonstrate that can independently migrate any component of architecture

  16. Further Information http://www.npaci.edu/DICE

  17. NARA Persistent Archive

  18. Context Based Objects • For data to be useful, the context must be defined • Data format - binary/integer representation • Physical meaning - units • Structure - geometry • Relevance - feature annotation • Semantics - data dictionary for attributes • Context is preserved as meta-data attributes

  19. Information Models for Organization of Data Digital Object Attributes Collection Attributes Presentation Attributes

  20. Information Models for Access to Data Presentation of data from multiple digital libraries Collections from federated databases Digital object Attributes

  21. Common Information Model • eXtensible Markup Language (XML) • Use tags to define semantic context for components of the data set • Document Type Definition (DTD) • Provides semi-structured representation for organizing tags that can be applied to groups of digital objects • Development of standards for tags • Digital sky, Protein Data Bank, Neuroscience brain images • California Digital Library - Art Museum Image Consortium

  22. Information Management Hierarchy • Presentation / Information Discovery / Analysis • Visualization - Shastra, 3D visualization tools • Presentation information model - XSL style sheet • Collection organization • Meta-data catalog - MCAT • Collection information model - XML DTD • Data handling • Storage Resource Broker - SRB • Storage • Archival storage system - HPSS • Digital object model - XML DTD

  23. Open Grid Architecture to Encourage Interoperability Application Data Model Management Remote Procedure Execution Information Discovery Data Handling Systems Dynamic Info Discovery Storage System Description Storage Resources

  24. Technology Sources • Archive Community • IEEE Mass Storage Systems Technical Committee • Scalable storage systems • Digital Library Community • NSF Digital Library Initiative, Phase II • Information management mediation - XML • Supercomputer Community • Scalable analysis platforms • Grid Forum • Data handling systems for interoperability • Archivist Community / Library Community • Management policies and standards

  25. Technology Sources Application Data Model Management Digital Library Remote Procedure Execution Information Discovery Data Handling Systems Dynamic Info Discovery Storage System Description Storage Resources Computational Grid

  26. Information Management Architecture • Digital library community technologies • Distributed information resources • Digital library interoperability protocols - SDLIP • Mediation of information using XML - MIX • Grid Forum technologies • Support for distributed services / procedures • Inter-realm authentication • GSI Grid Security Infrastructure • Data handling system • Storage Resource Broker, Meta-data Catalog

  27. Grid Forum Data Access Architecture API that provides “glue” to underlying data handling systems (security, scheduling, QoS, access protocol, data format/model, adaptivity, info discovery, location control) Application + authentication + authorization Data Model Management Remote Procedure Execution Armada D’agents, FEL, ADR GRAM, SRB, Java, CORBA Information Discovery Data Handling Systems LDAP, Database, Flat file, Object database Condor, GASS, NILE, SRB, I-2 caching, ADR Dynamic Info Discovery Storage System Description API that provides “glue” to underlying storage, QoS, etc. [GASS, IBP, SRB] Storage Resources DPSS, DFS, NFS HPSS, ADSM, DMF, Unitree, NASstore, DB2, Oracle, Informix, Sybase, O2, ObjectStore, Objectivity GloPerf, Netlogger, NWS DTD, ADR, object class

  28. Data Handling System Capabilities • SDSC Storage Resource Broker • Protocol transparency • Common API for access to remote data resources • Explicit drivers for each type of storage system • Name transparency • Attribute based access to data • Location transparency • Distribution of collection across multiple physical resources • Time transparency • Minimization of latency for data access

  29. File SID DBLobj SID Obj SID SRB Unix DB2 Oracle ADSM HPSS SDSC Storage Resource Broker & Meta-data Catalog Application Resource User MCAT Dublin Core Application Meta-data

  30. Time Transparency • How to minimize latency • Prefetch data to local high performance disk, so that all accesses can be done at high speed from local resources • How to maximize access rate • Composite or aggregate data into a single data set to avoid multiple accesses • Stream data at high rates using parallel I/O, amortizing the access latency by the volume of data that is delivered. • How to avoid congestion • Replicate data across multiple servers

  31. SRB Containers - Managing Archive Latency SRB client • Create container in a logical storage resource containing at least one “cacheable” resource • Create objects in containers • “Cache” daemon will move filled containers to archive • synch and purge API’s SRB Server UNIX HPSS HPSS container cached containers Distributed Storage Resources

  32. Generality of Information Infrastructure • Same information model needed to manage • Federation in space • Metacomputing environment • Interoperable services for digital libraries • Migration over time • Collection creation and update • Persistent archive • Same storage systems needed to support • Supercomputer center data • Discipline specific data collections • Digital library collections

  33. Art Museum Image Consortium • Demonstrated • Support for heterogeneous digital objects • Automated conversion of meta-data to XML DTD • Validation of meta-data • XSL style sheet for presenting information

  34. AMICO Meta-data Conversion to XML

  35. AMICO Presentation Interface

  36. National Partnership for Advanced Computational Infrastructure • Facilitate the conduct of science through development of knowledge resources • Publish - Data collection infrastructure • Info discovery - Digital Library infrastructure • Data access - Data handling infrastructure • Apply to federal, state, and university projects • NSF / DOE / NASA / USPTO / NARA / Census Bureau • California Digital Library • UCSD - Pacific Rim Digital Library Alliance

  37. Data Storage Archival Storage Applications Collection Building Information Management Digital Library Digital Sky Neuroscience Protein Data Bank Molecular Structures Earth Systems Science CDL UCB - Elib UCSB - ADL Stanford - SDLIP U Michigan - UMDL Publishing Scientific Data Applications Libraries

  38. NPACI is a National Partnership of Partnerships 46 institutions 20 states 4 countries 5 national labs Many projects (new and old) Vendors and industry Government agencies

  39. National Partnership for Advanced Computational Infrastructure • Provide Teraflops / Petabyte capable systems for use by national academic community • Current systems at the San Diego Supercomputer Center • 250 Gflops peak computation rate • IBM SP, CRAY T3E • 250 Terabyte archive capacity, 100 TB in archive • High Performance Storage System • By end of year • 1 TFlop peak computation rate • IBM SP • 500 Terabyte archive capacity

  40. Challenges • Facilitate access to high-end resources • Support data intensive computing • Facilitate access to distributed data resources • Support information discovery • Minimize complexity of user interfaces • Provide unifying data access system • Requires information management infrastructure

  41. Bio-Informatics

  42. Art Museum Image Consortium - AMICO

  43. National Virtual Observatory

  44. California Digital Library

More Related