Storage Resource Broker: Overview and History

Explore the features, misconceptions, and history of the Storage Resource Broker (SRB) system, providing automated data support for grids, digital libraries, archives, and more.


Presentation Transcript


  1. Storage Resource Broker Overview • Reagan W. Moore • moore@sdsc.edu • http://www.sdsc.edu/srb

  2. SRB Developers • Reagan Moore - PI • Michael Wan - SRB Architect • Arcot Rajasekar - SRB Manager • Wayne Schroeder - SRB Productization • Charlie Cowart - inQ • Lucas Gilbert - Jargon • Bing Zhu - Perl, Python, Windows • Antoine de Torcy - mySRB web browser • Sheau-Yen Chen - SRB Administration • George Kremenek - SRB Collections • Arun Jagatheesan - Matrix workflow • Marcio Faerman - SCEC Application • Sifang Lu - ROADnet Application • Richard Marciano - SALT persistent archives • 75 FTE-years of support • About 300,000 lines of C

  3. SRB Objectives • Automate all aspects of data discovery, access, management, analysis, preservation • Provide distributed data support for • Data sharing - data grids • Data publication - digital libraries • Data preservation - persistent archives • Data collections - Real time sensor data

  4. Misconceptions • SRB is not a file system • Although it does have a Unix I/O library interface • SRB is not a backup system • Although it does support replicas, versions, and snapshots of files and containers • SRB is not a database mediation system • Although it does support remote SQL execution • SRB is a shared collection

  5. Digital Entities • Register digital entities into a metadata catalog • Files • Binary large objects in databases • Objects in Object Ring Buffers • URLs • SQL command strings • Shadow links to existing directories

  6. History • 1995 - DARPA Massive Data Analysis Systems • 1997 - DARPA/USPTO Distributed Object Computation Testbed • 1998 - NSF National Partnership for Advanced Computational Infrastructure • 1998 - DOE Accelerated Strategic Computing Initiative data grid • 1999 - NARA persistent archive • 2000 - NASA Information Power Grid • 2001 - NLM Digital Embryo digital library • 2001 - DOE Particle Physics data grid • 2001 - NSF Grid Physics Network data grid • 2001 - NSF National Virtual Observatory data grid • 2002 - NSF National Science Digital Library persistent archive • 2003 - NSF Southern California Earthquake Center digital library • 2003 - NIH Biomedical Informatics Research Network data grid • 2003 - NSF Real-time Observatories, Applications, and Data management Network • 2004 - NSF ITR, Constraint based data systems • 2005 - LC Digital Preservation Lifecycle Management • 2005 - LC National Digital Information Infrastructure and Preservation program

  7. Development • SRB 1.1.8 - December 15, 2000 • Basic distributed data management system • Metadata Catalog • SRB 2.0 - February 18, 2003 • Parallel I/O support • Bulk operations • SRB 3.0 - August 30, 2003 • Federation of data grids • SRB 3.3.1 - April 6, 2005 • Feature requests (extensible schema)

  8. US Academic Institutions (2005)

  9. US Academic Institutions (2005)

  10. International Institutions (2005)

  11. International Institutions (2005)

  12. Federated Server Architecture • [Diagram, steps numbered 1 to 6] • Application issues a read using a logical name or attribute condition • Peer-to-peer brokering between SRB servers • Server(s) spawning SRB agents • MCAT: 1. Logical-to-physical mapping 2. Identification of replicas 3. Access & audit control • Parallel data access to storage resources R1 and R2

  13. Storage Resource Broker 3.3.1 • [Layered architecture diagram, top to bottom] • Application • C Library, Java • Unix Shell • DLL / Python, Perl, Windows • Linux I/O • C++ • NT Browser, Kepler Actors • HTTP, DSpace, OpenDAP, GridFTP • OAI, WSDL, (WSRF) • Federation Management • Consistency & Metadata Management / Authorization, Authentication, Audit • Logical Name Space • Latency Management • Data Transport • Metadata Transport • Database Abstraction • Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix • Storage Repository Abstraction • Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix • File Systems - Unix, NT, Mac OSX • Archives - Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS • ORB

  14. SRB Latency Management • [Diagram: data moving from source across the network to destination] • Remote proxies, staging • Data aggregation • Containers • Prefetch • Caching • Client-initiated I/O • Streaming • Parallel I/O • Replication • Server-initiated I/O

  15. Latency Management - Bulk Operations • Bulk register • Create a logical name for a file • Load context (metadata) • Bulk load • Create a copy of the file on a data grid storage repository • Bulk unload • Provide containers to hold small files and pointers to each file location • Bulk delete • Trash can • Sticky bits for access control, …
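
The container idea on this slide (hold many small files in one unit, with a pointer to each file's location) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Container` class, its method names, and the example logical names are hypothetical and are not SRB APIs.

```python
import io


class Container:
    """Hypothetical container: one byte stream plus an index of pointers."""

    def __init__(self):
        self._blob = io.BytesIO()   # aggregated data, moved as a single unit
        self._index = {}            # logical name -> (offset, length)

    def add(self, logical_name, data):
        """Append a small file and record a pointer to its location."""
        if logical_name in self._index:
            raise KeyError(f"{logical_name} already in container")
        offset = self._blob.seek(0, io.SEEK_END)
        self._blob.write(data)
        self._index[logical_name] = (offset, len(data))

    def read(self, logical_name):
        """Return one file's bytes using its stored pointer."""
        offset, length = self._index[logical_name]
        self._blob.seek(offset)
        return self._blob.read(length)

    def names(self):
        """List the logical names held in the container."""
        return sorted(self._index)


# Example names are made up for illustration.
container = Container()
container.add("/home/demo/a.txt", b"alpha")
container.add("/home/demo/b.txt", b"bravo!")
```

Aggregating small files this way means a wide-area transfer or an archive write handles one large object instead of many small ones, which is the latency benefit the slide points to.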

  16. Authentication / Authorization • Collection owned data • At each remote storage system, an account ID is created under which the data grid stores files • User authenticates to the data grid • Data grid checks access controls • Data grid server authenticates to a remote data grid server • Remote data grid server authenticates to the remote storage repository

  17. Authentication Mechanisms • Grid Security Infrastructure • PKI certificates • Challenge-response mechanism • No passwords sent over network • Ticket • Valid for specified time period or number of accesses • Generic Security Service API • Authentication of server to remote storage
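
A challenge-response exchange lets a client prove it knows a secret without sending the secret over the network. The sketch below shows the general technique using HMAC-SHA256 over a random nonce. The slides do not describe SRB's actual wire protocol, so the function names and the choice of HMAC are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets


def make_challenge():
    """Server side: generate a fresh random nonce for this login attempt."""
    return secrets.token_bytes(32)


def respond(password, challenge):
    """Client side: derive a response from the password and the challenge.

    Only this digest crosses the network, never the password itself.
    """
    return hmac.new(password.encode("utf-8"), challenge, hashlib.sha256).digest()


def verify(stored_password, challenge, response):
    """Server side: recompute the expected response, compare in constant time."""
    expected = respond(stored_password, challenge)
    return hmac.compare_digest(expected, response)


challenge = make_challenge()
good = verify("s3cret", challenge, respond("s3cret", challenge))
bad = verify("s3cret", challenge, respond("guess", challenge))
```

Because each challenge is random, a captured response cannot be replayed against a later login attempt.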

  18. Terminology • 1998 • Data Grid, a data management system that organizes distributed data into collections • Data Grid - data virtualization • 2000 • Persistent Archive, a data management system that handles technology evolution • Persistent Archive - infrastructure independence

  19. Generic Data Management • Data grids provide capabilities needed by digital libraries and persistent archives • Infrastructure independence to manage a collection distributed across multiple storage systems • Descriptive metadata to describe authenticity context of each file • Administrative metadata to maintain integrity • Location, replicas, checksums, audit trails, ownership, access controls, versions, locks, pinning, aggregation • Data grids are implemented as middleware • Does not require root for installation

  20. Infrastructure Independence • Data virtualization • Management of name spaces independently of the storage repositories • Global name spaces • Persistent identifiers • Collection-based ownership of files • Support for access operations independently of the storage repositories • Separation of access methods from storage protocols • Support for integrity and authenticity operations

  21. Separation of Access Method from Storage Protocols Access Method Map from the operations used by the access method to a standard set of operations used to interact with the storage system Access Operations Data Grid Storage Operations Storage Protocol Storage System
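
The mapping described on this slide, from access-method operations to a standard set of storage operations, can be sketched as a small driver interface. Everything here is a hypothetical illustration: `StorageDriver`, `InMemoryDriver`, `DataGrid`, and the resource names are assumptions, not SRB classes.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Standard set of storage operations (hypothetical interface)."""

    @abstractmethod
    def put(self, physical_path, data):
        """Store bytes under the name the repository requires."""

    @abstractmethod
    def get(self, physical_path):
        """Return the bytes stored under that name."""


class InMemoryDriver(StorageDriver):
    """Stand-in for a file system, archive, or database blob driver."""

    def __init__(self):
        self._store = {}

    def put(self, physical_path, data):
        self._store[physical_path] = bytes(data)

    def get(self, physical_path):
        return self._store[physical_path]


class DataGrid:
    """Maps access-method calls onto whichever driver holds the data."""

    def __init__(self):
        self._drivers = {}

    def register_resource(self, name, driver):
        self._drivers[name] = driver

    def write(self, resource, path, data):
        self._drivers[resource].put(path, data)

    def read(self, resource, path):
        return self._drivers[resource].get(path)


# The same access calls work regardless of the storage system behind them.
grid = DataGrid()
grid.register_resource("unix-fs", InMemoryDriver())
grid.register_resource("tape-archive", InMemoryDriver())
grid.write("unix-fs", "/data/x", b"hello")
grid.write("tape-archive", "/vault/x", b"hello")
```

Adding a new storage system then means writing one more driver, with no change to the access methods layered on top.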

  22. Data Grid Operations • File access • Open, close, read, write, seek, stat, synch, … • Audit, versions, pinning, checksums, synchronize, … • Parallel I/O and firewall interactions • Versions, backups, replicas • Latency management • Bulk operations • Register, load, unload, delete, … • Remote procedures • HDFv5, data filtering, file parsing, replicate, aggregate • Metadata management • SQL generation, schema extension, XML import and export, browsing, queries • GGF, “Operations for Access, Management, and Transport at Remote Sites”

  23. Examples of Extensibility • The 3 fundamental APIs are C library, shell commands, Java • Other access mechanisms are ported on top of these interfaces • API evolution • Initial access through C library, Unix shell command • Added inQ Windows browser (C++ library) • Added mySRB Web browser (C library and shell commands) • Added Java (Jargon) • Added Perl/Python load libraries (shell command) • Added WSDL (Java) • Added OAI-PMH, OpenDAP, DSpace digital library (Java) • Added Kepler actors for dataflow access (Java) • Added GridFTP version 3.3 (C library)

  24. Examples of Extensibility • Storage Repository Driver evolution • Initially supported Unix file system • Added archival access - UniTree, HPSS • Added FTP/HTTP • Added database blob access • Added database table interface • Added Windows file system • Added project archives - Dcache, Castor, ADS • Added Object Ring Buffer, Datascope • Added GridFTP version 3.3 • Database management evolution • Postgres • DB2 • Oracle • Informix • Sybase • mySQL (most difficult port - no locks, no views, limited SQL)

  25. Logical Name Spaces Data Access Methods (C library, Unix, Web Browser) • Storage Repository • Storage location • User name • File name • File context (creation date,…) • Access constraints Data access directly between application and storage repository using names required by the local repository

  26. Logical Name Spaces Data Access Methods (C library, Unix, Web Browser) Data Collection • Storage Repository • Storage location • User name • File name • File context (creation date,…) • Access constraints • Data Grid • Logical resource name space • Logical user name space • Logical file name space • Logical context (metadata) • Control/consistency constraints Data is organized as a shared collection
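
The logical name spaces on this slide can be pictured as a catalog that maps a logical file name to one or more physical replicas, carries context metadata, and enforces access constraints before revealing where the data lives. The sketch below makes that concrete. The `Catalog`, `Entry`, and `Replica` types, and all names and paths in the usage lines, are hypothetical and are not the MCAT schema.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str        # logical resource name
    physical_path: str   # name required by the local repository


@dataclass
class Entry:
    owner: str                                     # logical user name
    replicas: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)   # logical context
    readers: set = field(default_factory=set)      # access constraints


class Catalog:
    """Hypothetical metadata catalog keyed by logical file name."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, owner, replica, **metadata):
        """Add a replica; the first registration fixes the owner."""
        entry = self._entries.setdefault(logical_name, Entry(owner=owner))
        entry.replicas.append(replica)
        entry.metadata.update(metadata)

    def grant_read(self, logical_name, user):
        self._entries[logical_name].readers.add(user)

    def resolve(self, logical_name, user):
        """Check access, then map the logical name to its physical replicas."""
        entry = self._entries[logical_name]
        if user != entry.owner and user not in entry.readers:
            raise PermissionError(f"{user} may not read {logical_name}")
        return list(entry.replicas)


name = "/zone/home/alice/run1.dat"
catalog = Catalog()
catalog.register(name, "alice", Replica("site-a-disk", "/vault/0001"), created="2005-04-06")
catalog.register(name, "alice", Replica("site-b-tape", "/archive/a/17"))
catalog.grant_read(name, "bob")
replicas = catalog.resolve(name, "bob")
```

Because users only ever see the logical name, a file can be moved or replicated to another repository by updating the catalog, without changing how it is referenced.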

  27. Federation Between Data Grids Data Access Methods (Web Browser, DSpace, OAI-PMH) Data Collection A Data Collection B • Data Grid • Logical resource name space • Logical user name space • Logical file name space • Logical context (metadata) • Control/consistency constraints • Data Grid • Logical resource name space • Logical user name space • Logical file name space • Logical context (metadata) • Control/consistency constraints Access controls and consistency constraints on cross registration of digital entities

  28. Data Grids • Largest single data grids • ROADnet real-time sensor network, links 100 object ring buffers supporting 10,000 sensors • BaBar high energy physics, distributes SLAC collections to Lyon, France and Rome, Italy • Biomedical Informatics Research Network, links data resources across 25 institutions within the US • Largest data grid federations • KEK high-energy physics federation of 7 data grids between Japan, South Korea, China, Taiwan, Australia, Poland, US • WUN federation of 5 academic institutions (SDSC, NCSA, U Bergen, U Southampton, U Manchester) • NARA Research Prototype Persistent Archive (SDSC, U Maryland, NARA)

  29. Peer-to-Peer Zones • [Diagram of zone federation patterns, each annotated with its degree of user-ID sharing, resource sharing, metadata synchronization, and access control] • Patterns shown: Free Floating, Occasional Interchange, Replicated Data, Resource Interaction, Nomadic, User and Data Replica, Snow Flake, Replicated Catalog, Master Slave, Replication Zones, Archival, Hierarchical Zones • Annotations shown: Partial User-ID Sharing, Partial Resource Sharing, No Metadata Synch, Hierarchical Zone Organization, One Shared User-ID, System Set Access Controls, System Controlled Complete Synch, Complete User-ID Sharing, System Managed Replication, System Controlled Partial Synch, No Resource Sharing, Connection From Any Zone, Complete Resource Sharing, Super Administrator Zone Control

  30. Types of Risk • Media failure • Replicate data onto multiple media • Vendor specific systemic errors • Replicate data onto multiple vendor products • Operational error • Replicate data onto a second administrative domain • Natural disaster • Replicate data to a geographically remote site • Malicious user • Replicate data to a deep archive

  31. How Many Replicas • Three sites minimize risk • Primary site • Supports interactive user access to data • Secondary site • Supports interactive user access when first site is down • Provides 2nd media copy, located at a remote site, uses different vendor product, independent administrative procedures • Deep archive • Provides 3rd media copy, staging environment for data ingestion, no user access
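
The risk list on the previous slide and the three-site layout on this one amount to a checklist: each risk is mitigated by copies that differ along one dimension. A minimal sketch of that checklist follows. The `Copy` record, the `uncovered_risks` function, and the placeholder vendor and site labels are hypothetical. Treating "multiple media" as "at least two distinct media labels" is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    media: str          # e.g. "disk" or "tape"
    vendor: str
    admin_domain: str
    site: str
    user_access: bool   # False marks a deep archive


def uncovered_risks(copies):
    """Return the risk types that the given set of copies does not mitigate."""
    missing = []
    if len({c.media for c in copies}) < 2:
        missing.append("media failure")
    if len({c.vendor for c in copies}) < 2:
        missing.append("vendor-specific systemic error")
    if len({c.admin_domain for c in copies}) < 2:
        missing.append("operational error")
    if len({c.site for c in copies}) < 2:
        missing.append("natural disaster")
    if all(c.user_access for c in copies):
        missing.append("malicious user")   # no deep archive without user access
    return missing


# Placeholder labels, not real vendors or sites.
primary = Copy("disk", "vendor-a", "admin-1", "site-1", True)
secondary = Copy("tape", "vendor-b", "admin-2", "site-2", True)
deep = Copy("tape", "vendor-b", "admin-3", "site-3", False)
```

In this sketch a primary and secondary site cover every risk except the malicious user, and adding a deep archive with no user access closes that last gap, which matches the slide's three-site recommendation.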

  32. Chronopolis Preservation Facility • Federation of Three Independent Data Grids • [Diagram: sites labeled NCAR, U Md, SDSC, each with an MCAT] • Deep Archive at NARA, no user access but complete copy • Replicated copy at U Md for improved access, load balancing and disaster recovery • Active archive at SDSC, user access • Demonstrate preservation environment • Authenticity • Integrity • Management of technology evolution • Mitigation of risk of data loss • Replication of data • Federation of catalogs • Management of preservation metadata • Scalability • 3 collections / year • Support 100 TBs per site

  33. Data Management Systems • Digital Libraries • DSpace services for ingestion and description of files - ported on top of SRB data grid • Fedora relationship management services, planned port on top of SRB data grid • Cheshire integration on top of SRB data grid • OpenDAP interface to the SRB data grid • Persistent archives • Manage authenticity, integrity, and infrastructure independence • Integrate preservation processes on top of SRB data grid
