
Security Requirements for Shared Collections

This article discusses the security requirements and state-of-the-art technology for shared collections in the Storage Resource Broker (SRB), including trust virtualization, data integrity challenges, and deep archive implementation.





Presentation Transcript


  1. Security Requirements for Shared Collections Storage Resource Broker Reagan W. Moore moore@sdsc.edu http://www.sdsc.edu/srb

  2. Abstract • National and international collaborations create shared collections that span multiple administrative domains. • Shared collections are built using: • Data virtualization, the management of collection properties independently of the storage system. • Trust virtualization, the ability to assert authenticity and authorization independently of the underlying storage resources. • What security requirements are implied in trust virtualization? • How are data integrity challenges met? • How do you implement a “Deep Archive”?

  3. State of the Art Technology • Grid - workflow virtualization • Manage execution of jobs (processes) independently of compute servers • Data grid - data virtualization • Manage properties of a shared collection independently of storage systems • Semantic grid - information virtualization • Reason across inferred attributes derived from multiple collections.

  4. Shared Collections • Purpose of SRB data grid is to enable the creation of a shared collection between two or more institutions • Register digital entity into the shared collection • Assign owner, access controls • Assign descriptive, provenance metadata • Manage audit trails, versions, replicas, backups • Manage location mapping, containers, checksums • Manage verification, synchronization • Manage federation with other shared collections • Manage interactions with storage systems • Manage interactions with APIs

  5. Trust Virtualization • Collection owns the data that is registered into the data grid • Data grid is a set of servers, installed at each storage repository • Servers are installed under an SRB ID created for the shared collection • All accesses to the data stored under the SRB ID are through the data grid • Authentication and authorization are controlled by the data grid, independently of the remote storage system

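  6. Federated Server Architecture (diagram) • Client command: Sget, given a logical name or attribute condition • Components: two SRB servers, two SRB agents, MCAT, storage resources R1 and R2 • MCAT roles: 1. Logical-to-physical mapping 2. Identification of replicas 3. Access and audit control • Other labels: peer-to-peer brokering, server(s) spawning, parallel data access, data access • Arrows numbered 1 to 6 mark the order of steps in the original figure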

  7. Authentication / Authorization • Collection owned data • At each remote storage system, an account ID is created under which the data grid stores files • User authenticates to the data grid • Data grid checks access controls • Data grid server authenticates to a remote data grid server • Remote data grid server authenticates to the remote storage repository • SRB servers return the data to the client

  8. Authentication Mechanisms • Grid Security Infrastructure • PKI certificates • Challenge-response mechanism • No passwords sent over network • Ticket • Valid for specified time period or number of accesses • Generic Security Service API • Authentication of server to remote storage
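
The two ideas on this slide, a challenge-response exchange that never sends the password over the network and a ticket limited by time or number of accesses, can be sketched as follows. This is a minimal illustration and not SRB's actual protocol: the HMAC-SHA256 construction, the function names, and the `Ticket` class are all assumptions, and the ticket here enforces both limits at once for simplicity.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass


def make_challenge() -> bytes:
    """Server side: create a fresh random nonce for one login attempt."""
    return secrets.token_bytes(32)


def respond(secret: bytes, challenge: bytes) -> bytes:
    """Client side: prove knowledge of the secret without transmitting it."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()


def verify(secret: bytes, challenge: bytes, response: bytes) -> bool:
    """Server side: recompute the expected response, compare in constant time."""
    return hmac.compare_digest(respond(secret, challenge), response)


@dataclass
class Ticket:
    """Hypothetical ticket: valid until expires_at and for uses_left accesses."""
    expires_at: float  # absolute time in seconds
    uses_left: int     # remaining permitted accesses

    def redeem(self, now: float) -> bool:
        """Consume one access if the ticket is still valid at time `now`."""
        if now >= self.expires_at or self.uses_left <= 0:
            return False
        self.uses_left -= 1
        return True
```

Only the nonce and the keyed digest cross the network, so an eavesdropper who records one exchange cannot replay it against a different challenge.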

  9. Trust Implementation • For authentication to work across multiple administrative domains • Require collection-managed names for users • For authorization to work across multiple administrative domains • Require collection-managed names for files • Result is that access controls remain invariant. They do not change as data is moved to different storage systems under shared collection control
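
Why access controls stay invariant can be shown with a toy catalog that keys permissions on collection-managed (logical) names rather than on physical locations. The `Catalog` class and its methods are hypothetical and stand in for the real metadata catalog only for illustration.

```python
class Catalog:
    """Toy catalog: access controls are attached to logical names only."""

    def __init__(self):
        self.location = {}  # logical file name -> physical location
        self.readers = {}   # logical file name -> set of logical user names

    def register(self, logical_name, physical_location, readers):
        self.location[logical_name] = physical_location
        self.readers[logical_name] = set(readers)

    def move(self, logical_name, new_physical_location):
        # Only the logical-to-physical mapping changes; the ACL is untouched.
        self.location[logical_name] = new_physical_location

    def can_read(self, logical_user, logical_name):
        return logical_user in self.readers.get(logical_name, set())
```

Because `move` never touches `readers`, a user's rights are the same before and after the file migrates to another storage system.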

  10. Communication • Single port number for control messages • Specify desired control port • Parallel I/O streams are received using multiple port numbers • Specify range of port numbers • To deal with firewalls • Parallel I/O streams initiated from within firewall

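  11. Parallel Mode Data Transfer: Server Initiated (diagram) • Scenario: server behind firewall • Client command: Sput -m • Components: SRB server1, SRB server2, two SRB agents, MCAT, storage resource R • MCAT roles: 1. Logical-to-physical mapping 2. Identification of replicas 3. Access and audit control • Other labels: peer-to-peer request; srbObjPut + socket addr., port and cookie; connect to client; data transfer • Arrows numbered 1 to 6 mark the order of steps in the original figure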

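  12. Parallel Mode Data Transfer: Client Initiated (diagram) • Scenario: client behind firewall • Client command: Sput -M • Components: SRB server1, SRB server2, two SRB agents, MCAT, storage resource R • MCAT roles: 1. Logical-to-physical mapping 2. Identification of replicas 3. Access and audit control • Other labels: srbObjPut; return socket addr., port and cookie; connect to server; data transfer • Arrows numbered 1 to 7 mark the order of steps in the original figure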

  13. Bulk Load Operation (diagram) • Scenario: client behind firewall • Client command: Sput -b • Components: SRB server1, SRB server2, two SRB agents, MCAT, storage resource R • MCAT roles: 1. Logical-to-physical mapping 2. Identification of replicas 3. Access and audit control • Other labels: query resource; return resource location; bulk data transfer thread; 8 Mb buffer; bulk registration threads; store data in a temp file; bulk register; unfold temp file • Arrows numbered 1 to 6 mark the order of steps in the original figure

  14. Third Party Data Transfer (diagram) • Scenario: Server1 behind firewall, or Server1 and Server2 behind the same firewall • Client command: Scp • Components: SRB server, SRB server1, SRB server2, three SRB agents, MCAT, two storage resources R • Other labels: srbObjCopy; dataPut - socket addr., port and cookie; connect to server2; data transfer • Arrows numbered 1 to 6 mark the order of steps in the original figure

  15. Example International Institutions (2005)

  16. KEK Data Grid • Japan • Taiwan • South Korea • Australia • Poland • US • A functioning, general purpose international Data Grid for high-energy physics Manchester-SDSC mirror

  17. BaBar High-energy Physics • Stanford Linear Accelerator • Lyon, France • Rome, Italy • San Diego • RAL, UK • A functioning international Data Grid for high-energy physics Manchester-SDSC mirror Moved over 100 TBs of data

  18. BIRN - Biomedical Informatics Research Network • Link 17 research hospitals and universities to promote sharing of data • Install 1-2 Terabyte disk caches at each site • Register all material into a central catalog • Manage access controls on data, metadata, and resources • Manage information about each image • Audit sharing of data by institution • Extend system as needed • Add disk space dynamically • Add new metadata attributes • Add new sites

  19. NIH BIRN SRB Data Grid • Biomedical Informatics Research Network • Access and analyze biomedical image data • Data resources distributed throughout the country • Medical schools and research centers across the US • Stable high performance grid based environment • Coordinate data sharing • Federate collections • Support data mining and analysis

  20. HIPAA Patient Confidentiality • Access controls on data • Access controls on metadata • Audit trails • End-to-end encryption (manage keys) • Localization of data to specific storage • Access controls do not change when data is moved to another storage system under data grid control

  21. Infrastructure Independence • Data virtualization • Management of name spaces independently of the storage repositories • Global name spaces • Persistent identifiers • Collection-based ownership of files • Support for access operations independently of the storage repositories • Separation of access methods from storage protocols • Support for integrity and authenticity operations

  22. Separation of Access Method from Storage Protocols Access Method Map from the operations used by the access method to a standard set of operations used to interact with the storage system Access Operations Data Grid Storage Operations Storage Protocol Storage System
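
The mapping described on this slide resembles a driver layer: access methods call a small standard set of storage operations, and each repository type supplies its own implementation. The sketch below is an assumption-laden illustration, not SRB's driver interface; `StorageDriver`, the two in-memory drivers, and `copy` are hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class StorageDriver(ABC):
    """Standard set of storage operations that every repository must offer."""

    @abstractmethod
    def get(self, path: str) -> bytes: ...

    @abstractmethod
    def put(self, path: str, data: bytes) -> None: ...


class FileSystemLikeDriver(StorageDriver):
    """Stands in for a file system: one entry per path."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def get(self, path: str) -> bytes:
        return self._files[path]

    def put(self, path: str, data: bytes) -> None:
        self._files[path] = data


class BlobTableLikeDriver(StorageDriver):
    """Stands in for a database blob table: rows are appended, newest wins."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, bytes]] = []

    def get(self, path: str) -> bytes:
        for key, blob in reversed(self._rows):
            if key == path:
                return blob
        raise KeyError(path)

    def put(self, path: str, data: bytes) -> None:
        self._rows.append((path, data))


def copy(src: StorageDriver, src_path: str,
         dst: StorageDriver, dst_path: str) -> int:
    """Access-level operation written once, independent of storage protocol."""
    data = src.get(src_path)
    dst.put(dst_path, data)
    return len(data)
```

Adding a new repository type then means writing one more driver, while every access method built on `get` and `put` keeps working unchanged.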

  23. Data Grid Operations • File access • Open, close, read, write, seek, stat, synch, … • Audit, versions, pinning, checksums, synchronize, … • Parallel I/O and firewall interactions • Versions, backups, replicas • Latency management • Bulk operations • Register, load, unload, delete, … • Remote procedures • HDFv5, data filtering, file parsing, replicate, aggregate • Metadata management • SQL generation, schema extension, XML import and export, browsing, queries • GGF, “Operations for Access, Management, and Transport at Remote Sites”

  24. Examples of Extensibility • The 3 fundamental APIs are C library, shell commands, Java • Other access mechanisms are ported on top of these interfaces • API evolution • Initial access through C library, Unix shell command • Added iNQ Windows browser (C++ library) • Added mySRB Web browser (C library and shell commands) • Added Java (Jargon) • Added Perl/Python load libraries (shell command) • Added WSDL (Java) • Added OAI-PMH, OpenDAP, DSpace digital library (Java) • Added Kepler actors for dataflow access (Java) • Added GridFTP version 3.3 (C library)

  25. Examples of Extensibility • Storage Repository Driver evolution • Initially supported Unix file system • Added archival access - UniTree, HPSS • Added FTP/HTTP • Added database blob access • Added database table interface • Added Windows file system • Added project archives - Dcache, Castor, ADS • Added Object Ring Buffer, Datascope • Added GridFTP version 3.3 • Database management evolution • Postgres • DB2 • Oracle • Informix • Sybase • mySQL (most difficult port - no locks, no views, limited SQL)

  26. Storage Resource Broker 3.3.1 (architecture diagram) • Access interfaces: C library, Java; Unix shell; DLL / Python, Perl, Windows; Linux I/O; C++; NT browser, Kepler actors; HTTP, DSpace, OpenDAP, GridFTP; OAI, WSDL, (WSRF) • Core services: federation management; consistency and metadata management / authorization, authentication, audit; logical name space; latency management; data transport; metadata transport • Abstraction layers: database abstraction; storage repository abstraction • Back ends: Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix; File systems - Unix, NT, Mac OSX; Archives - Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS; ORB

  27. Logical Name Spaces Data Access Methods (C library, Unix, Web Browser) • Storage Repository • Storage location • User name • File name • File context (creation date,…) • Access constraints Data access directly between application and storage repository using names required by the local repository

  28. Logical Name Spaces Data Access Methods (C library, Unix, Web Browser) Data Collection • Storage Repository • Storage location • User name • File name • File context (creation date,…) • Access constraints • Data Grid • Logical resource name space • Logical user name space • Logical file name space • Logical context (metadata) • Control/consistency constraints Data is organized as a shared collection

  29. Logical Resource Names • Logical resource name represents multiple physical resources • Writing to a logical resource name can result in: • Replication - write completes when each physical resource has a copy • Load leveling - write completes when the next physical resource in the list has a copy • Fault tolerance - write completes when “k” of “n” resources have a copy • Single copy - write is done to first disk at same IP address, then disk anywhere, then tape
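
Three of the write policies listed above can be sketched with a toy logical resource that fans a write out to in-memory physical resources. All class and method names are hypothetical. The fault-tolerant variant here attempts every resource and succeeds if at least `k` stored a copy, which is one possible reading of "k of n", and load leveling is approximated by simple round-robin.

```python
class PhysicalResource:
    """Toy storage system that can be switched offline to simulate failure."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.files = {}

    def store(self, key, data):
        if not self.online:
            raise OSError(f"{self.name} is offline")
        self.files[key] = data


class LogicalResource:
    """One logical name standing for several physical resources."""

    def __init__(self, resources):
        self.resources = list(resources)
        self._next = 0  # round-robin cursor for load leveling

    def write_replicated(self, key, data):
        """Completes only when every physical resource holds a copy."""
        for r in self.resources:
            r.store(key, data)  # any failure propagates to the caller
        return [r.name for r in self.resources]

    def write_load_leveled(self, key, data):
        """Completes when the next resource in the list holds a copy."""
        r = self.resources[self._next % len(self.resources)]
        self._next += 1
        r.store(key, data)
        return [r.name]

    def write_fault_tolerant(self, key, data, k):
        """Completes when at least k of the n resources hold a copy."""
        stored = []
        for r in self.resources:
            try:
                r.store(key, data)
                stored.append(r.name)
            except OSError:
                continue
        if len(stored) < k:
            raise OSError(f"only {len(stored)} of required {k} copies written")
        return stored
```

The caller writes to one logical name in every case; which physical systems receive the bytes is a property of the policy, not of the client code.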

  30. Federation Between Data Grids Data Access Methods (Web Browser, DSpace, OAI-PMH) Data Collection A Data Collection B • Data Grid • Logical resource name space • Logical user name space • Logical file name space • Logical context (metadata) • Control/consistency constraints • Data Grid • Logical resource name space • Logical user name space • Logical file name space • Logical context (metadata) • Control/consistency constraints Access controls and consistency constraints on cross registration of digital entities

  31. Authentication across Zones • Follow Shibboleth model • A user is assigned a “home” zone • User identity is Home-zone:Domain:User-ID • All authentications are done by the “home” zone
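
The identity format above makes routing an authentication request straightforward: the first field names the zone that must validate the user. The sketch below assumes the three fields are separated by colons exactly as written on the slide; the function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class ZoneUser(NamedTuple):
    zone: str
    domain: str
    user: str


def parse_identity(identity: str) -> ZoneUser:
    """Split 'Home-zone:Domain:User-ID' into its three fields."""
    parts = identity.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Home-zone:Domain:User-ID, got {identity!r}")
    return ZoneUser(*parts)


def authenticating_zone(identity: str, contacted_zone: str) -> Tuple[str, bool]:
    """Return the zone that validates the user and whether forwarding is needed."""
    home = parse_identity(identity).zone
    return home, home != contacted_zone
```

For example, a user `Z1:D1:U1` who connects to zone `Z2` is still validated by `Z1`, so `Z2` has to forward the challenge response.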

  32. User Authentication (diagram) • Labels: zones Z1 and Z2; user identity Z1:D1:U1; connect; challenge/response; validation of challenge response

  33. User Authentication (diagram) • Labels: zones Z1, Z2 and Z3; user identity Z1:D1:U1; connect; challenge/response; forward; validation of challenge response

  34. Deep Archive (diagram) • Labels: firewall; deep archive; staging zone; remote zone; zones Z1, Z2 and Z3; user identities Z2:D2:U2 and Z3:D3:U3; PVN; server initiated I/O; pull (twice); register (twice); no access by remote zones

  35. Types of Risk • Media failure • Replicate data onto multiple media • Vendor specific systemic errors • Replicate data onto multiple vendor products • Operational error • Replicate data onto a second administrative domain • Natural disaster • Replicate data to a geographically remote site • Malicious user • Replicate data to a deep archive • Bit flips because of finite bit-error rate
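
The last risk, bit flips from a finite bit-error rate, can be made concrete with simple arithmetic: the expected number of flipped bits is the number of bits handled times the error rate. The bit-error rate used in the usage note is a hypothetical value chosen only for illustration, and the Poisson approximation for "at least one error" is likewise an assumption.

```python
import math


def expected_bit_errors(size_bytes: float, bit_error_rate: float) -> float:
    """Expected number of flipped bits when size_bytes are read or moved once."""
    return size_bytes * 8 * bit_error_rate


def prob_at_least_one_error(size_bytes: float, bit_error_rate: float) -> float:
    """Chance of one or more flips, treating errors as independent (Poisson)."""
    return 1.0 - math.exp(-expected_bit_errors(size_bytes, bit_error_rate))
```

With an assumed rate of one error per 10^14 bits, handling 100 TB (8 x 10^14 bits) gives an expectation of about 8 flipped bits, which is why checksums and replicas are needed even when no component visibly fails.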

  36. Commands Using Checksum • Registering checksum values into MCAT at the time of upload • Sput -k - compute checksum of local source file and register with MCAT • Sput -K - checksum verification mode • After upload, compute checksum by reading back uploaded file • Compare with the checksum generated locally • Existing SRB files • Schksum - compute and register checksum if one does not already exist • Srsync - if the checksum does not exist
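
The three checksum behaviors described on this slide (register at upload, verify by reading back, and compute for existing files) can be sketched against in-memory stand-ins for the catalog and the storage system. This does not reproduce the Scommands themselves: the function names are hypothetical, and SHA-256 is an assumed algorithm, since the slide does not say which checksum is used.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Assumed checksum algorithm: SHA-256, hex encoded."""
    return hashlib.sha256(data).hexdigest()


def put_and_register(catalog: dict, storage: dict, name: str, data: bytes) -> None:
    """Checksum the local source and register it at upload time."""
    catalog[name] = checksum(data)
    storage[name] = data


def put_and_verify(catalog: dict, storage: dict, name: str, data: bytes) -> None:
    """Upload, read the stored copy back, and compare with the local checksum."""
    local = checksum(data)
    storage[name] = data
    if checksum(storage[name]) != local:
        raise OSError(f"checksum mismatch after upload of {name}")
    catalog[name] = local


def check_existing(catalog: dict, storage: dict, name: str) -> bool:
    """Register a checksum if none exists, otherwise verify the stored file."""
    actual = checksum(storage[name])
    if name not in catalog:
        catalog[name] = actual
        return True
    return catalog[name] == actual
```

Running `check_existing` periodically over a collection is one way to detect silent corruption, since a changed byte in storage no longer matches the registered value.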

  37. How Many Replicas • Three sites minimize risk • Primary site • Supports interactive user access to data • Secondary site • Supports interactive user access when first site is down • Provides 2nd media copy, located at a remote site, uses different vendor product, independent administrative procedures • Deep archive • Provides 3rd media copy, staging environment for data ingestion, no user access

  38. NARA Persistent Archive: Federation of Three Independent Data Grids • Demonstrate preservation environment • Authenticity • Integrity • Management of technology evolution • Mitigation of risk of data loss • Replication of data • Federation of catalogs • Management of preservation metadata • Scalability • Types of data collections • Size of data collections • Sites, each with its own MCAT: NARA, U Md, SDSC • Original data at NARA, data replicated to U Md • Replicated copy at U Md for improved access, load balancing and disaster recovery • Deep archive at SDSC, no user access, data pulled from NARA

  39. (Diagram of zone federation models) • Model labels: Free Floating, Occasional Interchange, Replicated Data, Resource Interaction, User and Data Replica, Nomadic, Replicated Catalog, Snow Flake, Master Slave, Archival, Hierarchical Zones, Peer-to-Peer Zones, Replication Zones • Attribute labels: Partial User-ID Sharing, Complete User-ID Sharing, One Shared User-ID, Partial Resource Sharing, Complete Resource Sharing, No Resource Sharing, No Metadata Synch, Partial Metadata Synch, Complete Metadata Synch, Complete Synch, System Set Access Controls, System Controlled, System Managed Replication, Connection From Any Zone, Hierarchical Zone Organization, Super Administrator Zone Control

  40. For more Information on Storage Resource Broker Data Grid Reagan W. Moore moore@sdsc.edu http://www.sdsc.edu/srb/
