
PDS Data Movement and Storage Planning (PMWG)



  1. PDS Data Movement and Storage Planning (PMWG) PDS MC F2F UCLA Dan Crichton November 28-29, 2012

  2. Growth of Planetary Data Archived from U.S. Solar System Research Yes, size matters, but so does complexity…

  3. Big Data Challenges • Storage • Computation • Movement of Data • Heterogeneity • Distribution …can affect how we generate, manage, and analyze science data. …commodity computing can help, if architected correctly

  4. Big Data Technologies

  5. Architecting PDS: Towards a Decoupled Architecture [Architecture diagram; elements: Data Providers, Ingest, Transform, Core PDS / PDS Data Management, Storage, Data Movement, Computation, Distribution, Heterogeneous Data, Users] Goals: improve efficiency and support to deliver high quality science products to PDS; preserve and ensure the stability and integrity of PDS data; improve user support and usability of the data in the archive

  6. Big Data Challenges • Storage • Computation • Movement of Data • Heterogeneity • Distribution …can affect how we generate, manage, and analyze science data.

  7. Storage Eye Chart • Direct Attached Storage (DAS) • DAS-based storage (usually disk or tape) is directly attached to an internal server (point-to-point). • Network Attached Storage (NAS) • A NAS unit or “appliance” is a dedicated storage server connected to an Ethernet network that provides file-based data storage services to other devices on the network. NAS units remove the responsibility of file serving from other servers on the network. • Storage Area Network (SAN) • A SAN is an architecture to connect detached storage devices, such as disk arrays, tape libraries, and optical jukeboxes, to servers in a way that the devices appear as local resources. • Redundant Array of Inexpensive Disks (RAID) • The concept of RAID is to combine multiple inexpensive disk drives into an array of disk drives that (usually) performs better than a single disk drive. The RAID array appears as a single drive to the connected server. RAID technology is typically employed in a DAS, NAS, or SAN solution. • Cloud Storage • Cloud storage involves storage capacity that is accessed through the Internet or a wide area network (WAN); storage is usually purchased on an as-needed basis, and users can expand capacity on the fly. Providers operate a highly scalable storage infrastructure, often in physically dispersed locations. • Solid State Drive Storage • Solid state drive (SSD) technology is evolving to a point where SSDs can, in some cases, start to supplant traditional storage. SSDs that use DRAM-based technology (volatile memory) cannot survive a power loss, but flash-based SSDs (non-volatile), although slower than DRAM-based SSDs, do not require a battery backup and are therefore acceptable in the enterprise. It has recently been announced that 1 TB SSDs are available for industrial applications such as military and medical uses. SSD technology is rapidly evolving and in the near future will be a major contender in the storage arena.

  8. Storage Architectural Concepts • Decentralized • In-house storage, locally attached • Resources managed (procured, backed up, maintained, replenished) locally • Centralized • Common storage at a central, remote site • But not necessarily a separation of data, catalog, and services • Cloud • Storage as a virtual cloud infrastructure provisioned over the Internet • Resources managed by a third party / organization

  9. Cloud Deployment Models • Public Cloud: • Cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services (e.g., Amazon, RackSpace, Nirvanix) • Applications are typically “multi-tenant” and physical infrastructure is shared • Private Cloud: • Cloud infrastructure is operated solely for an organization. It may be managed by the organization or on its behalf by a third party, and may exist on premises or at a provider’s site in a hosting center. It may use cloud software (e.g., Eucalyptus) • Hybrid Cloud: • The organization provides and manages some resources in-house and has others provided externally • Possible to leverage existing and future technologies at minimal cost (e.g., backup/archive data managed externally, operational data managed internally) Photo credit: AcuteSys

  10. Many Benefits of Cloud Computing • Broad network access: accessible from anywhere • Resource pooling: shared pool of configurable computing resources; reliability through replicas, etc. • Rapid elasticity: scale storage and services/cores when needed • Measured service: utility computing, pay by the drink, rapidly provisioned

  11. Challenges of Cloud Storage • Data Integrity • Ownership (local control, etc) • Security • ITAR • Data movement to/from cloud • Procurement • Cost arrangements

  12. The Planetary Cloud Experiment • Utility to PDS • How does it fit the PDS4 architecture? • APIs • Decoupled storage and services • Data movement challenges? • Cloud storage tested as a secondary storage option • iRODS @ SDSC, Amazon (S3), Nirvanix IEEE Pro, Sept/Oct 2010
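For illustration, a minimal sketch of the kind of decoupled "secondary copy" upload the experiment implies, using Amazon S3 via boto3; the bucket name, key layout, and checksum metadata are assumptions for illustration, not the PDS implementation:

```python
# Hypothetical sketch: push one PDS product to S3 as a secondary copy.
# Bucket name, key layout, and metadata fields are assumptions for illustration.
import hashlib
import boto3

def backup_product_to_s3(local_path, bucket="pds-secondary-copy", prefix="archive/"):
    """Upload one product file to S3 and record an MD5 for later integrity checks."""
    md5 = hashlib.md5(open(local_path, "rb").read()).hexdigest()
    key = prefix + local_path.split("/")[-1]
    boto3.client("s3").upload_file(
        local_path, bucket, key,
        ExtraArgs={"Metadata": {"md5": md5}},  # keep a checksum alongside the object
    )
    return key

# Usage (placeholder file name):
# backup_product_to_s3("mer_pancam_0123.img")
```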

  13. Results of Study • Moving massive amounts of data “online” is a limiting factor… more to come • Varying cost scenarios • (target < $500/TB/year) • Proprietary APIs (but some open source cloud implementations gaining steam) • But entirely feasible as a decoupled ”storage service” in PDS4 • A low-risk option is to explore the cloud as an operational, secondary copy and access point for planetary data [Providers evaluated: Nirvanix, iRODS @ SDSC, Amazon]

  14. Benchmarking (2009)

  15. MER Planning on the Cloud * Credit: Khawaja Shams

  16. Daily MER Planning: Backup to the Cloud* [Diagram: daily Mars data → archive, compression, encryption (in memory) → parallel uploads to S3 (5x); Polyphony schedules backups for each of the last 5 days] * Credit: Khawaja Shams, George Chang
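A hedged sketch of the flow on this slide, assuming boto3 for S3 access; Polyphony is JPL's internal workflow system, so a simple thread pool stands in for its scheduling, and all names are placeholders:

```python
# Illustrative sketch: each day's Mars planning data is archived and compressed
# in memory (encryption is elided here), then the last five days' bundles are
# uploaded to S3 in parallel. A thread pool stands in for Polyphony's scheduling.
import io
import tarfile
from concurrent.futures import ThreadPoolExecutor
import boto3

BUCKET = "mer-planning-backups"   # hypothetical bucket name

def package_in_memory(paths):
    """Tar + gzip a day's files entirely in memory, as the slide describes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for p in paths:
            tar.add(p)
    buf.seek(0)
    return buf

def backup_day(day, paths):
    body = package_in_memory(paths)
    boto3.client("s3").upload_fileobj(body, BUCKET, f"backup/{day}.tar.gz")

# "Backups for each of the last 5 days", uploaded in parallel:
daily_plans = {"2012-11-25": ["sol3150_plan.xml"]}   # placeholder {day: [files]}
with ThreadPoolExecutor(max_workers=5) as pool:
    for day, paths in daily_plans.items():
        pool.submit(backup_day, day, paths)
```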

  17. MER Planning: Data Integrity on the Cloud [Diagram: backups are downloaded back from S3; if a downloaded backup does not match the local data, Polyphony immediately schedules another backup of the inconsistent data]
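A companion sketch of the integrity check, under the same assumptions (boto3, placeholder bucket and keys); the re-queue hook is hypothetical and stands in for Polyphony scheduling another backup:

```python
# Sketch of the integrity loop on the slide: pull a backup back from S3,
# compare it against the local data, and re-queue a backup if they disagree.
import hashlib
import boto3

def md5_of(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def verify_backup(day, local_tarball, bucket="mer-planning-backups"):
    obj = boto3.client("s3").get_object(Bucket=bucket, Key=f"backup/{day}.tar.gz")
    remote_md5 = md5_of(obj["Body"].read())
    local_md5 = md5_of(open(local_tarball, "rb").read())
    if remote_md5 != local_md5:
        # Hypothetical hook standing in for "Polyphony immediately schedules
        # another backup of inconsistent data".
        schedule_backup(day)
        return False
    return True
```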

  18. Big Data Challenges • Storage • Computation • Movement of Data • Heterogeneity • Distribution …can affect how we generate, manage, and analyze science data.

  19. Cloud Computing and Computation • On-demand computation (scaling to massive numbers of cores) • Amazon EC2, one of the most popular • Commoditizing super-computing • Again, architecting systems to decouple “processing” and “computation” so they can be executed on the cloud is key… two examples • LMMP example (to come) • Airborne data processing (to come) • Coupled with computational frameworks (e.g., Apache Hadoop) • Open source implementation of Map-Reduce
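As a sketch of "on-demand computation," requesting and later releasing a batch of EC2 worker instances might look like the following (boto3 is used purely for illustration; the AMI ID, instance type, and fleet size are placeholders):

```python
# Minimal sketch of on-demand computation: launch a fleet of EC2 workers,
# run a job, then release the capacity. AMI ID and instance type are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",        # placeholder: a Hadoop-ready image would go here
    InstanceType="m1.large",       # placeholder instance type
    MinCount=20, MaxCount=20,      # e.g., a 20-node processing fleet
)
instance_ids = [i["InstanceId"] for i in response["Instances"]]
print("Launched:", instance_ids)

# ...run the Hadoop / Map-Reduce job across the fleet, then release the capacity:
ec2.terminate_instances(InstanceIds=instance_ids)
```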

  20. Lunar Mapping and Modeling Project: Big Data Challenges* • The image files LMMP manages range from a few gigabytes to hundreds of gigabytes in size with new data arriving every day • Lunar surface images are too large to efficiently load and manipulate in memory • LMMP must make the data readily available in a timely manner for users to view and analyze • LMMP needs to accommodate large numbers of users with minimal latency * Credit: Emily Law, George Chang

  21. Cloud Computing Solutions with Map-Reduce • Slice a large image into many small images, then merge and resize until the final merge-and-reduce step yields a reasonably sized image that depicts the entire scene • Amazon EC2 for computing; S3 for storage • Installed the Hadoop framework on a number of EC2 instances • Used a distributed approach with Elastic Map-Reduce in Hadoop to tile images • Developed a hybrid solution (multi-tiered data access approach) to serve images to users from cloud storage
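The LMMP pipeline itself ran as Hadoop/Elastic Map-Reduce jobs on EC2; the following local Python sketch only illustrates the same decomposition, with a "map" phase that cuts tiles in parallel and a "reduce" phase that merges and resizes them (file names and tile size are placeholders):

```python
# Local stand-in for the Map-Reduce tiling approach described on the slide:
# "map" slices the source image into tiles in parallel; "reduce" pastes the
# tiles back together and shrinks the mosaic into a browse-sized image.
from multiprocessing import Pool
from PIL import Image

TILE = 512  # tile edge in pixels (illustrative)

def make_tile(args):
    """Map step: cut one tile out of the source image and save it."""
    path, x, y = args
    tile = Image.open(path).crop((x, y, x + TILE, y + TILE))
    name = f"tile_{x}_{y}.png"
    tile.save(name)
    return name, x, y

def merge_tiles(size, tiles, scale=8):
    """Reduce step: paste tiles into a mosaic and resize it to a browse image."""
    mosaic = Image.new("RGB", size)
    for name, x, y in tiles:
        mosaic.paste(Image.open(name), (x, y))
    return mosaic.resize((size[0] // scale, size[1] // scale))

if __name__ == "__main__":
    src = "lunar_mosaic.tif"                        # placeholder input file
    w, h = Image.open(src).size
    jobs = [(src, x, y) for x in range(0, w, TILE) for y in range(0, h, TILE)]
    with Pool() as pool:
        tiles = pool.map(make_tile, jobs)           # map phase
    merge_tiles((w, h), tiles).save("browse.png")   # reduce phase
```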

  22. LMMP Tiling Test Results (Cloud vs. Local) • Configuration 1 • 2x Sun Fire 4170 • Gigabit network interconnects • 72 GB RAM • 64 GB SSD storage • $10K each, plus administration and infrastructure costs • Configuration 2 • 20 EC2 Large instances (4 Compute Units ~ 4x 1 GHz Xeon) • 7.5 GB RAM • 850 GB storage • $0.34/instance/hour • Configuration 3 • 4 EC2 Cluster Compute (CC) instances (33.5 Compute Units) • Gigabit interconnects • 23 GB RAM • 1.69 TB storage • $1.60/instance/hour
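A back-of-the-envelope comparison using only the figures on this slide (and ignoring the administration and infrastructure costs noted for Configuration 1):

```python
# Rough break-even estimate between buying the local hardware and renting EC2,
# using only the per-unit costs listed on the slide.
local_hw = 2 * 10_000            # two Sun Fire 4170s at $10K each
cfg2_per_hour = 20 * 0.34        # 20 EC2 Large instances -> $6.80/hour
cfg3_per_hour = 4 * 1.60         # 4 EC2 Cluster Compute instances -> $6.40/hour

for name, rate in [("20x Large", cfg2_per_hour), ("4x CC", cfg3_per_hour)]:
    hours = local_hw / rate
    print(f"{name}: ${rate:.2f}/hr, break-even with local hardware after "
          f"~{hours:,.0f} hours (~{hours / 24 / 30:.1f} months of continuous use)")
```

By this rough measure, renting either EC2 configuration costs less than the local hardware unless the cluster runs continuously for roughly four months, before counting the administration costs the slide notes on top of the $10K servers.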

  23. Cloud Computing: Addressing Challenges • Cloud has shown very promising results, but there are challenges • Proprietary APIs • Support for ITAR-sensitive data • Data transfer rates to the commercial cloud • Firewall issues • Procurement • Costs for long-term storage • More work ahead • Amazon EC2/S3 reported an “ITAR Region” is available • Continued benchmarking and optimization has demonstrated increased data transfer rates, particularly using Internet2 • JPL is developing a “Virtual Private Cloud” connection to Amazon, allowing EC2 nodes to appear inside the JPL firewall • Improved procurement process to allow JPL projects to use AWS

  24. Big Data Challenges • Storage • Computation • Movement of Data • Heterogeneity • Distribution …can affect how we generate, manage, and analyze science data.

  25. The Planetary Data Movement Experiment • Online data movement has been a limiting factor for embracing big data technologies • Conducted in 2006*, 2009, and 2012 • Evaluate trade-offs for moving data • to PDS • between Nodes • to NSSDC/deep archive • to the Cloud * C. Mattmann, S. Kelly, D. Crichton, J. S. Hughes, S. Hardman, R. Joyner, and P. Ramirez. A Classification and Evaluation of Data Movement Technologies for the Delivery of Highly Voluminous Scientific Data Products. In Proceedings of the NASA/IEEE Conference on Mass Storage Systems and Technologies (MSST2006), pp. 131-135, College Park, Maryland, May 15-18, 2006

  26. Data Xfer Technologies Evaluated • FTP uses a single connection for transferring files; in general it is ubiquitous and, where possible, the simplest way for PDS to transfer data electronically • bbFTP uses multiple threads/connections to improve data transfer. It works well as long as the number of connections is kept to a reasonable limit • GridFTP uses multiple threads/connections. It is part of the Globus project and is used by the climate research community to move models. In general, tests have shown that it is more difficult to set up due to the security infrastructure, etc. • iRODS uses multiple threads/connections to improve data transfer. It works well as long as the number of connections is kept to a reasonable limit • FDT uses multiple threads/connections to improve data transfer. It works well as long as the number of connections is kept to a reasonable limit

  27. Some of our Findings: Data Movement over WAN using TCP/IP • Transfer speeds among the nodes differ greatly; however, the fundamental findings about how best to transfer data for each scenario are consistent • Parallel transfer mechanisms show improvement over conventional transfer mechanisms (FTP, socket-to-socket) for files larger than ~10 MB • Packaging/bundling small files helps achieve significantly better transfer performance with parallel data transfer • Reliability has improved over the past five years in many of the products we have tested • However, UDP approaches have suffered, largely because more aggressive network infrastructure treats this traffic as distributed denial-of-service (DDoS) attacks [Chart: transfer rate (Y axis) versus file size (X axis); GridFTP: blue, bbFTP: red, FTP: green]
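A small sketch of the "bundle small files" finding: timing many individual FTP uploads against a single tarball of the same files (host, credentials, and paths are placeholders):

```python
# Compare per-file FTP uploads with one bundled tarball of the same files.
# Host, credentials, and directory names are placeholders for illustration.
import tarfile, time
from ftplib import FTP
from pathlib import Path

def upload(ftp, path):
    with open(path, "rb") as f:
        ftp.storbinary(f"STOR {Path(path).name}", f)

def timed(label, fn):
    t0 = time.time()
    fn()
    print(f"{label}: {time.time() - t0:.1f} s")

small_files = sorted(Path("products/").glob("*.img"))   # many small products

ftp = FTP("pds-node.example.org")        # placeholder host
ftp.login("user", "password")            # placeholder credentials

# One STOR per file: per-file protocol overhead dominates for small files.
timed("individual files", lambda: [upload(ftp, p) for p in small_files])

# Bundle first, then one large transfer: typically much closer to line rate.
with tarfile.open("bundle.tar", "w") as tar:
    for p in small_files:
        tar.add(p)
timed("bundled tarball", lambda: upload(ftp, "bundle.tar"))
ftp.quit()
```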

  28. Data Movement Recommendations (2010)

  29. Pilot with DNs (Big Data) • iRODS has proven to be the most promising for data transfer • Setting up an iRODS infrastructure for data movement with 3 zones: GEO, USGS, JPL/IMG as a pilot • Run alongside other mechanisms • Expand to other nodes if this proves successful
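For illustration, moving a product between federated iRODS zones might look like the sketch below, written against the python-irodsclient library rather than the icommands the pilot would actually run; hostnames, credentials, and zone paths are assumptions:

```python
# Illustrative sketch of moving a product between federated iRODS zones with
# python-irodsclient. Hostnames, credentials, and zone paths are placeholders;
# the pilot infrastructure itself would use iRODS servers and icommands.
from irods.session import iRODSSession

local_file = "mosaic_eqc_16ppd.img"      # placeholder product

# Register the product into the JPL/IMG zone...
with iRODSSession(host="irods.img.example.nasa.gov", port=1247,
                  user="pds", password="secret", zone="imgZone") as session:
    session.data_objects.put(local_file, "/imgZone/home/pds/" + local_file)

# ...and a Discipline Node in another zone (e.g., GEO) can pull it by
# addressing the remote zone's logical path through the federation.
with iRODSSession(host="irods.geo.example.edu", port=1247,
                  user="pds", password="secret", zone="geoZone") as session:
    session.data_objects.get("/imgZone/home/pds/" + local_file,
                             "/data/staging/" + local_file)
```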

  30. Benchmarks

  31. Benchmarks (2)

  32. Recommendations • Data Movement • PMWG will update its current data movement recommendations based on these results • Run the current data movement deployment in parallel with FTP and other mechanisms as a pilot • Consider adding another “zone” at NSSDC for electronic data transfers • Capture updated benchmarks for Flagstaff after the network upgrade • Other DNs need to worry about this only when they hit the larger thresholds • Data Storage • We now have quite a bit of experience with cloud computing, etc., on which to comment • Focus on requirements for data storage (e.g., a storage service) as other development activities are under control • Computation • The new PDS4 architecture allows us to run computationally intensive services in many different topologies. Explore as needed.
