Explore MATE-EC2 middleware for data-intensive computing on AWS, optimizing resource allocation, accelerating data mining computations, and managing scientific datasets efficiently.
Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds • Gagan Agrawal, Ohio State University • (Joint work with Tekin Bicer, David Chiu, Yu Su, ..)
Motivation • Cloud Resources • Pay-as-you-go • Elasticity • Black boxes from a performance viewpoint • Scientific Data • Specialized formats, like NetCDF, HDF5, etc. • Very Large Scale
Ongoing Work at Ohio State • MATE-EC2: Middleware for Data-Intensive Computing on EC2 • Alternative to Amazon Elastic MapReduce • Data Management Solutions for Scientific Datasets • Target NetCDF and HDF5 • Accelerating Data Mining Computations Using Accelerators • Resource Allocation Problems on Clouds
MATE-EC2: Motivation • MATE – MapReduce with an Alternate API • MATE-EC2: Implementation for AWS Environments • Cloud resources are black boxes • Need for services and tools that can… • get the most out of cloud resources • help their users with easy APIs
MATE vs. Map-Reduce Processing Structure • Reduction Object represents the intermediate state of the execution • Reduction function is commutative and associative • Sorting and grouping overheads are eliminated by the reduction function/object (see the sketch below)
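To make the alternate API concrete, here is a minimal, hypothetical sketch of a generalized reduction in the MATE style, using KMeans (one of the benchmark applications) as the example. The names ReductionObject, reduce, and finalize, and the constants K and DIM, are assumptions for illustration, not MATE's actual interface. Each input point updates the reduction object directly, so no intermediate (key, value) pairs need to be sorted or grouped.

```cpp
// Hypothetical sketch of a MATE-style generalized reduction for KMeans.
#include <cfloat>

constexpr int K = 4;        // number of clusters (assumed)
constexpr int DIM = 3;      // point dimensionality (assumed)

struct ReductionObject {      // intermediate state of the execution
    double sum[K][DIM] = {};  // running coordinate sums per cluster
    long   count[K]    = {};  // points assigned to each cluster
};

// reduce(): commutative and associative update for one input point.
void reduce(ReductionObject& ro, const double point[DIM],
            const double centroids[K][DIM]) {
    int best = 0; double bestDist = DBL_MAX;
    for (int k = 0; k < K; ++k) {
        double d = 0;
        for (int j = 0; j < DIM; ++j) {
            double diff = point[j] - centroids[k][j];
            d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = k; }
    }
    for (int j = 0; j < DIM; ++j) ro.sum[best][j] += point[j];
    ro.count[best]++;
}

// finalize(): after merging per-thread objects, emit new centroids.
void finalize(const ReductionObject& ro, double centroids[K][DIM]) {
    for (int k = 0; k < K; ++k)
        if (ro.count[k] > 0)
            for (int j = 0; j < DIM; ++j)
                centroids[k][j] = ro.sum[k][j] / ro.count[k];
}
```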
MATE-EC2 Design • Data organization • Three levels: Buckets, Chunks and Units • Metadata information • Chunk Retrieval • Threaded Data Retrieval • Selective Job Assignment • Load Balancing and handling heterogeneity • Pooling mechanism
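A small sketch of how the three-level data organization and the metadata information might be represented; all field names here are assumptions, since the slides only name the levels (buckets, chunks, and units).

```cpp
// Assumed representation of MATE-EC2's three-level data organization:
// an S3 bucket holds data objects, each object is divided into chunks,
// and each chunk into processing units.
#include <cstddef>
#include <string>
#include <vector>

struct Unit      { std::size_t offset, length; };     // smallest work item
struct ChunkMeta { int id;                            // chunk within the object
                   std::size_t offset, size;
                   std::vector<Unit> units; };
struct ObjectMeta {
    std::string bucket;                 // S3 bucket holding the data object
    std::string key;                    // data object's key within the bucket
    std::vector<ChunkMeta> chunks;      // layout recorded in the metadata file
};
```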
MATE-EC2 Processing Flow • [Figure: the EC2 master node holds the metadata file and the job scheduler; S3 data objects are divided into chunks C0…Cn; each EC2 slave node runs retrieval threads (T) that feed the computing layer (C)] • Flow: the slave requests a job from the master node; a chunk (e.g., C0) is assigned as the job; the slave retrieves the chunk pieces and writes them into a buffer; the retrieved chunk is passed to the computing layer and processed; the slave then requests another job (e.g., C5) and retrieves the new chunk
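The flow above reduces to a simple job loop on the slave side. The sketch below shows its shape with stub functions; request_job_from_master, retrieve_chunk_pieces, and process_chunk are made-up names standing in for the real middleware primitives.

```cpp
// Minimal sketch of the slave-side job loop (all function names assumed).
#include <cstdio>

struct Chunk { int id; bool valid; };

// Stubs standing in for the real middleware (hypothetical).
Chunk request_job_from_master() { static int next = 0; return {next, next++ < 2}; }
void  retrieve_chunk_pieces(Chunk& c) { std::printf("retrieving chunk C%d\n", c.id); }
void  process_chunk(const Chunk& c)   { std::printf("processing chunk C%d\n", c.id); }

int main() {
    for (Chunk job = request_job_from_master(); job.valid;
         job = request_job_from_master()) {
        retrieve_chunk_pieces(job);  // retrieval threads write pieces into a buffer
        process_chunk(job);          // computing layer applies the reduction
    }
}
```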
Experiments • Goals: • Finding the most suitable settings for AWS • Performance of MATE-EC2 on heterogeneous and homogeneous environments • Performance comparison of MATE-EC2 and Map-Reduce • Applications: KMeans and PCA • Used Resources: • 4 Large EC2 instances for processing, 1 Large instance for the Master • 16 data objects on S3 (8.2GB total dataset for both applications)
Diff. Data Chunk Sizes • KMeans • 16 retrieval threads • Performance increase, 8MB chunks vs. other sizes: 1.13x to 1.30x • 1-thread vs. 16-thread versions: 1.24x to 1.81x
Diff. Number of Threads • 128MB chunk size • Performance increase in the figure (KMeans): 1.37x to 1.90x • Performance increase for PCA: 1.38x to 1.71x
Selective Job Assignment • Performance increase in the figure (KMeans): 1.01x to 1.14x • For PCA: 1.19x to 1.68x
Heterogeneous Env. • L: Large instances, S: Small instances • 128MB chunk size • Overheads in the figure (KMeans): under 1% • Overheads for PCA: 1.1% to 11.7%
MATE-EC2 vs. Map-Reduce • Scalability (MATE): 90% efficiency • Scalability (MR): 74% efficiency • Speedups, MATE vs. MR: 3.54x to 4.58x
MATE-EC2: Continuing Directions • Cloud Bursting • Cloud as a Complement or On-Demand Alternative to Local Resources • Autotuning for a New Cloud Environment • Data storage can be a black box • Data-Intensive Applications on Clusters of GPUs • Programming Model, System Design
Outline • MATE-EC2: Middleware for Data-Intensive Computing on EC2 • Alternative to Amazon Elastic MapReduce • Data Management Solutions for Scientific Datasets • Target NetCDF and HDF5 • Accelerating Data Mining Computations Using Accelerators • Resource Allocation Problems on Clouds
Data Management: Motivation • Datasets are becoming extremely large • Scientific datasets are in formats like NetCDF and HDF5 • Existing database solutions are not scalable • Can’t help with native data formats
Data Management: Use Scenarios • Data Dissemination Efforts • Support User-Defined Subsetting and Data Aggregation • Implementing Data Processing Applications • Higher-level API than NetCDF/HDF5 libraries • Visualization Tools (ParaView etc.) • Data format Conversion on Large Datasets
Initial Prototype: Data Subsetting with a Relational View on NetCDF • Flow: parse the SQL expression → consult the metadata for the NetCDF dataset → filter dimensions → generate data access code → partition tasks and assign to slave processes → execute the query → filter variable values
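To illustrate the later stages of this flow, the sketch below shows how a subsetting query such as SELECT temp FROM dataset WHERE lat BETWEEN 30 AND 40 might turn into generated NetCDF data-access code once the dimension filter has resolved the predicate to index bounds. The file name, variable names, dimension order, and index ranges are all assumptions for illustration; only the NetCDF C API calls themselves are real.

```cpp
// Assumed shape of generated data-access code for a subsetting query.
#include <netcdf.h>
#include <cstdio>
#include <vector>

int main() {
    int ncid, varid;
    if (nc_open("gcrm.nc", NC_NOWRITE, &ncid) != NC_NOERR) return 1;
    nc_inq_varid(ncid, "temp", &varid);

    // Dimension filter output: lat indices [120, 160) cover 30..40 degrees
    // (index bounds assumed for this example).
    size_t start[2] = {120, 0};      // {lat, lon} start indices
    size_t count[2] = {40, 360};     // extent of the hyperslab to read
    std::vector<float> buf(40 * 360);
    nc_get_vara_float(ncid, varid, start, count, buf.data());

    // A value-based filter would now drop values failing any value predicate.
    std::printf("read %zu values\n", buf.size());
    nc_close(ncid);
}
```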
Metadata descriptor • Dataset Storage Description • Lists the nodes and directories where the data resides • Dataset Layout Description • The header part of each NetCDF file • Naturally included in the NetCDF dataset, which saves the effort of generating the metadata • Describes the layout of each NetCDF file
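A rough sketch of the descriptor's two parts as data structures; all field names are assumed, since the layout part simply mirrors what each NetCDF file's header already provides.

```cpp
// Assumed representation of the metadata descriptor's two parts.
#include <string>
#include <vector>

struct StorageEntry   { std::string node, directory; };       // where files reside
struct VariableLayout { std::string name;                     // from the NetCDF header
                        std::vector<std::string> dims; };
struct MetadataDescriptor {
    std::vector<StorageEntry>   storage;   // dataset storage description
    std::vector<VariableLayout> layout;    // dataset layout description
};
```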
Pre-filter and Post-filter • Pre-filter: • Takes the parsed SQL expression and metadata as input • Filters based on the dimensions of a variable • Supports both direct dimensions and coordinate variables • Post-filter: • Filters based on variable values
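A minimal sketch of the post-filter stage, assuming the values have already been read by the generated access code; the threshold predicate is a made-up example of a value condition.

```cpp
// Assumed sketch of a value-based post-filter over retrieved data.
#include <algorithm>
#include <iterator>
#include <vector>

std::vector<float> post_filter(const std::vector<float>& values, float threshold) {
    std::vector<float> kept;
    std::copy_if(values.begin(), values.end(), std::back_inserter(kept),
                 [threshold](float v) { return v > threshold; });  // e.g., temp > 273.15
    return kept;
}
```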
Query Partition • Partitions the current query into several sub-queries and assigns each sub-query to a slave process • Two partition criteria: • Consider the contiguity of memory • Consider data aggregation (future)
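One plausible way to realize the first criterion is to split the query along its outermost (slowest-varying) dimension so each slave process reads a contiguous region of the file. The sketch below is an assumption about the approach, not the prototype's actual code.

```cpp
// Assumed contiguity-aware partitioning along the outermost dimension.
#include <cstddef>
#include <vector>

struct SubQuery { std::size_t start0, count0; };  // bounds on the outer dimension

std::vector<SubQuery> partition(std::size_t start0, std::size_t count0, int nprocs) {
    std::vector<SubQuery> subs;
    std::size_t base = count0 / nprocs, extra = count0 % nprocs, pos = start0;
    for (int p = 0; p < nprocs; ++p) {
        std::size_t cnt = base + (p < static_cast<int>(extra) ? 1 : 0);
        subs.push_back({pos, cnt});   // each slave reads a contiguous slab
        pos += cnt;
    }
    return subs;
}
```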
Experiment Setup • Application: • Global Cloud Resolving Model (GCRM) data • Environment: • The Glenn system at the Ohio Supercomputer Center
Scalability with different data sizes • 8 processes • Execution time scaled almost linearly within each query
Time improvement from using the pre-filter • 4 processes • SQL5 (queries only 1% of the data) • The pre-filter effectively decreases the query size and improves performance
Scalability with Increasing No. of Sources • 4GB dataset • SQL1 (full scan of the data table) • Execution time scaled almost linearly
Data Management: Continuing Work • Similar Prototype with HDF5 under Implementation • Consider processing, not just subsetting/aggregation • Map-Reduce like Processing for NetCDF/HDF5 datasets? • Consider Format Conversion for Existing Tools
Outline • MATE-EC2: Middleware for Data-Intensive Computing on EC2 • Alternative to Amazon Elastic MapReduce • Data Management Solutions for Scientific Datasets • Target NetCDF and HDF5 • Accelerating Data Mining Computations • Resource Allocation Problems on Clouds
System for Mapping to Heterogeneous Configurations • [Figure: the application developer supplies user input (simple C code with annotations); in the compilation phase, a code generator produces code against the multi-core middleware API and GPU code for CUDA; the run-time system performs worker thread creation and management, maps the computation to CPU and GPU, and handles dynamic work distribution]
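The dynamic work distribution component can be illustrated with a simple chunk-based scheme: CPU worker threads and a GPU-manager thread pull chunk indices from a shared atomic counter, so a faster device automatically takes more work. Everything here (the counter, chunk count, and worker stubs) is an assumed sketch, not the middleware's actual implementation.

```cpp
// Assumed sketch of dynamic work distribution across CPU and GPU workers.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> next_chunk{0};
constexpr int NUM_CHUNKS = 64;

void worker(const char* device) {
    for (int c; (c = next_chunk.fetch_add(1)) < NUM_CHUNKS; )
        std::printf("%s processes chunk %d\n", device, c);  // stub for the real kernel
}

int main() {
    std::vector<std::thread> pool;
    pool.emplace_back(worker, "GPU");        // would launch CUDA kernels
    for (int i = 0; i < 3; ++i)
        pool.emplace_back(worker, "CPU");    // multi-core middleware path
    for (auto& t : pool) t.join();
}
```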
Summary • Dataset sizes are increasing • Clouds add many challenges of their own • Data processing on clouds therefore poses many open problems