This paper presents a MapReduce workflow system designed to support the integration and orchestration of scientific data-intensive applications. By leveraging data parallelism, the system facilitates the design and execution of complex workflows that incorporate various components for data acquisition, transformation, and analysis. It addresses key challenges such as job scheduling, data modeling, and support for diverse formats like HDF and NetCDF. A use case involving climate satellite data illustrates the system's efficiency in processing large datasets while optimizing performance through automated scheduling in Hadoop or Sector/Sphere frameworks.
A MapReduce Workflow System for Architecting Scientific Data Intensive Applications
By Phuong Nguyen and Milton Halem (phuong3 or halem @umbc.edu)
Introduction
• Scientific workflow systems
  • Support the integration of components for data acquisition, transformation, and analysis to build complex data-analysis applications
  • Computation is often used to derive new data from existing data
  • Many scientific workflow systems are data-driven or even dataflow-oriented: computational steps are represented as nodes, and the data passed between these tasks is made explicit as edges between them
• Advantages of scientific workflows compared to script-based approaches
  • Component model
  • Data source discovery
  • Provenance framework
  • Parallelism
• Limitations
  • Little support for data modeling; NetCDF and XML support is limited in Kepler and Triana
  • Scientific workflows should tolerate certain changes in the structure of their input data and be easy to modify
  • Parallelism
  • Optimization is not performed automatically
The MapReduce workflow system
• Exploits data parallelism through the MapReduce programming model
• We propose a workflow system for integrating, structuring, and orchestrating MapReduce jobs in scientific data intensive workflows. The system consists of:
  • Workflow design tools: a simple workflow design C++ API; an XML design tool will also be supported
  • Automatic optimization: a job scheduler and a runtime support system for the Hadoop or Sector/Sphere frameworks
  • Support for data modeling: data parallel processing that scales out linearly with HBase data storage and management (random access, real-time read/write access for data intensive applications), plus data transfer tools (for HDF, NetCDF); see the sketch after this list
• Use case: a climate satellite data intensive processing and analysis application
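The HBase-backed data modeling layer can be pictured with a minimal client sketch. It writes and then reads one gridded value through the HTable/Put/Get client API of that era; the table name airs_grid, the column family data, and the row-key layout (day plus 100 km grid cell) are hypothetical illustrations, not the schema used in the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GridStoreSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table: one row per grid cell and day, one column per variable.
            HTable table = new HTable(conf, "airs_grid");

            String rowKey = "2005001-lat40-lon100";   // day-of-year plus 100 km grid cell (assumed layout)
            Put put = new Put(Bytes.toBytes(rowKey));
            put.add(Bytes.toBytes("data"), Bytes.toBytes("brightness_temp"),
                    Bytes.toBytes(Double.toString(251.7)));
            table.put(put);                            // random, real-time write

            Result result = table.get(new Get(Bytes.toBytes(rowKey)));   // random read
            byte[] cell = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("brightness_temp"));
            System.out.println(rowKey + " -> " + Bytes.toString(cell));
            table.close();
        }
    }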
The proposed workflow system w = <J, D, E>
• w is the workflow; J is a set of jobs, the vertices of a DAG; E is a set of directed edges; D is a set of datasets. The input dataset I and the output dataset O of the workflow w are special types of D.
• A directed edge J1 => J2 states that job J1 produces a dataset d and job J2 consumes it as input.
• Let t_j be the estimated execution time of a job j, estimated from the previous execution of the job on a unit of its input dataset.
• For a given job j, the estimated remaining time is then derived from t_j and the amount of input data j has not yet processed (a sketch of this estimate follows).
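A minimal Java sketch of the w = <J, D, E> model and the timing estimate above. The class and field names are illustrative, and the remaining-time rule (t_j multiplied by the unprocessed input size) is an assumption that follows from t_j being a per-unit estimate, not necessarily the paper's exact formula.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative DAG model of a workflow w = <J, D, E>.
    class Dataset {
        String name;
        double sizeUnits;      // input size measured in the same "unit input" used to calibrate jobs
        Dataset(String name, double sizeUnits) { this.name = name; this.sizeUnits = sizeUnits; }
    }

    class Job {
        String name;
        double timePerUnit;    // t_j: estimated execution time per unit of input,
                               // calibrated from a previous run of the job
        List<Dataset> inputs = new ArrayList<>();
        List<Job> successors = new ArrayList<>();   // directed edges J1 => J2
        Job(String name, double timePerUnit) { this.name = name; this.timePerUnit = timePerUnit; }

        // Assumed remaining-time estimate: t_j times the amount of input not yet processed.
        double estimatedRemainingTime(double unprocessedUnits) {
            return timePerUnit * unprocessedUnits;
        }
    }

    public class WorkflowModelSketch {
        public static void main(String[] args) {
            Dataset level1b = new Dataset("MODIS_L1B_day", 48.0);   // e.g. 48 GB treated as 48 units
            Job gridding = new Job("gridding", 2.5);                // hypothetical: 2.5 min per unit
            Job averaging = new Job("monthly_average", 0.4);
            gridding.inputs.add(level1b);
            gridding.successors.add(averaging);                     // gridding => averaging edge

            System.out.printf("Remaining time for %s: %.1f min%n",
                    gridding.name, gridding.estimatedRemainingTime(level1b.sizeUnits));
        }
    }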
The MapReduce constructs and examples
• MapReduce (a hedged gridding example in these terms follows below)
  • map(k1, v1) -> list(k2, v2)
  • reduce(k2, list(v2)) -> list(v3)
Figure 1. An example of the climate data processing and analysis workflow in steps.
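To make the map and reduce signatures concrete, here is a hedged Hadoop sketch of a gridding step: the mapper keys each observation by a roughly 100 km (one-degree) grid cell and the reducer averages the values per cell. The comma-separated input format and the class names GriddingMapper/GriddingReducer are assumptions for illustration, not the authors' code.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map(k1, v1) -> list(k2, v2): emit (gridCell, observation)
    class GriddingMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed input record: "lat,lon,value"
            String[] fields = line.toString().split(",");
            double lat = Double.parseDouble(fields[0]);
            double lon = Double.parseDouble(fields[1]);
            double value = Double.parseDouble(fields[2]);
            // Roughly 1 degree ~ 100 km: bin observations to an integer-degree cell.
            String cell = (int) Math.floor(lat) + ":" + (int) Math.floor(lon);
            context.write(new Text(cell), new DoubleWritable(value));
        }
    }

    // reduce(k2, list(v2)) -> list(v3): average all observations that fall in a cell
    class GriddingReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text cell, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            long count = 0;
            for (DoubleWritable v : values) {
                sum += v.get();
                count++;
            }
            context.write(cell, new DoubleWritable(sum / count));
        }
    }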
Use case: a climate data intensive application
• AIRS and MODIS Level 1B data over 10 years ~ petabytes
• AIRS and MODIS gridded datasets at 100 km ~ terabytes
Figure 2. Use case: a scientific climate application.
Figure 3. Left: performance of gridding 1 day of MODIS data (48 GB) on bluegrit, the PPC cluster (524 MB RAM per node). Right: speedup comparison between sequential data processing and data processing using the Sector/Sphere system on 48 GB of MODIS data on the bluegrit Intel cluster (32 GB RAM per node).
Figure 4. Left: performance of Hadoop MapReduce averaging of the AIRS gridded dataset stored in HBase tables on 9 Intel nodes. Right: Hadoop gridding with different input dataset sizes vs. non-Hadoop gridding (text format).
Summary
• We propose a workflow system for scientific applications focused on:
  • Exploiting data parallelism through the MapReduce programming model
  • Automatic optimization: scheduling MapReduce jobs to Hadoop or Sector/Sphere clusters
  • Support for data modeling: data storage and management through the Hadoop database (HBase) and data transfer tools (HDF, NetCDF)