A MapReduce Workflow System for Architecting Scientific Data Intensive Applications By Phuong Nguyen and Milton Halem phuong3 or halem @umbc.edu
Introduction
• Scientific workflow systems
   • Support the integration of components for data acquisition, transformation, and analysis to build complex data-analysis pipelines
   • Computation is often used to derive new data from existing data
   • Many scientific workflow systems are data-driven: computational steps are represented as nodes, and the data flowing between these tasks is made explicit using edges between them
• Scientific workflow advantages compared to script-based approaches
   • Component model
   • Data source discovery
   • Provenance framework
   • Parallelism
• Limitations
   • Little support for data modeling; NetCDF and XML support is limited in Kepler and Triana
   • Scientific workflows should tolerate certain changes in the structure of their input data and be easy to modify
   • Parallelism
   • Optimization is not performed automatically
The MapReduce workflow system
• Exploiting data parallelism through the MapReduce programming model
• We propose a workflow system for integrating, structuring, and orchestrating MapReduce jobs for scientific data-intensive workflows. The system consists of
   • Workflow design tools: a simple C++ workflow design API; an XML design tool will be supported
   • Automatic optimization: a job scheduler and a runtime support system for the Hadoop and Sector/Sphere frameworks
   • Support for data modeling: data-parallel processing that scales out linearly with HBase data storage and management (random-access, real-time read/write access for data-intensive applications), plus data transfer tools (for HDF, NetCDF)
• Use case: a climate satellite data-intensive processing and analysis application
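The slides name the C++ workflow design API but do not show it. Purely as an illustration of the idea, a minimal sketch might look like the following; the class names (Workflow, MapReduceJob, Dataset) and methods (addJob, addDataset, addEdge) are assumptions, not the authors' interface.

// Hypothetical sketch of a C++ workflow design API of the kind described above;
// names are illustrative, not taken from the paper.
#include <string>
#include <vector>

struct Dataset {
    std::string name;      // e.g. an HBase table or an HDF/NetCDF file set
    double sizeGB;         // used later by the scheduler's time estimates
};

struct MapReduceJob {
    std::string name;        // e.g. "grid_modis", "monthly_average"
    std::string mapClass;    // mapper implementation run on Hadoop or Sphere
    std::string reduceClass; // reducer implementation
};

class Workflow {
public:
    int addJob(const MapReduceJob& j)  { jobs.push_back(j); return (int)jobs.size() - 1; }
    int addDataset(const Dataset& d)   { data.push_back(d); return (int)data.size() - 1; }
    // Directed edge: job `from` produces dataset `via`, job `to` consumes it.
    void addEdge(int from, int via, int to) { edges.push_back({from, via, to}); }

private:
    struct Edge { int producer, dataset, consumer; };
    std::vector<MapReduceJob> jobs;
    std::vector<Dataset>      data;
    std::vector<Edge>         edges;
};

int main() {
    Workflow w;
    int grid    = w.addJob({"grid_modis", "GridMap", "GridReduce"});
    int avg     = w.addJob({"monthly_average", "AvgMap", "AvgReduce"});
    int gridded = w.addDataset({"modis_gridded_100km", 48.0});
    w.addEdge(grid, gridded, avg);  // grid_modis -> gridded dataset -> monthly_average
}

Keeping datasets as explicit vertices between jobs mirrors the DAG model defined on the next slide, where edges carry the produced/consumed datasets.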
The proposed workflow system w = <J, D, E>
• w is the workflow; J is the set of jobs, the vertices of a DAG; E is the set of directed edges; D is the set of datasets. The input dataset I and output dataset O of the workflow w are special types of D.
• A directed edge J1 => J2 states that job J1 produces a dataset d and job J2 consumes it as an input.
• Let t_j be the estimated execution time of a job j per unit of input, estimated from the previous execution of the job on a unit of the input dataset.
• For a given job j with input dataset d_j, the estimated remaining time is r_j = t_j × |d_j|, where |d_j| is the size of the input not yet processed.
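As a concrete reading of the estimate above (the symbols t_j and |d_j| are a reconstruction, since the slide text lost its formula symbols, not notation quoted from the paper), a scheduler could compute a job's remaining time like this:

// Minimal sketch of the remaining-time estimate used for scheduling; the
// per-unit time and remaining-size interpretation is an assumption based on
// the slide text, not the authors' code.
#include <cstdio>

int main() {
    double timePerUnitGB = 2.5;   // t_j: seconds per GB, from a previous run of job j
    double remainingGB   = 48.0;  // |d_j|: size of the input still to be processed
    double remainingSec  = timePerUnitGB * remainingGB;  // r_j = t_j * |d_j|
    std::printf("estimated remaining time: %.0f s\n", remainingSec);
}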
The MapReduce constructs and examples
• MapReduce
   • map(k1, v1) -> list(k2, v2)
   • reduce(k2, list(v2)) -> list(v3)
Figure 1. An example of the climate data processing and analysis workflow in steps
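To make the map/reduce signatures above concrete for the climate workflow, here is a small stand-alone sketch of gridding: map assigns each level 1B observation to a roughly 100 km grid cell, and reduce averages the values per cell. The 1-degree cell size and all names are illustrative assumptions, not the authors' Hadoop or Sector/Sphere code.

// Sketch of the map/reduce constructs applied to gridding satellite observations:
// map emits (gridCell, radiance) pairs, reduce averages them per cell.
#include <cmath>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

struct Obs { double lat, lon, radiance; };  // one level 1B observation
using CellKey = std::pair<int, int>;        // k2: grid row, column

// map(k1, v1) -> list(k2, v2): assign an observation to a ~100 km grid cell.
std::vector<std::pair<CellKey, double>> mapObs(const Obs& o) {
    const double cellDeg = 1.0;  // ~100 km at the equator (assumption)
    CellKey cell{ (int)std::floor(o.lat / cellDeg),
                  (int)std::floor(o.lon / cellDeg) };
    return { { cell, o.radiance } };
}

// reduce(k2, list(v2)) -> list(v3): average all radiances falling in a cell.
double reduceCell(const std::vector<double>& values) {
    double sum = 0.0;
    for (double v : values) sum += v;
    return values.empty() ? 0.0 : sum / values.size();
}

int main() {
    std::vector<Obs> swath = { {10.2, 45.7, 301.0}, {10.4, 45.9, 299.0}, {11.6, 46.1, 295.0} };
    std::map<CellKey, std::vector<double>> shuffle;  // groups v2 by k2, as the framework would
    for (const Obs& o : swath)
        for (auto& kv : mapObs(o)) shuffle[kv.first].push_back(kv.second);
    for (auto& kv : shuffle)
        std::printf("cell (%d,%d): mean radiance %.1f\n",
                    kv.first.first, kv.first.second, reduceCell(kv.second));
}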
Use case: a climate data-intensive application
• AIRS and MODIS level 1B data over 10 years ~ petabytes
• AIRS and MODIS gridded datasets at 100 km ~ terabytes
Fig 2. Use case: scientific climate application
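The gridded dataset is stored in HBase tables (see Fig 4), but the slides do not give the table schema. A purely hypothetical row-key layout keyed by date and grid cell, so that both time-range scans and per-cell random reads stay cheap, could look like this:

// Hypothetical HBase-style row key for the gridded dataset: not the authors'
// schema, just an illustration of keying by date and 100 km grid cell.
#include <cstdio>
#include <string>

std::string rowKey(const std::string& yyyymmdd, int gridRow, int gridCol) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%s#%04d#%04d", yyyymmdd.c_str(), gridRow, gridCol);
    return std::string(buf);
}

int main() {
    // e.g. daily AIRS gridded radiances for cell (row 95, col 137) on 2003-07-01
    std::printf("%s\n", rowKey("20030701", 95, 137).c_str());  // prints "20030701#0095#0137"
}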
Fig 3. Left: performance of gridding 1 day of MODIS data (48 GB) on bluegrit, the PPC cluster with 524 MB RAM per node. Right: speedup comparison between sequential data processing and the Sector/Sphere system on 48 GB of MODIS data on the bluegrit Intel cluster with 32 GB RAM per node.
Fig 4. Left: performance of Hadoop MapReduce averaging of the AIRS gridded dataset stored in HBase tables on 9 Intel nodes. Right: Hadoop gridding with different input dataset sizes vs. non-Hadoop gridding (text format).
Summary
• We propose a workflow system for scientific applications focused on
   • Exploiting data parallelism through the MapReduce programming model
   • Automatic optimization: scheduling MapReduce jobs onto Hadoop or Sector/Sphere clusters
   • Support for data modeling: data storage and management through the Hadoop database (HBase) and data transfer tools (HDF, NetCDF)