This paper presents a MapReduce workflow system designed to support the integration and orchestration of scientific data-intensive applications. By leveraging data parallelism, the system facilitates the design and execution of complex workflows that incorporate various components for data acquisition, transformation, and analysis. It addresses key challenges such as job scheduling, data modeling, and support for diverse formats like HDF and NetCDF. A use case involving climate satellite data illustrates the system's efficiency in processing large datasets while optimizing performance through automated scheduling in Hadoop or Sector/Sphere frameworks.
A MapReduce Workflow System for Architecting Scientific Data Intensive Applications
By Phuong Nguyen and Milton Halem (phuong3 or halem @umbc.edu)
Introduction
• Scientific workflow systems
  • Support the integration of components for data acquisition, transformation, and analysis to build complex data-analysis applications
  • Computation is often used to derive new data from existing data
  • Many scientific workflow systems are data-driven or even dataflow-oriented: computational steps are represented as nodes, and the data passed between these tasks is made explicit as edges between them
• Advantages of scientific workflows compared to script-based approaches
  • Component model
  • Data source discovery
  • Provenance framework
  • Parallelism
• Limitations
  • Little support for data modeling; NetCDF and XML support is limited in Kepler and Triana
  • Scientific workflows should tolerate certain changes in the structure of their input data and be easy to modify
  • Parallelism
  • Optimization is not performed automatically
The MapReduce workflow system
• Exploits data parallelism through the MapReduce programming model
• We propose a workflow system for integrating, structuring, and orchestrating MapReduce jobs in scientific data intensive workflows. The system consists of:
  • Workflow design tools: a simple workflow design C++ API; an XML design tool will also be supported
  • Automatic optimization: a job scheduler and a runtime support system for the Hadoop or Sector/Sphere frameworks
  • Support for data modeling: data parallel processing that scales out linearly with HBase data storage and management (random access, real-time read/write access for data intensive applications), plus data transfer tools (for HDF, NetCDF); see the sketch after this list
• Use case: a climate satellite data intensive processing and analysis application
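The HBase-backed data modeling layer can be pictured with a minimal client sketch. It writes and then reads one gridded value through the HTable/Put/Get client API of that era; the table name airs_grid, the column family data, and the row-key layout (day plus 100 km grid cell) are hypothetical illustrations, not the schema used in the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GridStoreSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table: one row per grid cell and day, one column per variable.
            HTable table = new HTable(conf, "airs_grid");

            String rowKey = "2005001-lat40-lon100";   // day-of-year plus 100 km grid cell (assumed layout)
            Put put = new Put(Bytes.toBytes(rowKey));
            put.add(Bytes.toBytes("data"), Bytes.toBytes("brightness_temp"),
                    Bytes.toBytes(Double.toString(251.7)));
            table.put(put);                            // random, real-time write

            Result result = table.get(new Get(Bytes.toBytes(rowKey)));   // random read
            byte[] cell = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("brightness_temp"));
            System.out.println(rowKey + " -> " + Bytes.toString(cell));
            table.close();
        }
    }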
The proposed workflow system w = <J, D, E>
• w is the workflow; J is a set of jobs, the vertices of a DAG; E is a set of directed edges; D is a set of datasets. The input dataset I and the output dataset O of the workflow w are special types of D.
• A directed edge J1 => J2 states that job J1 produces a dataset d and job J2 consumes it as input.
• Let t_j be the estimated execution time of a job j, estimated from the previous execution of the job on a unit of its input dataset.
• For a given job j, the estimated remaining time is then derived from t_j and the amount of input data j has not yet processed (a sketch of this estimate follows).
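A minimal Java sketch of the w = <J, D, E> model and the timing estimate above. The class and field names are illustrative, and the remaining-time rule (t_j multiplied by the unprocessed input size) is an assumption that follows from t_j being a per-unit estimate, not necessarily the paper's exact formula.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative DAG model of a workflow w = <J, D, E>.
    class Dataset {
        String name;
        double sizeUnits;      // input size measured in the same "unit input" used to calibrate jobs
        Dataset(String name, double sizeUnits) { this.name = name; this.sizeUnits = sizeUnits; }
    }

    class Job {
        String name;
        double timePerUnit;    // t_j: estimated execution time per unit of input,
                               // calibrated from a previous run of the job
        List<Dataset> inputs = new ArrayList<>();
        List<Job> successors = new ArrayList<>();   // directed edges J1 => J2
        Job(String name, double timePerUnit) { this.name = name; this.timePerUnit = timePerUnit; }

        // Assumed remaining-time estimate: t_j times the amount of input not yet processed.
        double estimatedRemainingTime(double unprocessedUnits) {
            return timePerUnit * unprocessedUnits;
        }
    }

    public class WorkflowModelSketch {
        public static void main(String[] args) {
            Dataset level1b = new Dataset("MODIS_L1B_day", 48.0);   // e.g. 48 GB treated as 48 units
            Job gridding = new Job("gridding", 2.5);                // hypothetical: 2.5 min per unit
            Job averaging = new Job("monthly_average", 0.4);
            gridding.inputs.add(level1b);
            gridding.successors.add(averaging);                     // gridding => averaging edge

            System.out.printf("Remaining time for %s: %.1f min%n",
                    gridding.name, gridding.estimatedRemainingTime(level1b.sizeUnits));
        }
    }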
The MapReduce constructs and examples
• MapReduce (a hedged gridding example in these terms follows below)
  • map(k1, v1) -> list(k2, v2)
  • reduce(k2, list(v2)) -> list(v3)
Figure 1. An example of the climate data processing and analysis workflow in steps.
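To make the map and reduce signatures concrete, here is a hedged Hadoop sketch of a gridding step: the mapper keys each observation by a roughly 100 km (one-degree) grid cell and the reducer averages the values per cell. The comma-separated input format and the class names GriddingMapper/GriddingReducer are assumptions for illustration, not the authors' code.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map(k1, v1) -> list(k2, v2): emit (gridCell, observation)
    class GriddingMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed input record: "lat,lon,value"
            String[] fields = line.toString().split(",");
            double lat = Double.parseDouble(fields[0]);
            double lon = Double.parseDouble(fields[1]);
            double value = Double.parseDouble(fields[2]);
            // Roughly 1 degree ~ 100 km: bin observations to an integer-degree cell.
            String cell = (int) Math.floor(lat) + ":" + (int) Math.floor(lon);
            context.write(new Text(cell), new DoubleWritable(value));
        }
    }

    // reduce(k2, list(v2)) -> list(v3): average all observations that fall in a cell
    class GriddingReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text cell, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            long count = 0;
            for (DoubleWritable v : values) {
                sum += v.get();
                count++;
            }
            context.write(cell, new DoubleWritable(sum / count));
        }
    }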
Use case: a climate data intensive application
• AIRS and MODIS Level 1B data over 10 years ~ petabytes
• AIRS and MODIS gridded datasets at 100 km ~ terabytes
Figure 2. Use case: a scientific climate application.
Figure 3. Left: performance of gridding 1 day of MODIS data (48 GB) on bluegrit, the PPC cluster (524 MB RAM per node). Right: speedup comparison between sequential data processing and data processing using the Sector/Sphere system on 48 GB of MODIS data on the bluegrit Intel cluster (32 GB RAM per node).
Figure 4. Left: performance of Hadoop MapReduce averaging of the AIRS gridded dataset stored in HBase tables on 9 Intel nodes. Right: Hadoop gridding with different input dataset sizes vs. non-Hadoop gridding (text format).
Summary
• We propose a workflow system for scientific applications focused on:
  • Exploiting data parallelism through the MapReduce programming model
  • Automatic optimization: scheduling MapReduce jobs to Hadoop or Sector/Sphere clusters
  • Support for data modeling: data storage and management through the Hadoop database (HBase) and data transfer tools (HDF, NetCDF)