
A MapReduce Workflow System for Architecting Scientific Data Intensive Applications


Presentation Transcript


  1. A MapReduce Workflow System for Architecting Scientific Data Intensive Applications. By Phuong Nguyen and Milton Halem, phuong3 or halem @umbc.edu

  2. Introduction
  • Scientific workflow systems
  • Support the integration of components for data acquisition, transformation, and analysis to build complex data-analysis applications
  • Computation is often used to derive new data from existing data
  • Many scientific workflow systems are data-driven or even dataflow-oriented: computational steps are represented as nodes, and the data passed between these tasks is made explicit as edges between them
  • Advantages of scientific workflows compared to script-based approaches: component model, data source discovery, provenance framework, parallelism
  • Limitations
  • Little support for data modeling; NetCDF and XML support is limited in Kepler and Triana
  • Scientific workflows should tolerate certain changes in the structure of their input data and should be easy to modify
  • Parallelism
  • Optimization is not performed automatically

  3. The MapReduce workflow system
  • Exploits data parallelism through the MapReduce programming model
  • We propose a workflow system for integrating, structuring, and orchestrating MapReduce jobs for scientific data-intensive workflows. The system consists of:
  • Workflow design tools: a simple workflow design C++ API; an XML design tool will also be supported (a hypothetical usage sketch follows this list)
  • Automatic optimization: a job scheduler and a runtime support system for the Hadoop and Sector/Sphere frameworks
  • Support for data modeling: data-parallel processing that scales out linearly with HBase data storage and management (random access, real-time read/write access for data-intensive applications), plus data transfer tools (for HDF, NetCDF)
  • Use case: a climate satellite data-intensive processing and analysis application
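The C++ design API itself is not shown in this presentation, so the following is only a minimal sketch, under assumed names (Workflow, Job, addJob, edges), of what declaring a small three-step climate workflow with such an API could look like; it is not the authors' actual interface.

```cpp
// Hypothetical sketch of a workflow-design API in the spirit of the C++ API
// described above; all type, method, and dataset names are assumptions.
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// A job is a named MapReduce step together with the datasets it reads and writes.
struct Job {
    std::string name;
    std::vector<std::string> inputs;   // dataset names consumed
    std::vector<std::string> outputs;  // dataset names produced
};

// A workflow is a DAG of jobs connected by the datasets they exchange.
class Workflow {
public:
    void addJob(const Job& j) { jobs_.push_back(j); }

    // Derive edges: j1 => j2 whenever an output of j1 is an input of j2.
    std::vector<std::pair<std::string, std::string>> edges() const {
        std::vector<std::pair<std::string, std::string>> e;
        for (const auto& producer : jobs_)
            for (const auto& d : producer.outputs)
                for (const auto& consumer : jobs_)
                    for (const auto& in : consumer.inputs)
                        if (d == in) e.emplace_back(producer.name, consumer.name);
        return e;
    }

private:
    std::vector<Job> jobs_;
};

int main() {
    // Three steps of a climate-gridding workflow (names are illustrative only).
    Workflow w;
    w.addJob({"grid_airs",  {"airs_l1b"},  {"airs_gridded"}});
    w.addJob({"grid_modis", {"modis_l1b"}, {"modis_gridded"}});
    w.addJob({"analyze",    {"airs_gridded", "modis_gridded"}, {"climatology"}});

    for (const auto& [from, to] : w.edges())
        std::cout << from << " => " << to << "\n";
}
```

Deriving the edges from the datasets each job produces and consumes mirrors the formal model on the next slide, where a directed edge exists exactly when one job's output dataset is another job's input.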

  4. The proposed workflow system w = &lt;J, D, E&gt;
  • w is the workflow; J is the set of jobs, the vertices of a DAG; E is the set of directed edges; D is the set of datasets. The input dataset I and output dataset O of the workflow w are special elements of D.
  • A directed edge j1 => j2 states that job j1 produces a dataset d and job j2 consumes it as an input.
  • Let t_j be the estimated execution time of job j, estimated from previous executions of the job on a unit of input data.
  • For a given job j, the estimated remaining time follows from t_j and the amount of input data j has not yet processed (see the sketch below).
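The inline formula for the remaining time did not survive in this transcript. As a hedged reconstruction only, one plausible reading, assuming the per-unit estimate simply scales with the unprocessed input, is the following; the symbols r_j and |d_j^rem| are introduced here and do not appear on the slide.

```latex
% Assumed form, not the authors' stated formula:
%   t_j          estimated execution time of job j per unit of input,
%                measured from previous runs of the job
%   |d_j^{rem}|  amount of input data that job j has not yet processed
r_j \;=\; t_j \cdot \lvert d_j^{\mathrm{rem}} \rvert
```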

  5. The MapReduce Constructs and Examples
  • MapReduce
  • map(k1, v1) -> list(k2, v2)
  • reduce(k2, list(v2)) -> list(v3)
  Figure 1. An example of the climate data processing and analysis workflow in steps.
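To make the two signatures concrete, here is a small, self-contained C++ sketch of a gridding step in the spirit of Figure 1: map keys each satellite observation by its grid cell, and reduce averages the values that landed in each cell. The record layout, the 1-degree cell size, and all names (Obs, mapObs, reduceCell) are illustrative assumptions, not the authors' code.

```cpp
// Illustrative sketch only: a tiny in-memory map/reduce for gridding
// observations to cells, matching the signatures above.
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Obs { double lat, lon, value; };   // one satellite observation

// map(k1, v1) -> list(k2, v2): key each observation by its grid cell.
std::vector<std::pair<std::string, double>> mapObs(const Obs& o) {
    // 1-degree cells stand in for the 100 km grid mentioned in the use case.
    std::string cell = std::to_string(int(o.lat)) + "," + std::to_string(int(o.lon));
    return {{cell, o.value}};
}

// reduce(k2, list(v2)) -> list(v3): average all values that fell in one cell.
std::vector<double> reduceCell(const std::string&, const std::vector<double>& vs) {
    double sum = 0;
    for (double v : vs) sum += v;
    return {vs.empty() ? 0.0 : sum / vs.size()};
}

int main() {
    std::vector<Obs> obs = {{38.9, -76.9, 287.1}, {38.2, -76.1, 288.4}, {40.5, -75.0, 285.9}};

    // Shuffle phase: group mapped values by key (grid cell).
    std::map<std::string, std::vector<double>> groups;
    for (const auto& o : obs)
        for (const auto& [cell, v] : mapObs(o)) groups[cell].push_back(v);

    for (const auto& [cell, vs] : groups)
        std::cout << cell << " -> " << reduceCell(cell, vs)[0] << "\n";
}
```

In the real system these functions would run as Hadoop or Sector/Sphere map and reduce tasks over HBase-backed data; the in-memory grouping here only stands in for the framework's shuffle by key.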

  6. Use case: a climate data intensive application
  • AIRS and MODIS level 1B data over 10 years ~ petabytes
  • AIRS and MODIS gridded datasets at 100 km ~ terabytes
  Figure 2. Use case: a scientific climate application.

  7. Figure 3. Left: performance of gridding one day of MODIS data (48 GB) on bluegrit, the PPC cluster (524 MB RAM per node). Right: speedup of the Sector/Sphere system over sequential data processing on 48 GB of MODIS data on the bluegrit Intel cluster (32 GB RAM per node).
  Figure 4. Left: performance of Hadoop MapReduce averaging the AIRS gridded dataset stored in HBase tables, on 9 Intel nodes. Right: Hadoop gridding with different input dataset sizes vs. non-Hadoop gridding (text format).

  8. Summary
  • We propose a workflow system for scientific applications focused on:
  • Exploiting data parallelism through the MapReduce programming model
  • Automatic optimization: scheduling MapReduce jobs to Hadoop or Sector/Sphere clusters
  • Support for data modeling: data storage and management through the Hadoop database (HBase) and data transfer tools (HDF, NetCDF)
