1 / 21

Data Management Challenges of Data-Intensive Scientific Workflows

Data Management Challenges of Data-Intensive Scientific Workflows. Ewa Deelman Ann Chervenak University of Southern California Information Sciences Institute. Generating mosaics of the sky (Bruce Berriman, Caltech).

jamil
Download Presentation

Data Management Challenges of Data-Intensive Scientific Workflows

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Management Challenges of Data-Intensive Scientific Workflows Ewa Deelman Ann Chervenak University of Southern California Information Sciences Institute Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  2. Generating mosaics of the sky (Bruce Berriman, Caltech) Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu *The full moon is 0.5 deg. sq. when viewed form Earth, Full Sky is ~ 400,000 deg. sq.

  3. Issues Critical to Scientists • Reproducibility of scientific analyses and processes is at the core of the scientific method • Scientific versus Engineering reproducibility • Scientists consider the “capture and generation of provenance information as a critical part of the workflow-generated data” • “Sharing workflows is an essential element of education, and acceleration of knowledge dissemination.” NSF Workshop on the Challenges of Scientific Workflows, 2006, www.isi.edu/nsf-workflows06 Y. Gil, E. Deelman et al, Examining the Challenges of Scientific Workflows. IEEE Computer, 12/2007 Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  4. Data lifecycle in Workflows Workflow Creation Workflow Reuse Workflow Mapping and Execution Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  5. Workflow Creation • Design a workflow • Find the right components • Set the right parameters • Find the right data • Connect appropriate pieces together • Find the right fillers • Support both experts and novices • Record the workflow creation process (creation provenance—for example VisTrails) Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  6. Challenges in user experiences Users’ expectations vary greatly High-level descriptions Detailed plans that include specific resources Users interactions can be exploratory Or workflows can be iterative Users need progress, failure information at the right level of detail—particularly challenging in distributed environments There is no ONE user but many users with different knowledge and capabilities It is difficult to develop community standards so that data and computations can be uniformly discovered Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  7. Workflow Mapping and Execution—Providing Abstraction • Workflow Data and Component Selection • Find locations, possible resources to support the computations • Perform a correct and efficient mapping • Schedule data movement • Management of data dependencies • Release jobs when ready • Transfer data across sites • Management of Data Transfers • And failures • Asynchronous Data Placement • Maybe it is better to have a separation of concerns between data movement and computation management Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  8. Workflow Mapping and Execution Issues cont’d • Data Storage • Workflows can access and generate large amounts of data • Storage is limited • Hard to find out storage quotas/free space (often managed at a VO level) • No good way to reserve storage • Data Management inside the Resource • Poor NFS performance when many accesses occur • Need to run on a local disk • Data staging within a resource • Virtual Data and Data Reuse • Recognize when intermediate data already exist • Determine whether it is more efficient to access the existing data rather than recompute it Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  9. Pegasus-Workflow Management Systemest. 2001 • Leverages abstraction for workflow description to obtain ease of use, scalability, and portability • Provides a compiler to map from high-level descriptions (workflow instances) to executable workflows • Correct mapping • Performance enhanced mapping • Provides a runtime engine to carry out the instructions (Condor DAGMan) • Scalable manner • Reliable manner In collaboration with Miron Livny, UW Madison, funded under NSF-OCI SDCI Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  10. Mapping Correctly Select where to run the computations Apply a scheduling algorithm Schedule in a data-aware fashion (data transfers, amount of storage) The quality of the scheduling depends on the quality of information Transform task nodes into nodes with executable descriptions Execution location, environment variables initializes, appropriate command-line parameters set Select which data to access and modify workflow Add stage-in nodes to move data to computations Add stage-out nodes to transfer data out of remote sites to storage Add data transfer nodes between computation nodes that execute on different resources Add nodes to create an execution directory on a remote site Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  11. Mapping efficiently Cluster compute nodes in small granularity applications Add data cleanup nodes to remove data from remote sites when no longer needed reduces workflow data footprint Add nodes that register the newly-created data products Provide provenance capture steps Information about source of data, executables invoked, environment variables, parameters, machines used, performance Scale matters--today we can handle: 1 million tasks in the workflow instance (Southern California Earthquake Center--SCEC) 10TB input data (Laser Interferometer Gravitational-Wave Observatory--LIGO) Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  12. Virtual Data and Data Reuse Tension between data access and data regeneration Keeping track of data as it is generated supports workflow-level checkpointing Need to be careful how reuse is done Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  13. Efficient data handling Joint work with Rizos Sakellariou, Manchester University Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu • Input data is staged dynamically • New data products are generated during execution • For large workflows 10,000+ files • Similar order of intermediate and output files • Total space occupied is far greater than available space—failures occur • Solution: • Determine which data are no longer needed and when • Add nodes to the workflow do cleanup data along the way • Issues: • minimize the number of nodes and dependencies added so as not to slow down workflow execution • deal with portions of workflows scheduled to multiple sites

  14. 56% improvement 26% improvement • Full workflow: • 185,000 nodes 466,000 edges • 10 TB of input data • 1 TB of output data. 166 nodes LIGO Workflows Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  15. Interaction Between Workflow Planner and Data Placement Service for Staging Data Compute Cluster Storage Elements Jobs Data Transfer Workflow Tasks Setup Transfers Staging Request Workflow Planner Data Placement Service (Pegasus) (Data Replication Service) Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  16. Montage Workflow: Execution Times with Additional 20 MB Input Files • With asynchronous data staging, execution time is reduced by over 46% Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  17. Data lifecycle in Workflows Workflow Creation Workflow Reuse Workflow Mapping and Execution Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  18. Challenges in Workflow reuse and sharing • How to find what is already there • How to determine the quality of what’s there • How to invoke an existing workflow • How to share a workflow with a colleague • How to share a workflow with a competitor Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  19. Sharing: the new frontier • MyExperiment in the UK (University of Manchester), a repository of workflows http://www.myexperiment.org/ • How do you share workflows across different workflow systems? • How to write a workflow in Pegasus and execute in ASKALON? • NSF/Mellon Workshop on Scientific and Scholarly Workflow, 2007 https://spaces.internet2.edu/display/SciSchWorkflow/Home • How do you interpret results from one workflow when you are using a different workflows system (provenance-level interoperability) • Provenance challenge http://twiki.ipaw.info/ • Open provenance model http://eprints.ecs.soton.ac.uk/14979/1/opm.pdf Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  20. Issues Critical to Scientists • Reproducibility of scientific analyses and processes • Services for finding the right analysis/workflow • Services for finding the right data • Provenance capture • Registering all pertinent data generation steps • Providing the right level of abstraction • Workflow Sharing • User tools for upload and discovery of relevant works • Semantic technologies can play an important role, but needs investment for the Computer Science community and the domain sciences • Reliable “launch and forget” workflow execution is necessary for workflow adoption by scientists Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

  21. Relevant Links Pegasus: pegasus.isi.edu, Gaurang Mehta, Mei-Hui Su, Karan Vahi DAGMan: www.cs.wisc.edu/condor/dagman, Miron Livny, Kent Wenger, and the Condor team (Wisconsin Madison) Gil, Y., E. Deelman, et al. Examining the Challenges of Scientific Workflows. IEEE Computer, 2007. Workflows for e-Science, Taylor, I.J.; Deelman, E.; Gannon, D.B.; Shields, M. (Eds.), Dec. 2006 Montage: montage.ipac.caltech.edu/, Bruce Berriman, John Good, Dan Katz, and Joe Jacobs (Caltech, JPL) LIGO: www.ligo.caltech.edu/, Kent Blackburn, Duncan Brown, Stephen Fairhurst, Scott Koranda (Caltech,UWM) Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu

More Related