This research aims to address the challenges in managing and reproducing nanoscopy scientific workflows by developing an automated provenance management system integrated into a scientific data repository.
Enabling scientific data reproducibility with automated provenance management in a scientific data repository
Ajinkya Prabhune

Introduction: Nanoscopy
[Figure: microscope image of the whole breast cancer cell]
[Figure: localization image of the same breast cancer cell]
Scientific perspective
• Investigation of “aggressive B-cell lymphomas”
• Novel imaging method capable of capturing images at nanometer resolution
• Microscopy technique: Spectral Precision Distance Microscopy (SPDM)
• High-resolution microscopes at Uni. Heidelberg and Uni. Mainz (DAQ)

Example nanoscopy scientific workflow
Motivation
Managing the data lifecycle of nanoscopy workflows (~150-200 TB)
Nanoscopy scientific workflows are
• defined and executed manually on local machines
• executed manually multiple times to validate the results (experiment reproducibility)
• frequently updated to improve the results (tracing workflow evolution)
The associated provenance is not captured, which prevents
• repeatability and reproducibility of the experiment
• tracing the evolution of the workflow
• analyzing the workflow and its provenance to determine the quality of the results
• comparing workflows to detect flaws and redundancies

Research Questions (Goals)
• Goal 1: Identify a W3C standard provenance model capable of modelling both the workflow and its associated provenance.
• Goal 2: Enable the automated extraction, modelling and storage of provenance information in a W3C-accepted standard provenance model.
• Goal 3: Develop and integrate a provenance management system in a scientific data repository system (NORDR).
• Which questions enable researchers to analyze the workflow and its provenance in order to reproduce the results?

Introduction: Scientific workflow
What is a scientific workflow?
• A set of systematically organised processing steps used to accomplish an in silico scientific task.
What is a workflow language?
• A language for describing the processing tasks and their execution order. Examples: BPEL, SCUFL, MoML.
What is a workflow management system (WfMS) or workflow engine?
• Software that coordinates the processing steps defined in the workflow. Examples: Apache ODE, Taverna, Kepler.
Why are workflows and workflow management systems important?
• They automate the execution of complex tasks in a scientific experiment.
• They enable repeatability of experiments and reproducibility of results.

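The definitions above can be illustrated with a minimal Python sketch: a workflow is a set of steps plus their dependencies, and a toy "engine" runs them in a valid order. The step names and the `run` helper are hypothetical and only loosely modelled on a nanoscopy pipeline; real engines such as Taverna or Kepler do far more than this.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow definition: each step maps to the steps it depends on.
workflow = {
    "acquire_images": set(),
    "localize_molecules": {"acquire_images"},
    "cluster_analysis": {"localize_molecules"},
    "render_report": {"localize_molecules", "cluster_analysis"},
}


def execution_order(steps):
    """Return one valid execution order for the dependency graph."""
    return list(TopologicalSorter(steps).static_order())


def run(steps, actions):
    """Toy engine: call each step's action in dependency order and log it."""
    log = []
    for name in execution_order(steps):
        actions[name]()
        log.append(name)
    return log
```

Keeping the definition (`workflow`) separate from the engine (`run`) is what lets the same workflow be re-executed unchanged, which is the basis for repeatability.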
Introduction: Provenance
According to the Oxford dictionary, provenance is “The place of origin or earliest known history of something”.
Comprehensive provenance comprises two aspects:
• Prospective provenance is the workflow definition, comprising the processing steps that are to be executed (the execution plan or recipe).
• Retrospective provenance is the record of the actual runtime events that occurred during the execution of the workflow.
Various provenance models are available for modelling the provenance information:
• OPM and the W3C PROV standard model only the retrospective provenance.
• ProvONE, which extends W3C PROV, models both prospective and retrospective provenance.

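The distinction between the two aspects can be sketched in Python as follows. This is an illustrative data model only, not the ProvONE schema: the class names, field names, and the `execute` helper are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ProspectiveStep:
    """Planned processing step: part of the workflow definition (the recipe)."""
    name: str
    inputs: list
    outputs: list


@dataclass
class RetrospectiveEvent:
    """Record of what actually happened when a planned step was executed."""
    step: str
    started: datetime
    ended: datetime
    used: dict        # input name -> concrete value actually consumed
    generated: dict   # output name -> concrete value actually produced


def execute(step, func, **inputs):
    """Run one step and capture its retrospective provenance as a side product."""
    started = datetime.now(timezone.utc)
    result = func(**inputs)
    ended = datetime.now(timezone.utc)
    event = RetrospectiveEvent(
        step=step.name,
        started=started,
        ended=ended,
        used=dict(inputs),
        generated={step.outputs[0]: result},  # sketch assumes a single output
    )
    return result, event
```

The `ProspectiveStep` exists before anything runs; a `RetrospectiveEvent` exists only after an execution, and a step that is run three times yields three events for one plan.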
Introduction: ProvONE standard
• The ProvONE provenance model extends the W3C PROV standard.
• Existing metadata vocabularies (for example, Dublin Core terms) are easy to integrate for modelling contextual provenance information.

Vocabulary mapping between workflow specifications and ProvONE
• To allow a precise and complete conversion from a given workflow specification to ProvONE, each term defined in the workflow specification should be mapped to a term in ProvONE.
• The vocabulary mappings are the basis for the Prov2ONE algorithm.
• The mappings are defined formally through the RDF Mapping Language (RML) or the SKOS mapping vocabulary.

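As a rough illustration of the completeness requirement, a vocabulary mapping can be held as a lookup table and checked for unmapped terms. The sketch below uses plain Python rather than RML or SKOS; the ProvONE names follow the slides, while the SCUFL term spellings and both helper functions are assumptions made for this example.

```python
# Illustrative SCUFL-to-ProvONE mapping. Target names follow the slides;
# the source-term spellings are assumptions, not the published mapping table.
SCUFL_TO_PROVONE = {
    "workflow": "Workflow",
    "processor": "Process",
    "inputPort": "InputPort",
    "outputPort": "OutputPort",
    "datalink": "DataLink",
}


def unmapped_terms(spec_terms, mapping):
    """Return the specification terms that have no ProvONE counterpart."""
    return sorted(set(spec_terms) - set(mapping))


def translate(term, mapping):
    """Translate one term, failing loudly when the mapping is incomplete."""
    try:
        return mapping[term]
    except KeyError:
        raise KeyError(f"no ProvONE mapping for term {term!r}") from None
```

Running `unmapped_terms` over every term of a specification before conversion is one simple way to confirm that the mapping is complete.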
Vocabulary mapping: BPEL to ProvONE
Vocabulary mapping: SCUFL to ProvONE
Vocabulary mapping: MoML to ProvONE
Prov2ONE:BPEL
The Prov2ONE:BPEL algorithm is split into two components:
• Component 1 maintains the structural definition of the workflow.
• Component 2 constructs the ProvONE prospective graph.

Prov2ONE:SCUFL
The Prov2ONE:SCUFL algorithm has a single component:
• In the first part, ProvONE Processes and their associated InputPorts and OutputPorts are drawn.
• In the second part, ProvONE DataLinks and SeqCtrlLinks are drawn based on the SCUFL datalinks.

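The two parts described above can be sketched as two passes over an already-parsed workflow. This is not the published Prov2ONE:SCUFL algorithm, only a simplified illustration: the input layout, the `hasInputPort`/`hasOutputPort` edge names, and the choice to model DataLinks and SeqCtrlLinks as edges are all assumptions.

```python
def scufl_to_provone(processors, datalinks):
    """Build a simplified ProvONE-style prospective graph.

    processors: {name: {"inputs": [...], "outputs": [...]}}   (hypothetical layout)
    datalinks:  [(src_processor, src_port, dst_processor, dst_port), ...]
    Returns (nodes, edges) as lists of tuples.
    """
    nodes, edges = [], []

    # Part 1: one Process node per processor, plus a node per port.
    for name, ports in processors.items():
        nodes.append(("Process", name))
        for port in ports.get("inputs", []):
            nodes.append(("InputPort", f"{name}.{port}"))
            edges.append(("hasInputPort", name, f"{name}.{port}"))
        for port in ports.get("outputs", []):
            nodes.append(("OutputPort", f"{name}.{port}"))
            edges.append(("hasOutputPort", name, f"{name}.{port}"))

    # Part 2: each datalink connects two ports (DataLink) and implies that
    # the source process runs before the destination process (SeqCtrlLink).
    for src, src_port, dst, dst_port in datalinks:
        edges.append(("DataLink", f"{src}.{src_port}", f"{dst}.{dst_port}"))
        ctrl = ("SeqCtrlLink", src, dst)
        if ctrl not in edges:  # several datalinks may join the same two processes
            edges.append(ctrl)

    return nodes, edges
```

Drawing all nodes before any links means the second pass never refers to a port or process that does not exist yet.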
Prov2ONE:MoML
The Prov2ONE:MoML algorithm has a single component:
• In the first part, a ProvONE Workflow node or Process node is drawn, depending on the property of the MoML entity (Actor or Director).
• In the second part, the MoML relations are iterated and ProvONE DataLinks and SeqCtrlLinks are drawn.

Provenance management framework for scientific data repository
• Prov2ONE algorithm for automatically generating the ProvONE prospective provenance graph
• Services for extracting retrospective provenance from NORDR and the workflow engine
• Dedicated graph database for storing the ProvONE provenance graphs (ArangoDB)
• Provenance challenge queries implemented for querying the retrospective provenance
• Five novel queries for tracing workflow evolution, workflow interoperability and analysing the workflow definition

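A typical retrospective query of the kind listed above asks for everything that contributed to a given result. The sketch below answers it with a breadth-first traversal over an in-memory edge list; it deliberately avoids ArangoDB so as not to assume the framework's actual schema or query text. The relation names `used` and `wasGeneratedBy` come from W3C PROV, while the file and run names are invented sample data.

```python
from collections import deque

# Edges read "subject relation object", as in W3C PROV:
# an activity `used` an entity; an entity `wasGeneratedBy` an activity.
# All identifiers below are invented sample data.
EDGES = [
    ("localize_run1", "used", "raw_image.tif"),
    ("points.csv", "wasGeneratedBy", "localize_run1"),
    ("cluster_run1", "used", "points.csv"),
    ("clusters.json", "wasGeneratedBy", "cluster_run1"),
    ("unrelated_run", "used", "other.tif"),
]


def lineage(edges, target):
    """Return every entity and activity that `target` transitively depends on."""
    upstream = {}
    for subject, _relation, obj in edges:
        upstream.setdefault(subject, []).append(obj)

    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for parent in upstream.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

In a graph database the same question would be expressed as a graph traversal from the target node; the traversal logic is the same, only the storage differs.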
Conclusion
• ProvONE, an extension of the W3C PROV standard, for handling prospective and retrospective provenance
• Automated provenance modelling for BPEL, SCUFL and MoML using the Prov2ONE algorithm
• Storing provenance in a dedicated graph database (ArangoDB)
• IPAW provenance challenge queries implemented for ProvONE
• Five novel queries for retrieving the prospective provenance
• Capture the evolution of a workflow
• Compare, analyse and evaluate heterogeneous workflows
• Provenance management framework for scientific data repositories (NORDR)
Links: REST API, ProvArangoDB

Scientific Contributions
• Prabhune, A., Stotzka, R., Jejkal, T., Hartmann, V., Bach, M., Schmitt, E., Hausmann, M. and Hesser, J., 2015, March. An optimized generic client service API for managing large datasets within a data repository. In Big Data Computing Service and Applications (BigDataService), 2015 IEEE First International Conference on (pp. 44-51). IEEE.
• Prabhune, A., Zweig, A., Stotzka, R., Gertz, M. and Hesser, J., 2016, June. Prov2ONE: An Algorithm for Automatically Constructing ProvONE Provenance Graphs. In International Provenance and Annotation Workshop (pp. 204-208). Springer International Publishing.
• Grunzke, R., Hartmann, V., Jejkal, T., Prabhune, A., et al., 2016, April. Towards a Metadata-driven Multi-community Research Data Management Service. In International Workshop on Science Gateways (IWSG 2016).
• Chandna, S., Tonne, D., Jejkal, T., Stotzka, R., Krause, C., Vanscheidt, P., Busch, H. and Prabhune, A., 2015, February. Software workflow for the automatic tagging of medieval manuscript images (SWATI). In SPIE/IS&T Electronic Imaging (pp. 940206-940206). International Society for Optics and Photonics.
• Jung, C., Gasthuber, M., Giesler, A., Hardt, M., Meyer, J., Prabhune, A., Rigoll, F., Schwarz, K. and Streit, A., 2015. Progress in Multi-Disciplinary Data Life Cycle Management. In Journal of Physics: Conference Series (Vol. 664, No. 3, p. 032018). IOP Publishing.
• Prabhune, A., Zweig, A., Stotzka, R., Gertz, M. and Hesser, J., 2016. Automating the construction of ProvONE provenance graph for scientific workflow specifications. ACM TOIT Journal (in review).
• Prabhune, A., Keshav, A., Ansari, H., Stotzka, R., Gertz, M. and Hesser, J., 2016. MetaStore: Metadata Framework for Scientific Data Repository. Big Data 2016 (in review).
• Prabhune, A., Rainer, S., Gertz, M., Zheng, L. and Hesser, J., 2016. Managing Provenance of Medical Datasets, An Example Case for Documenting the Workflow for Image Processing. HEALTHINF (in review).

Thank you

Literature Review
Domain-specific provenance systems
• Chimera is a virtual data tracking and generation system that is capable of auditing and tracing the provenance of derived data. Workflows are defined using VDL and provenance is captured in VDS [8].
• The myGrid project provides middleware for exposing Grid technologies. Workflows are defined in XSCUFL and provenance is captured in logs (services invoked, parameters, data derivation and time).
• Provenance Aware Service Oriented Architecture (PASOA) aims at building an interoperable provenance infrastructure using the Provenance Recording Protocol (PREP).
• Lineage Information Program (LIP) aims at managing the provenance of spatial databases in semantic networks.

Literature Review
Provenance handling in Workflow Management Systems (WfMS)
• VisTrails provides data and process management support for exploratory computational tasks. Workflows are defined in the VisTrails specification and provenance is captured in an RDBMS.
• The Triana WfMS captures provenance in a proprietary provenance format and stores it in an RDBMS. Provenance of executed components, parameters and input/output data is captured.
• Kepler is based on the Ptolemy II engine. Kepler workflows are defined in the MoML specification and provenance is stored in a proprietary data model in an RDBMS.
• Taverna uses SCUFL to define workflows and stores provenance in proprietary data models in an RDBMS. A plugin is provided for exporting the provenance in the PROV standard.

Introduction: Data-driven vs. Control-driven