Provenance (for Earth science data)

Provenance (for Earth science data). DKRZ-Seminar, Oct 15 2012. Agenda: What is provenance? Why do we care? Alternative provenance definitions. Gathering and representing provenance information. Further resources. What is provenance, and what does it mean in our context?


  1. Provenance (for Earth science data) DKRZ-Seminar, Oct 15 2012

  2. Agenda DKRZ seminar: Provenance • What is provenance? • Why do we care? • Alternative provenance definitions • Gathering and representing provenance information • Further resources

  3. What is provenance? And what does it mean in our context?

  4. Provenance: Definition • For data produced by computer systems: • "The provenance of a piece of data is the process that led to that piece of data." (Moreau 2010) • This is a generic base definition. • Other terms: data lineage, history • (provenance also applies to food ingredients, works of art, ...)

  5. Our context… • What is our context? • digital-born ESM output data • observational data, e.g. remote sensing imagery • various processed derivatives • What are its characteristics? • complex, non-standardized toolchain • various processing steps by various actors • no single infrastructure

  6. Use Cases (1) • Quality of scientific data • The processing history of a data object forms an important part of its scientific context. • Users who did not create a data product must be able to understand the implications that went into its creation. • Data may be reused many years after creation.

  7. Use Cases (2) • Reproducibility • If processing steps are recorded in detail, a future user may reproduce them to get the exact same results • May be impossible for ESM output data in all its depth • We can't archive the supercomputer itself • Yet, try to capture as much as possible

  8. Use Cases (3) • Attribution • Give credit to the original data producer • Citing a DataCite DOI may not be enough • Who is using data that is generated with DKRZ's resources? • Provenance can enable anyone to trace back to the original source and producer

  9. Data-intensive science (The Fourth Paradigm, 2009) • The scenarios grow more important with data-intensive science • Data is shared across scientific communities • Focus shifts from data production to data analysis

  10. Provenance and the data life cycle • Provenance may cover the whole data life cycle • Here: focus on earlier parts • data generation • data processing

  11. Alternative provenance definitions: There's more than one!

  12. The task • The task here: Develop an understanding of provenance that is specific and pragmatic in our context.

  13. Provenance definitions • Database context: Why-Provenance, Where-Provenance • Moreau (2010): Provenance as a process, Provenance as a Directed Acyclic Graph • (there are more…)

  14. Why-Provenance (Moreau (2010), Buneman et al. (2001)) • Context of database queries • Why-Provenance: "tuples whose presence justifies a query result" • "Why is X part of the result?" • "Because the queried input data contains tuple A"
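The why-provenance question from slide 14 can be illustrated with a toy query. The following is a minimal sketch, not the formal definition from Buneman et al.: the `Row` type, the table contents and the function name are made up for illustration, and for a simple projection every matching input tuple is treated as a witness on its own.

```python
from __future__ import annotations

from typing import NamedTuple


class Row(NamedTuple):
    """One tuple of a hypothetical input table (illustrative names only)."""
    dataset: str
    variable: str


def variables_with_why(rows: list[Row]) -> dict[str, list[Row]]:
    """Project the table onto `variable` and keep, for every result value,
    the input tuples whose presence justifies that value."""
    why: dict[str, list[Row]] = {}
    for row in rows:
        why.setdefault(row.variable, []).append(row)
    return why


table = [
    Row("run_A", "tas"),
    Row("run_B", "pr"),
    Row("run_C", "tas"),
]

result = variables_with_why(table)
# "Why is 'tas' part of the result?" -> because the input contains these tuples.
for value, witnesses in result.items():
    print(value, "because of", witnesses)
```

Each result value carries the answer to "Why is X part of the result?" with it, so no second query against the input table is needed.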

  15. Where-Provenance (Moreau (2010), Buneman et al. (2001)) • A website displays a typo in a menu entry • What is the database field this string comes from? • This may not be the database directly connected to the website, but e.g. a citation database maintained elsewhere and queried by the site • Helps to illuminate the copying of information across databases.
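The typo scenario from slide 15 can be sketched by letting every value carry the location it was originally read from. This is only an illustration under assumptions: the `Located` wrapper, the database name and the field names are hypothetical and do not come from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Located:
    """A value plus the (database, table, field) it was originally read from."""
    value: str
    source: tuple[str, str, str]


# Hypothetical citation database maintained elsewhere; note the typo.
citation_db = {("papers", "title"): "Provenence on the Web"}


def fetch(db_name: str, db: dict, table: str, field: str) -> Located:
    """Read a field and remember where it came from."""
    return Located(db[(table, field)], (db_name, table, field))


def build_menu_entry(title: Located) -> Located:
    """Copy and decorate the string; the source location travels with it."""
    return Located("Read: " + title.value, title.source)


entry = build_menu_entry(fetch("citation_db", citation_db, "papers", "title"))
print(entry.value)   # the string shown on the website
print(entry.source)  # the field where the typo has to be fixed
```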

  16. Provenance as a process (1) (Moreau (2010)) • The computation that resulted in the data • Any data, event, or user action that can be connected to the data through a computational process potentially belongs to its provenance

  17. Provenance as a process (2) (Moreau (2010)) • ESM execution: The context can get very vast. • model source code • all parameters, model conditions, forcings • user name, libraries, OS version, parallel architecture • ...
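A small part of the execution context listed on slide 17 can be collected automatically with the Python standard library. The sketch below is an assumption about how such a record could look: the field names and the `resolution` parameter are invented for illustration, and model source code, forcings and library versions are deliberately left out.

```python
from __future__ import annotations

import getpass
import json
import platform
import sys
from datetime import datetime, timezone


def current_user() -> str:
    """Return the login name, or 'unknown' if the system cannot provide one."""
    try:
        return getpass.getuser()
    except Exception:
        return "unknown"


def capture_context(parameters: dict) -> dict:
    """Collect a JSON-serializable snapshot of the run-time environment."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": current_user(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "parameters": parameters,  # hypothetical model parameters
    }


context = capture_context({"resolution": "T63"})
print(json.dumps(context, indent=2))
```

Even this short record shows how quickly the context grows once libraries, compiler flags and the parallel architecture are added.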

  18. Provenance as a DAG • What is a Directed Acyclic Graph? ... What is a graph?

  19. What is not a graph? • These are not (mathematical) graphs.

  20. What is a graph? (Wikipedia) • This is a graph. Graphs are everywhere.

  21. What is a graph? • Graph theory: A graph consists of a set of nodes (vertices) and a set of edges • G = (V, E) • any e ∈ E is an unordered pair {v1, v2} with v1, v2 ∈ V

  22. What is a directed graph? • Directed graph • directed edges • the pair (v1, v2) is ordered; the edge is directed from v1 to v2

  23. What is a Directed Acyclic Graph? • Directed Acyclic Graph (DAG) • directed edges • no cycles allowed!
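The definitions on slides 21 to 23 translate directly into code: a node set V, a set E of ordered pairs, and a test for cycles. The sketch below uses Kahn's topological-sort algorithm for the test; the talk does not prescribe an algorithm, so this choice and the function name `is_dag` are assumptions.

```python
from __future__ import annotations

from collections import deque


def is_dag(nodes: set[str], edges: set[tuple[str, str]]) -> bool:
    """Return True if the directed graph G = (V, E) contains no cycle."""
    indegree = {v: 0 for v in nodes}
    successors: dict[str, list[str]] = {v: [] for v in nodes}
    for v1, v2 in edges:  # each edge is directed from v1 to v2
        successors[v1].append(v2)
        indegree[v2] += 1

    # Repeatedly remove nodes that have no remaining incoming edge.
    queue = deque(v for v in nodes if indegree[v] == 0)
    visited = 0
    while queue:
        v = queue.popleft()
        visited += 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)

    # Nodes on a cycle never reach in-degree 0 and are therefore never visited.
    return visited == len(nodes)


V = {"a", "b", "c"}
print(is_dag(V, {("a", "b"), ("b", "c")}))              # True
print(is_dag(V, {("a", "b"), ("b", "c"), ("c", "a")}))  # False
```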

  24. Provenance as a Directed Acyclic Graph (Moreau (2010); figure label: cdo) • Simplified, data-centric view • Nodes represent data items • Edges represent derivation operations • "predecessor", "successor", "derived-from", ... • uni- or bidirectional • Level of detail depends on the use cases
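In the data-centric view of slide 24, tracing a data item back to its original sources is a plain graph traversal over "derived-from" edges. The following is a minimal sketch; the file names and the dictionary layout are hypothetical examples, not taken from the talk.

```python
from __future__ import annotations

# derived_from[x] lists the data items that x was directly derived from.
derived_from: dict[str, list[str]] = {
    "model_output.nc": [],
    "climatology.nc": [],
    "monthly_mean.nc": ["model_output.nc"],
    "anomaly.nc": ["monthly_mean.nc", "climatology.nc"],
}


def ancestors(item: str, graph: dict[str, list[str]]) -> set[str]:
    """Return every data item that `item` depends on, directly or indirectly."""
    seen: set[str] = set()
    stack = list(graph.get(item, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def original_sources(item: str, graph: dict[str, list[str]]) -> set[str]:
    """Return the ancestors that were not derived from anything else."""
    return {a for a in ancestors(item, graph) if not graph.get(a)}


print(sorted(ancestors("anomaly.nc", derived_from)))
print(sorted(original_sources("anomaly.nc", derived_from)))
```

The second query corresponds to the attribution use case: it leads from a derived product back to the data that entered the processing chain first.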

  25. Provenance information is part of the metadata • Curating this metadata manually is tedious and practically impossible • It is agreed that provenance gathering must be automated • View a provenance record as something created on the fly, rather than a stored document • provenance is the result of a query over process assertions (Moreau 2010)
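The idea that a provenance record is the result of a query over process assertions, rather than a stored document, can be sketched as follows. The `Assertion` record, the process labels and the file names are assumptions made for illustration; a real provenance store would hold far richer assertions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One recorded statement: a process turned some inputs into an output."""
    process: str
    inputs: tuple[str, ...]
    output: str


# Flat store of assertions, possibly written by different tools at different times.
assertions = [
    Assertion("model run", (), "raw.nc"),
    Assertion("cdo monmean", ("raw.nc",), "monthly.nc"),
    Assertion("plot", ("monthly.nc",), "figure.png"),
    Assertion("cdo yearmean", ("raw.nc",), "yearly.nc"),
]


def provenance_of(item: str, store: list[Assertion]) -> list[Assertion]:
    """Assemble, on demand, every assertion that contributed to `item`."""
    record: list[Assertion] = []
    pending = [item]
    seen: set[str] = set()
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        for assertion in store:
            if assertion.output == current:
                record.append(assertion)
                pending.extend(assertion.inputs)
    return record


for step in provenance_of("figure.png", assertions):
    print(step.process, "->", step.output)
```

The record for `figure.png` is never stored anywhere; it exists only as the answer to the query, and it changes automatically when new assertions are added to the store.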

  26. Applications • What is out there to gather, represent, and exploit provenance?

  27. Gathering provenance information • Many tools exist to capture provenance through an encompassing system (particularly workflow systems) • Lots of research and academic prototypes • A list is available at http://www.openprovenance.org • Provenance information may be aggregated in a specific database (provenance store)

  28. Gathering provenance: workflow systems • Scientific workflow systems • e.g. Taverna, Kepler, VisTrails, ... • Advantages • Potentially good coverage • Improves collaboration and knowledge transfer • Disadvantages • all or nothing • high migration costs • do not match user's traditional workflow

  29. Gathering provenance information - alternative • For us: no overarching system possible in the mid-term • Alternative idea: capture provenance in small pieces by enhancing the existing tools of the research environments

  30. Gathering provenance: in small steps • Advantages • results scale well with implementation effort • potentially small change to user workflow • Disadvantages • fragmentary coverage • incoherent, potentially chaotic information • mandates strict standardization
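Capturing provenance in small pieces by enhancing existing tools (slides 29 and 30) could, for a Python-based tool, look like the decorator below. This is a sketch under assumptions: the record fields, the in-memory `PROVENANCE_LOG` list standing in for a provenance store, and the `monthly_mean` function are all hypothetical.

```python
from __future__ import annotations

import functools
import json
from datetime import datetime, timezone

# Stand-in for a provenance store; a real setup would write to a shared service.
PROVENANCE_LOG: list[str] = []


def record_provenance(func):
    """Wrap a processing step so that every call leaves one standardized record."""
    @functools.wraps(func)
    def wrapper(inputs: list[str], output: str, **params):
        result = func(inputs, output, **params)
        PROVENANCE_LOG.append(json.dumps({
            "process": func.__name__,
            "inputs": inputs,
            "output": output,
            "parameters": params,
            "time": datetime.now(timezone.utc).isoformat(),
        }, sort_keys=True))
        return result
    return wrapper


@record_provenance
def monthly_mean(inputs: list[str], output: str, **params) -> str:
    # Placeholder for the real processing step, which would write `output`.
    return output


monthly_mean(["raw.nc"], "monthly.nc", variable="tas")
print(PROVENANCE_LOG[-1])
```

The user's call looks the same as before, which keeps the change to the workflow small. The fixed set of record fields reflects the standardization that the small-steps approach depends on.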

  31. Representation of provenance information • Provenance information can be represented in many formats • human-interpretable: log files written for humans, free text • machine-interpretable: traversable graphs • simple (A derived from B) • complex (semantic graph, Open Provenance Model)

  32. Machine-interpretable representation required? • Machine-interpretable representation of provenance: what is the desired level of detail? • more detail requires a more sophisticated representation language • So the core question is: does your use case require sophisticated machine-interpretable representation? • remember: machine-interpretability is for tools, not for humans

  33. Representation formats • Machine-interpretable representation of gathered information • Designed to span across systems • Standardized representation formats • 2010: Open Provenance Model (OPM) • 2012: W3C PROV (Draft status) • The same set of people are involved

  34. OPM: Agents, Processes and Artifacts • OPM and W3C PROV are both graph-based representations • In the following: The Open Provenance Model (OPM) in brief

  35. Open Provenance Model: base elements (Moreau et al. (2010b)) • Agents • cdo, user • Processes • calculate monthly means • Artifacts / Entities • input and output data, log file • All of these describe the past, i.e. executions that have already happened.
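The three OPM base elements and the slide's own example (cdo, user, calculate monthly means, input and output data) can be put together as a tiny graph. The edge names `used`, `wasGeneratedBy`, `wasControlledBy` and `wasDerivedFrom` are OPM dependency types; the `Node` and `OPMGraph` classes are a hypothetical in-memory sketch, not an official OPM API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    kind: str  # "agent", "process" or "artifact"
    name: str


@dataclass
class OPMGraph:
    # Each edge points from an effect to its cause, as OPM dependencies do.
    edges: list[tuple[Node, str, Node]] = field(default_factory=list)

    def add(self, effect: Node, relation: str, cause: Node) -> None:
        self.edges.append((effect, relation, cause))

    def causes_of(self, node: Node) -> list[tuple[str, Node]]:
        return [(rel, cause) for effect, rel, cause in self.edges if effect == node]


cdo = Node("agent", "cdo")
user = Node("agent", "user")
monmean = Node("process", "calculate monthly means")
inp = Node("artifact", "input data")
out = Node("artifact", "output data")

g = OPMGraph()
g.add(monmean, "used", inp)
g.add(out, "wasGeneratedBy", monmean)
g.add(monmean, "wasControlledBy", cdo)
g.add(monmean, "wasControlledBy", user)
g.add(out, "wasDerivedFrom", inp)

for relation, cause in g.causes_of(out):
    print("output data", relation, cause.name)
```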

  36. Motivations for W3C PROV • W3C PROV continues the work of OPM • Roughly: align it to RDF/OWL and other Semantic Web standards
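To show what alignment with RDF means in practice, the monthly-means example can be written as RDF triples and serialized as Turtle. The sketch below uses terms from the PROV-O vocabulary as published by the W3C (`prov:Entity`, `prov:Activity`, `prov:Agent`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasDerivedFrom`, `prov:wasAssociatedWith`); the `ex:` namespace, the resource names and the hand-written serializer are assumptions for illustration, and an RDF library would normally do this work.

```python
from __future__ import annotations

PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
    "@prefix ex: <http://example.org/> .\n"
)


def to_turtle(triples: list[tuple[str, str, str]]) -> str:
    """Serialize (subject, predicate, object) triples as simple Turtle statements."""
    lines = [f"{s} {p} {o} ." for s, p, o in triples]
    return PREFIXES + "\n" + "\n".join(lines) + "\n"


triples = [
    ("ex:input", "a", "prov:Entity"),
    ("ex:output", "a", "prov:Entity"),
    ("ex:monthlyMeans", "a", "prov:Activity"),
    ("ex:cdo", "a", "prov:Agent"),
    ("ex:monthlyMeans", "prov:used", "ex:input"),
    ("ex:output", "prov:wasGeneratedBy", "ex:monthlyMeans"),
    ("ex:output", "prov:wasDerivedFrom", "ex:input"),
    ("ex:monthlyMeans", "prov:wasAssociatedWith", "ex:cdo"),
]

print(to_turtle(triples))
```

OPM's artifacts, processes and agents correspond roughly to PROV's entities, activities and agents, which is why the example carries over with little change.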

  37. Querying and viewing provenance • Exploit encoded provenance information? • visualization • querying

  38. Summary: what is provenance? • To summarize this particular view: • Provenance is the result of a query over process assertions. • Such assertions can in their simplest form be represented through an (ever-growing) DAG. • Including more details requires a process view that embraces a larger context. • Provenance information is subject to long-term archiving (LTA)

  39. Bottom-up approach • Suggestion: Start small and simple. • Collect small pieces of information • automatically, infrastructure task, do not burden data producer • Provide tools to gather intelligence from this heap of information • DAG view has an obvious, simple querying model (tree) and is easy to understand and explain • Build the DAG as a base layer, then attach richer context to the nodes or edges

  40. And then some... • Construct a provenance graph using Persistent Identifiers? • PhD topic • DKRZ-Seminar on Persistent Identifiers • Wednesday, 17 Oct • 14-16h • Same place (R34)

  41. Further reading • Luc Moreau: The Foundations for Provenance on the Web (2010) • main influence is Web science • summarizes the research field very well • includes an extensive bibliography • OPM specification: http://www.openprovenance.org • W3C PROV: http://www.w3.org/TR/prov-primer/

  42. The End. Thank you for your attention.

  43. References • Moreau (2010): The Foundations for Provenance on the Web, doi:10.1561/1800000010 • pre-print: http://eprints.soton.ac.uk/268176/ • Moreau et al. (2010b): The Open Provenance Model Core Specification (v1.1), doi:10.1016/j.future.2010.07.005 • The Fourth Paradigm, 2009, Microsoft Research • Buneman et al. (2001): Why and Where: A Characterization of Data Provenance, doi:10.1007/3-540-44503-X_20
