
Data Archiving and Data Stewardship

Pirjo-Leena Forsström, Heikki Helin, Kimmo Koivunen, Juha Lehtonen, Kuisma Lehtonen. CHEP 2013.


Presentation Transcript


  1. Data Archiving and Data Stewardship Pirjo-Leena Forsström, Heikki Helin, Kimmo Koivunen, Juha Lehtonen, Kuisma Lehtonen CHEP 2013

  2. Index • Background and motivation • What is preservation? • For whom and what? • How is it done? • Pilots • In conclusion Source: Wikipedia PD image resources

  3. Digitalization of research and cultural processes • Typical: • Growing volume of data and sources • Complexity of data processing • Data is dynamic • High demand for data • Complicated interaction between users and data • Most important challenges: • Managing and processing exponentially growing datasets • Significant acceleration in the analysis cycle • Combining data sources Source: Wikipedia PD image resources

  4. Long-term preservation of research and cultural heritage data • Preservation of digital information is at the core of the activities of research and cultural organizations. • Until now, there has been no controlled and functional way of handling digital information in the long term. • Long-term preservation of digital data means the reliable preservation of digital information for several decades or even hundreds of years. • Equipment, software, and file formats will become outdated, but despite this the information must be preserved in an understandable form.

  5. Digital processes break easily • Short-period funding • Software lifecycle: code, interfaces, formats… • Dependent on expert knowledge • Thin documentation and metadata Source: Wikipedia PD image resources

  6. Future aim: Research and cultural data routinely deposited in well-documented form, regularly and easily consulted and analyzed, and openly accessible while suitably protected and reliably preserved. Needs PERSISTENCE • Coherent organizational framework? • Ownership • Curation • Flexible technical architecture: • Standard open protocols and interfaces • Flexible user access, analysis and visualization of data • Addresses issues of authentication, authorization, security • Supports workflows

  7. Persistence: the holy grail of preservation, and of information management more generally • What does persistence mean? • How long does it persist? • What persists? • What is “guaranteed” to be accessible? Source: Wikipedia PD image resources

  8. Digital Curation • Curation: The activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep them fit for purpose. • Archiving: A curation activity which ensures that data is properly selected and stored, that it can be accessed, and that its logical and physical integrity is maintained over time, including security and authenticity. • Preservation: An activity within archiving in which specific items of data are maintained over time so that they can still be accessed and understood through changes in technology. • Digital curation: Looking after and "adding value" to digital data, ensuring its current and future usefulness. This probably implies creating some new data from the existing data, in order to make the latter more useful and "fit for purpose".

  9. Data Collections • Research collections: Authored by individual investigators and investigator teams. Maintained to serve immediate group participants during the project lifetime. May not conform to any data standards. • Resource collections: Authored by a community of investigators (a specific domain of science or engineering) with community-level standards. Lifetime from mid- to long-term. • Reference collections: Authored by and serving large segments of the science and engineering community, using well-established and comprehensive standards.

  10. Preservation methods • Preserving the original look-and-feel • Emulation • Development of emulators to new platforms etc. • Active testing and technology watch • Preserving the content • Migration • Format development watch (format libraries) • Development of transformation processes, testing, implementation, monitoring • Preparation for recoveries • Preserving the bits • Integrity • File validation and monitoring • Management of copies • Both objects and metadata
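
The "preserving the bits" items (integrity, file validation and monitoring) are commonly implemented as periodic fixity checks. The following is a minimal sketch of that idea, not the system described in the slides: it assumes a hypothetical manifest that maps relative file paths to SHA-256 digests recorded at ingest, and the function names `sha256_of` and `audit` are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest: dict[str, str], root: Path) -> dict[str, str]:
    """Compare the digests recorded in a (hypothetical) manifest with the files on disk.

    Returns a mapping of relative path -> "ok", "missing" or "corrupt".
    """
    report = {}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            report[rel_path] = "missing"
        elif sha256_of(target) != expected:
            report[rel_path] = "corrupt"
        else:
            report[rel_path] = "ok"
    return report
```

Running `audit` on a schedule and on every stored copy would cover both the "integrity" and the "monitoring" bullets; the same check applies to metadata files, since the slide lists both objects and metadata.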

  11. Migration • Migration enables the utilization of digital objects in a new ICT environment • Special care needed to preserve information content: planning, testing and validation with care [Diagram "Migration": two stacks, each with the layers Format, Application, Operating system, Hardware]

  12. Emulation • Emulation enables the use of an old solution in a new hardware environment • Emulation has to solve how the information can be used in the context of new data production (copy-paste) [Diagram: two stacks, each with the layers Format, Application, Operating system, Hardware; one stack also includes an Emulator layer]

  13. Preservation solution has to manage • Authenticity • Integrity • Technology change • Risk management • Preservation metadata management • Scalability of the solution

  14. National Digital Library

  15. Ministry of Education: national collaboration and roadmap for research data • National information infrastructure services for research • TTA provides 2012-2013 • Storage solution IDA • Metadata catalogue KATA • Long-term preservation 2015 – • Pilots starting 2014

  16. Research information infrastructure • Embeddedness • Transparency • Reach of scope • Links with conventions of practice • Embodiment of standards • Build on an open platform: Infrastructure does not grow de novo; it wrestles with the “inertia of the installed base” and inherits strengths and limitations from that base.

  17. Who? Research organizations, museums, archives, libraries

  18. What? - Data Volumes

  19. Research data volumes, first guess 31pt 11,5pt

  20. Digital Preservation Services (1/3) • The digital preservation system will be built according to the Open Archival Information System (OAIS) reference model • Preparation and ingest services • Metadata specification • Preservation plan preparation • Submission information package (SIP) packaging service • SIP ingest and validation • Processing of acceptable file formats (for transfer) to recommended file formats

  21. Digital Preservation Solution • The ISO Reference Model for an Open Archival Information System (OAIS). This reference model is defined by recommendation CCSDS 650.0-B-1 of the Consultative Committee for Space Data Systems; this text is identical to ISO 14721:2003. Source: Long-Term Preservation of Digital Documents. 2006. doi:10.1007/978-3-540-33640-2. ISBN 978-3-540-33639-6. Public Domain.

  22. Digital Preservation Services (2/3) • Preservation services • Archival information package (AIP) from SIP • Development and monitoring of preservation methods and environment • Preservation actions: integrity monitoring, refreshment, replication, migration • Geographical distribution • E.g. Espoo and Kajaani, 550 km apart • http://goo.gl/maps/J5XkX • Digital information search functions • Dissemination information package (DIP)

  23. Digital Preservation Services (3/3) • Digital information management services • Metadata updates • Digital object updates • Removal of digital objects • Preservation plan updates • Advisory and support services • Usage support of the services and the digital preservation system • Administrative support • Training and information services

  24. Roadmap • 1st site: Ingest and bit-level preservation service starts December 2013 • 3 copies in 3 different media • 2nd site 2016 • Geographical distribution • Risk minimization, different process etc. • Fewer copies? • Dark archive • No internet • Minimization of human error [Diagram: 1st site, 2nd site, dark archive]
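
Keeping three copies on three different media is useful only if disagreements between copies are detected and resolved. A minimal sketch of one common approach, majority voting over checksums, is shown below. It is an illustration under stated assumptions, not the procedure used by the service: the copy names in the usage example ("disk", "tape", "object-store") are hypothetical, since the slides do not name the media.

```python
from collections import Counter


def copies_to_repair(digests: dict[str, str]) -> list[str]:
    """Return the names of copies whose digest disagrees with the majority.

    `digests` maps a copy name to the checksum currently computed for that copy.
    Raises ValueError when no digest is shared by more than half of the copies,
    because no copy can then be trusted automatically.
    """
    if not digests:
        raise ValueError("no copies to compare")
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(digests):
        raise ValueError("no majority digest; manual recovery needed")
    return sorted(name for name, digest in digests.items() if digest != majority_digest)
```

For example, `copies_to_repair({"disk": "aa", "tape": "bb", "object-store": "aa"})` names the tape copy as the one to rewrite from a healthy copy. With fewer than three copies a single corrupted copy can no longer be outvoted, which is one way to read the "Fewer copies?" question on the slide.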

  25. Research communities, archives, libraries, museums Single digital object preservation Preservation planning Management of copies Application of preservation method metadata Content: understanding, significance Preservation service format Storage hardware storage media Long-term use

  26. How? - Nationally unified structure

  27. Digital Preservation Solution • Specifications defined for preparing and creating unified Submission Information Packages (SIPs) with a redefined METS schema • A closed set of acceptable file formats • From the acceptable file formats, some are recommended formats, some acceptable for transfer
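
The two-tier format policy (a closed set of acceptable formats, split into recommended formats and formats acceptable only for transfer) can be sketched as a simple lookup. The MIME types below are hypothetical placeholders chosen for illustration; they are not the format lists of the actual specification.

```python
from enum import Enum


class FormatStatus(Enum):
    RECOMMENDED = "recommended"
    ACCEPTABLE_FOR_TRANSFER = "acceptable for transfer"
    REJECTED = "rejected"


# Hypothetical example lists; the real specification defines its own closed set.
RECOMMENDED_FORMATS = frozenset({"application/pdf", "image/tiff"})
TRANSFER_ONLY_FORMATS = frozenset({"application/msword", "image/gif"})


def classify(mime_type: str) -> FormatStatus:
    """Classify a MIME type against the closed set of acceptable formats."""
    normalized = mime_type.strip().lower()
    if normalized in RECOMMENDED_FORMATS:
        return FormatStatus.RECOMMENDED
    if normalized in TRANSFER_ONLY_FORMATS:
        return FormatStatus.ACCEPTABLE_FOR_TRANSFER
    return FormatStatus.REJECTED
```

Under this reading, a file classified as acceptable for transfer can be ingested but is a candidate for conversion to a recommended format, while anything outside the closed set is rejected at ingest.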

  28. Technical architecture • Consists of several layers • These layers are described in • Application architecture • Infrastructure architecture

  29. Ingest

  30. Ingest • We utilize 24 different open-source components • Format checks: 11 components (JHOVE1, JHOVE2, FITS, Epubcheck, Apache ODF Toolkit, Officetron, FLAC, Pngcheck, warc-tools, MS Office Binary File Format Validator, MP3val) • Missing parts done in-house • A LOT of integration work (technology watch, testing, report handling etc.)
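
Much of the integration work mentioned above amounts to routing each file to the right checker and collecting the results into one report. The sketch below illustrates that dispatch pattern only. It does not call any of the listed tools: the two checks are simple stand-ins that test the well-known PNG and FLAC file signatures, and the registry `CHECKS` and function `check_file` are hypothetical names.

```python
from pathlib import PurePosixPath
from typing import Callable

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
FLAC_SIGNATURE = b"fLaC"


def looks_like_png(data: bytes) -> bool:
    """Stand-in check: the data starts with the PNG file signature."""
    return data.startswith(PNG_SIGNATURE)


def looks_like_flac(data: bytes) -> bool:
    """Stand-in check: the data starts with the FLAC stream marker."""
    return data.startswith(FLAC_SIGNATURE)


# Hypothetical registry: file suffix -> checks to run for that format.
CHECKS: dict[str, list[Callable[[bytes], bool]]] = {
    ".png": [looks_like_png],
    ".flac": [looks_like_flac],
}


def check_file(name: str, data: bytes) -> dict:
    """Run every registered check for the file's suffix and build a report entry."""
    suffix = PurePosixPath(name).suffix.lower()
    checks = CHECKS.get(suffix)
    if checks is None:
        return {"file": name, "status": "unsupported", "failed": []}
    failed = [check.__name__ for check in checks if not check(data)]
    status = "rejected" if failed else "accepted"
    return {"file": name, "status": status, "failed": failed}
```

In a real pipeline each registry entry would wrap an external validator and parse its report, which is where the testing and report-handling effort described on the slide goes.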

  31. Ingest messages

  32. Hardware infrastructure • Separate database layer • 3 copies on 3 different media • Distributed storage: group of storage nodes linked together via TCP/IP, with storage software running on the operating system => no enterprise storage solution

  33. In practice The Finnish common digital preservation system: • Highlights the ownership and roles • Needs actions starting from the policy level and similarly from the operational level • Draws practices of partner organizations closer to each other • Reduces the costs and fragmented nature of the ICT systems • Intensifies cooperation However: • The common specification (profile) will most likely be updated several times in the future • Requires a lot of discussion and collaboration

  34. Actions on many levels Goal: the extensive use of publicly funded data in research Discovery Availability Usability Data life cycle The required activities Coordination Role division Attitude Collaboration Interoperability Resources The data infrastructure building blocks Political will and data policy Legislation renewal Development of practices • Clear rules • Removing the ambiguity • Easing the availability and usage of research data • Political alignment • Common goals • Principles and the division of responsibilities • Resource planning • Coordination enhancements • Data inventories • Terms of use • Rules for financing and research principles • The strengthening of skills • Interoperable systems • Common services • Thematic applications • Long-term preservation • Investments Main actors: Ministries, sponsors, data-producing and governing organizations, universities, research institutions, researchers; Ministries, government, state council; Ministries, Data Protection Commissioner, the Parliament; The Ministry of Culture and Education, research organizations, infrastructure actors

  35. Workload distribution Preservation of cultural heritage Preservation of scientific data

  36. Pilots • To understand the preparedness of partner organizations • Ten pilots in total: three libraries, five archives, two museums Source: On Preparedness of Memory Organizations for Ingesting Data. J. Lehtonen, H. Helin, K. Koivunen, K. Lehtonen, iPRES 2013.

  37. Overall packaging results A grade was given for each SIP in each validation step 0. The part is missing or does not follow the specifications 1. The part includes severe errors or a large number of minor mistakes 2. The part is flawless or includes only a few minor mistakes Source: On Preparedness of Memory Organizations for Ingesting Data. J. Lehtonen, H. Helin, K. Koivunen, K. Lehtonen, iPRES 2013.
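
The 0/1/2 grading scheme lends itself to a simple per-step tally across SIPs. The sketch below shows one way such results could be summarized. The SIP identifiers and step names in the usage example are hypothetical, and no actual pilot results are reproduced here.

```python
from collections import Counter

# Grade meanings as given on the slide.
GRADE_MEANING = {
    0: "missing or does not follow the specifications",
    1: "severe errors or a large number of minor mistakes",
    2: "flawless or only a few minor mistakes",
}


def summarize(grades: dict[str, dict[str, int]]) -> dict[str, Counter]:
    """Count how many SIPs received each grade in each validation step.

    `grades` maps a SIP identifier to a mapping of validation step -> grade.
    """
    summary: dict[str, Counter] = {}
    for sip_id, steps in grades.items():
        for step, grade in steps.items():
            if grade not in GRADE_MEANING:
                raise ValueError(f"{sip_id}/{step}: unknown grade {grade!r}")
            summary.setdefault(step, Counter())[grade] += 1
    return summary
```

For instance, `summarize({"sip-1": {"metadata": 2}, "sip-2": {"metadata": 0}})` reports one SIP at grade 2 and one at grade 0 for the hypothetical "metadata" step.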

  38. Metadata results Source: On Preparedness of Memory Organizations for Ingesting Data. J. Lehtonen, H. Helin, K. Koivunen, K. Lehtonen, iPRES 2013.

  39. File format results Source: On Preparedness of Memory Organizations for Ingesting Data. J. Lehtonen, H. Helin, K. Koivunen, K. Lehtonen, iPRES 2013.

  40. Never the same again "πάντα χωρεῖ καὶ οὐδὲν μένει" καὶ "δὶς ἐς τὸν αὐτὸν ποταμὸν οὐκ ἂν ἐμβαίης" (Panta chōrei kai ouden menei kai dis es ton auton potamon ouk an embaies) "Everything changes and nothing remains still ... and ... you cannot step twice into the same stream" (Heraclitus) Preservation is an open-ended problem • Automate as much as possible • Save time: thorough ingest process • Remember the learning curve • Accept life cycles: not everything has to be stored forever Source: Wikipedia PD image resources

  41. Conclusions • If you want your digital data to survive, start today! • Equally important: • Ingest • Ownership/stewardship • Preservation planning • Preservation solution • Clear definition of roles and organization • Collaboration! • Exit strategy Source: Wikipedia PD image resources

  42. Thank you! • http://www.kdk.fi/ • http://www.tdata.fi • Pirjo-leena.forsstrom@csc.fi
