Curation: making data suitable for re-use

Curation: making data suitable for re-use. Chris Rusbridge Presentation at FIBS Seminar. Contents. Science and digital curation What to do with your data: frontiers of practice Repository frontiers. Digital Curation Centre Mission.

Presentation Transcript


  1. Curation: making data suitable for re-use Chris Rusbridge Presentation at FIBS Seminar

  2. Contents • Science and digital curation • What to do with your data: frontiers of practice • Repository frontiers

  3. Digital Curation Centre Mission “The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”

  4. [Figure: 2MASS (Infrared) and SDSS (Visual)] Slide from Rajendra Bose

  5. Slide from Rajendra Bose

  6. New discovery… • National Virtual Observatory • Johns Hopkins press release: “Scientists working to create the NVO, an online portal for astronomical research unifying dozens of large astronomical databases, confirmed discovery of [a] new brown dwarf recently. The star emerged from a computerized search of information on millions of astronomical objects in two separate astronomical databases. Thanks to an NVO prototype, that search, formerly an endeavor requiring weeks or months of human attention, took approximately two minutes.”

  7. Curation • Data increasingly important as evidence • Key part of the scholarly record • Experimental verifiability (the basis of science) • Allows additional interpretations • Unrepeatable observations & experiments (particularly environmental in broadest sense) • Legal, compliance & transactions • Cultural resources

  8. What kinds of data?
      • Observations
        • e.g. UARS (Upper Atmosphere) Level 0: telemetry
        • UARS Level 1: measured physical parameters (post calibration?)
      • Derived data
        • UARS Level 2: calculated geophysical? profiles
        • UARS Level 3: gridded, interpolated?
      • Combined data
      • Crafted data
        • e.g. annotated gene/protein databases
      • Descriptive (meta)data

  9. What to do with it? • Keep as part of experiment • Deposit in institutional or discipline repository • Possible time-limited embargoes • Cite it • “Publish” in support of articles

  10. Internet Archaeology: publication with data (sadly, a preservation nightmare!)

  11. What are the reusability issues? • Data not neutral to hypothesis • Hard to know the risks & pitfalls of a particular dataset • Data not self-describing: hard to find appropriate data • Hard to “understand” data once found • Hard to use data once understood

  12. What to do about it?
      • Build curation/reusability into your workflow
        • Curation begins before creation
        • What’s easy at first becomes (impossibly) hard later
      • Describe your data (metadata)
        • Keep experimental parameters (technical, who, what, when, where etc)
        • Keep data descriptions (schemas, “representation information”, etc)
      • Keep data!
      • Use standard/agreed formats for data
      • Make ownership & restrictions clear
      • Explain how to cite your data
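The advice on this slide (describe the data, keep the experimental parameters, state ownership and how to cite) can be illustrated with a small descriptive record written at the moment the data is created. The following is only a minimal sketch and is not from the presentation: the function name `describe_dataset`, the field names, and the sample values are all hypothetical, and a real project would more likely follow a metadata standard agreed in its discipline.

```python
import hashlib
import json
from datetime import datetime, timezone


def describe_dataset(data: bytes, *, creator: str, instrument: str,
                     data_format: str, licence: str, citation: str) -> dict:
    """Build a small descriptive record to store next to a data file.

    All field names here are illustrative assumptions, not a standard.
    """
    return {
        "creator": creator,                                     # who
        "instrument": instrument,                               # what / how
        "recorded_at": datetime.now(timezone.utc).isoformat(),  # when
        "format": data_format,                                  # agreed format
        "licence": licence,                                     # ownership & restrictions
        "how_to_cite": citation,                                # citation guidance
        "sha256": hashlib.sha256(data).hexdigest(),             # checksum for later integrity checks
        "size_bytes": len(data),
    }


# Hypothetical sample data and values, for illustration only.
sample = b"time,temp_c\n0,21.4\n1,21.6\n"
record = describe_dataset(
    sample,
    creator="A. Researcher",
    instrument="hypothetical thermocouple logger",
    data_format="text/csv",
    licence="CC BY 4.0",
    citation="A. Researcher, example temperature series, version 1",
)
print(json.dumps(record, indent=2))
```

Writing such a record when the data is captured is cheap; reconstructing the same facts years later is the "(impossibly) hard" case the slide warns about.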

  13. Data resource stages • Curated data is created… • Observations? Fixed! • Or Acquired… • Data brought/bought from outside • Ingest • Development • Derived, refined, combined, processed data • Potentially many stages

  14. Context • Data meaningless without context • Linkage • Metadata of many kinds • Workflow! • Provenance • Authenticity • Computational lineage

  15. [Diagram: University research group 1, University research group 2, NASA research group 3, local decision-making body] Slide from Rajendra Bose

  16. Access and re-use • Ethics and rights control access • Weak in expressing this long-term • Collaboration tools • Annotation, discussion, review • Re-use leading to change and development • “Publication” • Not just in “print” • Underlying data should be “published”, too • Citation…

  17. Citation needs…
      • An efficient way to reference and access “archived” past states of a changing dataset (work in progress, Buneman et al.)
      • Not important for original observations
        • Don’t mess with those data
      • Less important for incremental datasets
        • Later stuff should not invalidate earlier
      • Very important for revisable datasets
        • e.g. Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change
        • e.g. Mapping… OS maps represent a huge database that changes on a daily basis
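To make the citation requirement concrete, here is a deliberately naive sketch of a revisable dataset whose past states remain citable. It keeps a full copy of every committed state, which is not efficient and is not the technique from the work in progress cited on the slide; the class `VersionedDataset`, the citation string format, and the sample annotations are all assumptions made for illustration.

```python
import hashlib
import json


class VersionedDataset:
    """Keeps every committed state so that an old citation still resolves."""

    def __init__(self, name: str):
        self.name = name
        self._snapshots = []  # list of (digest, frozen JSON text), oldest first

    def commit(self, records: dict) -> str:
        """Store a snapshot and return a citable version identifier."""
        frozen = json.dumps(records, sort_keys=True)
        digest = hashlib.sha256(frozen.encode("utf-8")).hexdigest()[:12]
        self._snapshots.append((digest, frozen))
        return f"{self.name}@v{len(self._snapshots)}#{digest}"

    def resolve(self, citation: str) -> dict:
        """Return the exact state that a citation refers to."""
        version_part, digest = citation.rsplit("#", 1)
        version = int(version_part.rsplit("@v", 1)[1])
        stored_digest, frozen = self._snapshots[version - 1]
        if stored_digest != digest:
            raise ValueError("citation does not match the stored snapshot")
        return json.loads(frozen)


# Hypothetical curated annotations that are revised over time.
genes = VersionedDataset("example-annotations")
first = genes.commit({"geneA": "putative kinase"})
second = genes.commit({"geneA": "confirmed kinase", "geneB": "unknown"})

print(first)
print(genes.resolve(first))   # the earlier annotation, unchanged by the revision
print(genes.resolve(second))
```

A paper that cited the first identifier still retrieves the annotation as it stood at the time, even though the curators later revised it. Storing whole copies would not scale to a database that changes daily, which is why the slide asks for an efficient mechanism.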

  18. Who are the curation players?

  19. Curation: Individual • “Small science 2-3 times more data than Big science”, but much more at risk • PhD student? RA? PI? Administrator? IT support? • Data potentially on local hard drives, or at best shared network drives • May be inadequately protected • Liable for policy-led deletion on resignation • Individual “knows” too much • Documentation/metadata unlikely to be adequate • Future: gone!

  20. Department: eCrystals • Partnership with Institutional Repository • Specialist department archive (& national service) • Workflow recording of lab parameters (R4L) • Public & private elements • Trying to build eCrystals federation (eBank 3) • Future: likely to continue

  21. Institution: Cambridge Chemistry • 175,000 small molecule structures in CML • Alongside Archaeology, Manuscripts, Learning Materials, etc • No library curation skills; dependent on research group enthusiast • Collection isolated from other Chemistry • (Only 5 UK institutional repositories claim to hold data) • Future: assured…

  22. Community: LOCKSS? • Self-selected group of collectors: closest to genuine open activity (despite Alliance)? • Traditionally libraries collecting eJournals • Model respects IPR • No domain expertise; rely on origins • Data limitations… • Future: potentially very persistent (low cost, high reliability, attack resistance, distributed)

  23. Discipline: Atmospheric Science • Strong believer in need for domain scientists as curators • Significant participant in “community proxy” agenda-setting activities • Internationally fragmented resources • Future: mostly dependent on grant funding (but strong commitment)

  24. Discipline: Pharmacology • International Scientific Union • Attempting to build credit for data contributions • Future: extremely limited funding

  25. Discipline: Bio/Health • UK PubMedCentral! • (you heard about this earlier)

  26. Issues: Nature article, 23 June 2005 • Databases in Peril • 51 out of 89 biological databases contacted reported they were struggling financially • 7 have closed • Several being updated in owner’s spare time • (Notes that not all deserve long term support) • [Nucleic Acids Research reports 858 databases in 2006!] • Major issue: money

  27. Publisher: Crystallography • Publisher and Scientific Union • Created key domain crystallographic standard (CIF) • Strong motivator for deposit of structure data • Consistent quality checks • DOIs used for structure data • Future: publishing business model Slide from IUCr

  28. National bodies: British Library • Serious and robust approach • Legal deposit powers & responsibilities as driver • Oriented primarily towards “cultural heritage” (broadly interpreted) • Little data, no science domain experience • Future: strong future commitment

  29. National bodies: TNA/NDAD • Specialist archive for government datasets • Understand government regulations, dynamics & requirements • Subject generalists; disconnected from associated science • Technology specialists (understand databases) • Future: likely to pass eventually to The National Archives

  30. 3rd parties: Portico • Specific area: eJournals • Depends on publisher agreements • No data or domain science expertise • Future: commitment from Mellon + publishers + subscriptions, good funding mix

  31. 3rd Parties: Iron Mountain? • Records management IS a curation problem • Organisations like this very likely to branch out • No domain science expertise • Future: business case, viability, stock market…

  32. Institutions & the network • Institutions have fundamental sustainability • Disciplines have domain knowledge advantage but sustainability is an issue • Can we get the best of both?

  33. Intersections…

  34. Who are the curation players again?

  35. BEWARE WEB 2.0!!!

  36. Thank you
