Developing Data Attribution and Citation Practices and Standards

Monica Duke m.duke@ukoln.ac.uk Project Manager, SageCite Project http://blogs.ukoln.ac.uk/sagecite/ #sagecite Developing Data Attribution and Citation Practices and Standards An International Symposium and Workshop August 22-23, 2011 UKOLN is supported by:

Citation in the domain of disease network modelling Funded: August 2010 – July 2011

SageCite project overview • Review of data citation (issues, technology) • Understanding the domain • Sage Bionetworks partners in project • Site visit • Documenting processes (workflow tools)

SageCite project overview • Demonstrator • Adding support for data citation • Using DataCite services • Working with publishers • Benefits analysis: KRDS Taxonomy
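As an illustration of what "adding support for data citation" could involve, the sketch below assembles a citation string from a small metadata record. It assumes the commonly recommended DataCite pattern "Creator (PublicationYear): Title. Publisher. Identifier". The record class, the function name, and the example values (including the DOI, which uses a test prefix) are hypothetical and are not taken from the SageCite demonstrator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical minimal record holding the fields needed for a citation."""
    creators: List[str]
    publication_year: int
    title: str
    publisher: str
    doi: str


def format_citation(record: DatasetRecord) -> str:
    """Build 'Creator (PublicationYear): Title. Publisher. Identifier'."""
    creators = "; ".join(record.creators)
    return (
        f"{creators} ({record.publication_year}): {record.title}. "
        f"{record.publisher}. doi:{record.doi}"
    )


# Placeholder values only; this is not a real dataset or a registered DOI.
example = DatasetRecord(
    creators=["Example, A.", "Sample, B."],
    publication_year=2011,
    title="Hypothetical curated expression dataset",
    publisher="Example Repository",
    doi="10.5072/example-1",
)
print(format_citation(example))
```

Keeping the record separate from the formatting function means the same metadata could be rendered in a different citation style without touching the stored fields.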

www.sagebase.org • US-based non-profit organisation • Creating a resource for community-based, data-intensive biological discovery • Community-based analysis is required to build accurate models

Slide by Lara Mangravite Sage Bionetworks

Sage data and processes • Idealised 7-stage process • A combination of phenotypic, genetic, and expression data is processed to determine a list of genes associated with diseases • Different people are responsible for different stages of the modelling process; one person oversees the whole process

Stage 1: Data Curation • Basic data validation to ensure integrity and completeness • Datasets include microarray data and clinical data • Ensures that the format of the data is understood and the required metadata is present
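The curation checks described on this slide can be sketched roughly as follows. This is a minimal illustration, not the Sage Bionetworks procedure: the dataset layout (a dict with "metadata", "columns" and "rows"), the list of required metadata fields, and the function name are all assumptions made for the example.

```python
# Hypothetical list of metadata fields a curator might require.
REQUIRED_METADATA = ("title", "creator", "data_type", "format")


def curation_problems(dataset: dict) -> list:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []

    # Required metadata must be present and non-empty.
    metadata = dataset.get("metadata", {})
    for field in REQUIRED_METADATA:
        if not metadata.get(field):
            problems.append(f"missing metadata: {field}")

    # Completeness: there must be data, and every row must match the columns.
    rows = dataset.get("rows", [])
    if not rows:
        problems.append("no data rows")
    expected = len(dataset.get("columns", []))
    for i, row in enumerate(rows):
        if len(row) != expected:
            problems.append(f"row {i}: expected {expected} values, found {len(row)}")
        elif any(value is None or value == "" for value in row):
            problems.append(f"row {i}: empty value")

    return problems


# Placeholder dataset for illustration only.
dataset = {
    "metadata": {"title": "Example clinical table", "creator": "Example Lab",
                 "data_type": "clinical", "format": ""},
    "columns": ["sample_id", "age"],
    "rows": [["S1", 54], ["S2"], ["S3", None]],
}
for problem in curation_problems(dataset):
    print(problem)
```

Returning a list of problems rather than raising on the first failure lets a curator see everything that needs fixing in one pass.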

Agreeing standards to support sharing • Derry J et al. Developing Predictive Molecular Maps of Human Disease through Community-based Modeling. • http://precedings.nature.com/documents/5883/version/1/files/npre20115883-1.pdf

Workflow capture using Taverna http://www.vimeo.com/27287109 Documenting data processes through workflow tools • supports better citation • makes the cited resource more re-usable • strengthens the reproducibility and validation of the research

Data Citation Purposes • For attribution • Leading to credit and reward • For reproducibility • Supports validation, re-use • Eric Schadt at Sage Bionetworks Congress 2011 • http://fora.tv/2011/04/16/Eric_Schadt_Map_Building (start at 4.28)

Open challenges: attribution • Preserving link with original data • Some discipline-based repositories have their own identifiers • Bi-directional links • Attributing data creators • including individuals? • Defining creation of new intellectual object e.g. curated dataset? • Cultural challenge in recognising non-standard contributions; microattribution • New metrics • Identification of contributors

Open challenges: reproducibility • Identification and granularity • Discipline identifiers, global identifiers • How much value has been added since the data entered the workflow? • Identifying processes and software

Acknowledgements • UKOLN • Liz Lyon • Monica Duke • Nature Genetics • Myles Axton • PLoS Comp Bio • Phil Bourne • University of Manchester • Carole Goble • Peter Li • British Library • Max Wilkinson • Tom Pollard • Sage Bionetworks
