The SageCite project aims to develop practices and standards for data attribution and citation in the domain of disease network modelling. It includes a review of data citation issues and technology, understanding the domain, documenting processes, and building a demonstrator with partners. SageCite works with Sage Bionetworks, a US-based non-profit organisation focused on community-based, data-intensive biological discovery.
Monica Duke (m.duke@ukoln.ac.uk), Project Manager, SageCite Project • http://blogs.ukoln.ac.uk/sagecite/ • #sagecite • Developing Data Attribution and Citation Practices and Standards: An International Symposium and Workshop, August 22-23, 2011
Citation in the domain of disease network modelling Funded: August 2010 – July 2011
SageCite project overview • Review of data citation (issues, technology) • Understanding the domain • Sage Bionetworks partners in project • Site visit • Documenting processes (workflow tools)
SageCite project overview • Demonstrator • Adding support for data citation • Using DataCite services • Working with publishers • Benefits analysis: KRDS Taxonomy
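As a rough illustration of what "adding support for data citation" could produce, the sketch below assembles a citation string in the general pattern DataCite has recommended (Creator (PublicationYear): Title. Publisher. Identifier). The function name, the example dataset, and the DOI are all hypothetical and are not taken from the SageCite demonstrator.

```python
def format_data_citation(creators, year, title, publisher, doi):
    """Build a citation string: Creator (Year): Title. Publisher. Identifier."""
    # Multiple creators are joined with semicolons (an assumption for this sketch).
    creator_text = "; ".join(creators)
    return f"{creator_text} ({year}): {title}. {publisher}. https://doi.org/{doi}"


# Hypothetical dataset and made-up DOI, for illustration only.
citation = format_data_citation(
    creators=["Example A", "Example B"],
    year=2011,
    title="Curated expression dataset",
    publisher="Example Repository",
    doi="10.1234/example",
)
print(citation)
```

Keeping the identifier as a resolvable link is one way to preserve the connection between a citation and the data it refers to.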
www.sagebase.org • US-based non-profit organisation • Creating a resource for community-based, data-intensive biological discovery • Community-based analysis is required to build accurate models
Sage data and processes • Idealised 7-stage process • A combination of phenotypic, genetic, and expression data is processed to determine a list of genes associated with diseases • Different people are responsible for different stages of the modelling process, and one person oversees the whole process.
Stage 1: Data Curation • Basic data validation to ensure integrity and completeness • Datasets include microarray data and clinical data • Ensures that the format of the data is understood and the required metadata is present
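The Stage 1 checks could be sketched along the following lines. This is a minimal illustration under stated assumptions, not Sage Bionetworks' actual curation code: the field names in `REQUIRED_METADATA`, the function name, and the row layout are all invented for the example.

```python
# Hypothetical required metadata fields; the slides do not list the real ones.
REQUIRED_METADATA = ("dataset_id", "data_type", "format", "creator")


def curation_problems(metadata, rows, expected_columns):
    """Return a list of human-readable problems found in a dataset."""
    problems = []
    # Required metadata must be present and non-empty.
    for field in REQUIRED_METADATA:
        if not metadata.get(field):
            problems.append(f"missing metadata: {field}")
    # Completeness: every row needs a value for every expected column.
    for index, row in enumerate(rows):
        for column in expected_columns:
            if row.get(column) in (None, ""):
                problems.append(f"row {index}: missing value for {column}")
    return problems


# Made-up clinical records, for illustration only.
metadata = {"dataset_id": "d1", "data_type": "clinical", "format": "csv", "creator": ""}
rows = [{"sample": "s1", "age": 54}, {"sample": "s2", "age": None}]
problems = curation_problems(metadata, rows, ["sample", "age"])
```

Returning a list of problems, rather than stopping at the first one, lets a curator see everything that needs fixing in one pass.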
Agreeing standards to support sharing • Derry J et al. Developing Predictive Molecular Maps of Human Disease through Community-based Modeling. • http://precedings.nature.com/documents/5883/version/1/files/npre20115883-1.pdf
Workflow capture using Taverna http://www.vimeo.com/27287109 Documenting data processes through workflow tools • supports better citation • makes the cited resource more re-usable • strengthens the reproducibility and validation of the research
Data Citation Purposes • For attribution • Leading to credit and reward • For reproducibility • Supports validation, re-use • Eric Schadt at Sage Bionetworks Congress 2011 • http://fora.tv/2011/04/16/Eric_Schadt_Map_Building (start at 4.28)
Open challenges: attribution • Preserving link with original data • Some discipline-based repositories have their own identifiers • Bi-directional links • Attributing data creators • including individuals? • Defining creation of new intellectual object e.g. curated dataset? • Cultural challenge in recognising non-standard contributions; microattribution • New metrics • Identification of contributors
Open challenges: reproducibility • Identification and granularity • Discipline identifiers, global identifiers • How much value has been added since the data entered the workflow? • Identifying processes and software
Acknowledgements • UKOLN • Liz Lyon • Monica Duke • Nature Genetics • Myles Axton • PLoS Comp Bio • Phil Bourne • University of Manchester • Carole Goble • Peter Li • British Library • Max Wilkinson • Tom Pollard • Sage Bionetworks