
Active Data Biology

Samuel Payne, @OmicsPNNL, Pacific Northwest National Laboratory





Presentation Transcript


  1. Active Data Biology. Samuel Payne, @OmicsPNNL, Pacific Northwest National Laboratory

  2. Integrative Omics

  3. Biology Repositories
     • Purpose: Hold data files and disseminate to public
     • Mass Spectrometry Data
     • Proteomics, Metabolomics
     • PRIDE-EBI: 200 TB
     • Sequencing Data
     • Genomics, Transcriptomics, etc.
     • SRA – 10^15 bases (through 2013). Discontinued because of space concerns
     • Imaging Data

  4. Data Growth - EBI

  5. Experience Sharing Data
     • ~500,000 mass spectrometry files
     • ~20-50x associated files
     • All data from 2000-2016
     • 350 TB
     • Personal Web-server
     • List data with publications
     • Send upon request
     • Biodiversity Library
     • Shared through ProteomeXchange
     • 13 TB (zipped). Data from 112 bacteria and archaea
     • 6 months of data shepherding to get transferred – 4 individuals
     • 70% of file downloads from public repository
     • Represents ~5% of our data

  6. Overcoming Big Data (diagram labels: Raw Data, Identification, Hypotheses, Browse & Share; steps 1, 2, 3)

  7. Compliance is a losing venture
     • Sharing for compliance
     • Incomplete data
     • Incomplete meta-data
     • Low emotional investment
     • Sharing for collaboration
     • Invested in cooperation
     • Contains necessary and sufficient information
     • Better potential for reuse in general dissemination

  8. Collaborative Infrastructure
     • Version Control Systems
     • Allow asynchronous work
     • ‘Track changes’ and save all provenance
     • GitHub, Bitbucket, SVN, etc.

  9. Active Data Biology
     • GitHub tracks
     • Data
     • Code
     • Insight
     • Collaboration

  10. Total Transparency

  11. Acknowledgements
      • Joon-Yong Lee
      • Ryan Wilson, Gary Kiebel, Grant Fujimoto
      • Funding:
      • PNNL’s Laboratory Directed R&D funds
      • US Dept of Energy, Early Career Award
