
Katerina Kechris, PhD, Associate Professor, Biostatistics and Informatics

SAMSI 2014-2015 Program Beyond Bioinformatics: Statistical and Mathematical Challenges. Topic: Data Integration. Katerina Kechris, PhD, Associate Professor, Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver.


Presentation Transcript


  1. SAMSI 2014-2015 Program Beyond Bioinformatics: Statistical and Mathematical Challenges. Topic: Data Integration. Katerina Kechris, PhD, Associate Professor, Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver

  2. Omics • Large-scale analyses for studying a population of molecules or molecular mechanisms • High-throughput data • Examples • Genomics (entire genome – DNA) • Proteomics (study of protein repertoire) • Epigenomics (study of DNA and histone modifications)

  3. Omics [Figure labels: Epigenome, Phenome] Adapted from http://www.sciencebasedmedicine.org http://www.scientificpsychic.com/fitness/transcription.gif http://themedicalbiochemistrypage.org/images/hemoglobin.jpg http://upload.wikimedia.org/wikipedia/commons/c/c6/Clopidogrel_active_metabolite.png http://creatia2013.files.wordpress.com/2013/03/dna.gif

  4. Large-scale Projects & Databases NCI 60 Database

  5. Integration of Omics Data • Each type of data gives a different snapshot of the biological or disease system • Why integrate data? • Reduce false positives/negatives • Identify interactions between different molecules • Explore functional mechanisms

  6. Challenges • When to integrate? • Dimensionality • Resolution • Heterogeneity • Interactions and Pathways

  7. Challenge 1: When to integrate? • Early • Merging data to increase sample size • Intermediate • Convert different data sources into common format (e.g., ranks, correlation matrices), kernel-based analysis • Late • Meta-analysis (combine effect size or p-value), aggregate voting for classifiers, genomic enrichment and overlap of significant results

  8. Genomic Meta-analysis: Combining Multiple Transcriptomic Studies. Tseng Lab, U. of Pitt.
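
As an illustration of late integration by combining p-values (Challenge 1), the sketch below applies Fisher's method to per-study p-values for a single gene. It is one of several possible combination rules and is not necessarily what the work cited on this slide uses. The function name `fisher_combine` and the example p-values are hypothetical, and the studies are assumed to be independent.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(log p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(p_values).
    For even degrees of freedom the upper tail has a closed form:
        P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0   # (x/2)^j / j! for j = 0
    total = 1.0  # running sum of the series
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one gene from three independent expression studies.
study_p_values = [0.04, 0.10, 0.03]
print(fisher_combine(study_p_values))
```

With a single study the function returns the input p-value unchanged, which is a quick sanity check on the closed-form tail probability.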

  9. Assessing Genomic Overlap: Permutation-based Strategies. Bickel Lab, Berkeley & ENCODE. Ann. Appl. Stat. (2010) 4:4 1660-1697.
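
A much-simplified sketch of a permutation-style overlap test follows. It is not the method of the cited paper, which treats genomic dependence structure far more carefully. Here the genome is assumed to be cut into `n_bins` equal bins arranged on a circle, each feature set is a set of bin indices, and the null distribution comes from random circular shifts of one set (which keeps that set's internal spacing). All names and parameters are illustrative assumptions.

```python
import random


def overlap_permutation_test(bins_a, bins_b, n_bins, n_perm=1000, seed=0):
    """Test whether two sets of genomic bins overlap more than expected.

    Returns (observed_overlap, p_value). The null is generated by circularly
    shifting set A by a random non-zero offset and recounting shared bins.
    """
    rng = random.Random(seed)
    a, b = set(bins_a), set(bins_b)
    observed = len(a & b)
    hits = 0
    for _ in range(n_perm):
        offset = rng.randrange(1, n_bins)  # non-zero shift
        shifted = {(x + offset) % n_bins for x in a}
        if len(shifted & b) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical example: two feature sets concentrated in the same region.
binding_bins = range(100, 150)
peak_bins = range(120, 170)
print(overlap_permutation_test(binding_bins, peak_bins, n_bins=1000, n_perm=500))
```

Shifting rather than shuffling individual bins is a deliberate choice: it preserves clustering within a feature set, which a naive shuffle would destroy.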

  10. Challenge 2: Dimensionality • Most technologies produce tens to hundreds of thousands of measurements per sample • Exponential increase with 2+ data types • Dimension reduction • Process each data type separately (filtering) • Combine with model fitting • Multivariate analysis
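
One simple form of the per-data-type filtering mentioned above is to keep only the most variable features before integration. The sketch below assumes a samples-by-features matrix stored as nested lists; the function name, the choice of variance as the filter, and the toy data are all illustrative assumptions rather than a recommendation from the slides.

```python
from statistics import pvariance


def top_variance_features(matrix, feature_names, k):
    """Return the names of the k features with the largest variance.

    matrix: list of samples, each a list of values in feature_names order.
    """
    columns = list(zip(*matrix))  # transpose to one tuple per feature
    variances = {
        name: pvariance(col) for name, col in zip(feature_names, columns)
    }
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:k]


# Hypothetical expression matrix: 3 samples x 3 genes.
expression = [
    [1.0, 0.0, 5.0],
    [1.0, 10.0, 6.0],
    [1.0, 20.0, 7.0],
]
print(top_variance_features(expression, ["g1", "g2", "g3"], k=2))
```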

  11. Sparse Multivariate Methods • Variable Selection, Discriminant Analysis, Visualization • Penalties (or regularization) reduce the parameter space so that only a few entries are non-zero (sparsity) • Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS). Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash.; Tibshirani, Stanford. Stat Appl Genet Mol Biol. 2009 January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008; 7(1): Article 35
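
To show how a penalty produces sparsity, the sketch below runs alternating soft-thresholded power iterations on a cross-product matrix K (for example X^T Y between two data types), giving a pair of sparse weight vectors. This is a simplified illustration in the spirit of penalized matrix decomposition, not the published sparse CCA or sparse PLS algorithms: the thresholds `delta_u` and `delta_v` are fixed by hand here rather than tuned, and all names are assumptions.

```python
import math


def soft_threshold(vec, delta):
    """Shrink each entry toward zero by delta; small entries become zero."""
    return [math.copysign(max(abs(x) - delta, 0.0), x) for x in vec]


def normalize(vec):
    """Scale to unit Euclidean length (an all-zero vector is returned as is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sparse_rank1(K, delta_u, delta_v, n_iter=50):
    """Sparse rank-1 approximation of a p x q matrix K given as a list of rows.

    Returns (u, v): unit-length weight vectors in which entries whose
    contribution falls below the thresholds are exactly zero.
    """
    p, q = len(K), len(K[0])
    v = normalize([1.0] * q)
    u = [0.0] * p
    for _ in range(n_iter):
        Kv = [sum(K[i][j] * v[j] for j in range(q)) for i in range(p)]
        u = normalize(soft_threshold(Kv, delta_u))
        Ktu = [sum(K[i][j] * u[i] for i in range(p)) for j in range(q)]
        v = normalize(soft_threshold(Ktu, delta_v))
    return u, v


# Hypothetical cross-product matrix: features 1 and 3 of X relate to feature 1 of Y.
K = [[5.0, 0.1], [0.1, 0.2], [4.0, 0.1]]
u, v = sparse_rank1(K, delta_u=0.5, delta_v=0.5)
print(u, v)
```

Larger thresholds zero out more entries, which is how the penalty strength controls the number of selected variables.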

  12. Challenge 3: Genomic Resolution • Base level (conservation, motif scores) • Regular intervals (expression/binding from tiling arrays) • Irregular intervals • Gene/ncRNA level data (expression) • Individual positions (SNP, methylation sites)

  13. Challenge 4: Heterogeneity • Technology-specific sources of error • Different pre-processing, normalization • Different amounts of missing values • Data matching • Different identifiers • Not always one-to-one (microarrays) • Imputation
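
The data-matching problem above (different identifiers, mappings that are not one-to-one) can be sketched as collapsing probe-level values to gene-level values. The example assumes a hypothetical probe-to-gene lookup table, averages probes that map to the same gene, and drops unmapped probes; other summaries (median, most variable probe) are equally plausible choices, and probes that map to several genes are not handled.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Average probe-level values per gene.

    probe_values: {probe_id: value} for one sample.
    probe_to_gene: {probe_id: gene_id}; probes missing from it are dropped.
    """
    by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            by_gene[gene].append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}


# Hypothetical identifiers: two probes target geneA, one probe has no mapping.
probe_values = {"p1": 2.0, "p2": 4.0, "p3": 7.0, "p4": 1.0}
probe_to_gene = {"p1": "geneA", "p2": "geneA", "p3": "geneB"}
print(collapse_probes_to_genes(probe_values, probe_to_gene))
```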

  14. Challenge 4: Heterogeneity • Continuous • expression and binding data from microarrays, motif scores, protein/metabolite abundance • Counts • expression data from sequencing • Values between 0 and 1 • conservation (UCSC), DNA methylation • Binary/Categorical • Thresholding (e.g., motif scores), genotype

  15. Case Study: Development • Ci • important for differentiation of appendages during development • transcription factor – binds to DNA near target genes http://www.biology.ualberta.ca/locke.hp/research.htm http://howardhughes.trinity.duke.edu Kechris Lab, CU Denver

  16. Hierarchical Mixture Model • Data • Transcriptome: Ci pathway mutants (expr) – irregular interval • Genome: DNA binding data of Ci (bind) – regular interval, DNA conservation across 14 insect species (cons) – base level • Goal: Predict gene targets of Ci • Hidden variable is gene target – hierarchical mixture model. Dvorkin et al., 2013 (under review)

  17. Challenge 5: Interactions and Pathways • Known Pathways • Incorporate information in databases (curated but sparse) • e.g., KEGG pathways have metabolite – protein interactions (directed graphs) • De novo Pathways • Discover novel interactions

  18. Known Pathways [Diagram labels: gene, metabolite] Joint modeling of metabolite and transcript data to identify active pathways. Jornsten, Chalmers & Michailidis, U. Michigan. Biostatistics (2012) 13:4 748-761

  19. de novo Interactions • Single data • Integration • Pair-wise • Correlations (e.g., eQTL) • Bayesian networks • Multiple • Kernel-based methods • Probabilistic graphical models • Network analysis [Diagram labels: phenotype, methylation site, gene, SNP, protein, metabolite, gene]
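
The pair-wise correlation approach listed above (e.g., eQTL) can be sketched as a Pearson correlation between each SNP's genotype, coded as 0/1/2 copies of an allele, and each gene's expression across the same samples. Real eQTL analyses typically use regression with covariates and multiple-testing correction; the additive coding, function names, and toy data here are assumptions for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined without variation
    return sxy / math.sqrt(sxx * syy)


def pairwise_associations(genotypes, expression):
    """Correlate every SNP with every gene.

    genotypes: {snp_id: [0/1/2 per sample]}
    expression: {gene_id: [value per sample]}, samples in the same order.
    """
    return {
        (snp, gene): pearson(g, e)
        for snp, g in genotypes.items()
        for gene, e in expression.items()
    }


# Hypothetical data for six samples.
genotypes = {"snp1": [0, 1, 2, 0, 1, 2]}
expression = {
    "geneA": [1.0, 2.1, 2.9, 1.2, 1.9, 3.1],
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
}
for pair, r in pairwise_associations(genotypes, expression).items():
    print(pair, round(r, 3))
```

Testing every SNP against every gene scales as the product of the two feature counts, which ties back to the dimensionality challenge raised earlier in the deck.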

  20. de novo Interactions. Shojaie Lab, U. Washington. Biometrika (2010) 97 (3): 519-538.

  21. Summary Methodology • Meta-analysis • Permutation-based Methods • Sparse Multivariate Methods • Graphical Models • Network Analysis
