
Machine Learning for Functional Genomics I

Matt Hibbs, http://cbfg.jax.org



  1. Machine Learning for Functional Genomics I. Matt Hibbs, http://cbfg.jax.org

  2. Central Dogma. [Diagram labels: DNA, Gene Expression, Proteins, Phenotypes]

  3. Functional Genomics: identify the roles played by genes/proteins (Sealfon et al., 2006).

  4. Gene Expression Microarrays: simultaneous measurements of mRNA abundance levels for every gene in a genome. [Figure: data matrix with genes on one axis and conditions on the other]

  5. Gene Expression Microarrays: simultaneous measurements of mRNA abundance levels for every gene in a genome, in thousands of conditions. There is rich functional information in these data, but how can we utilize the entire compendia?

  6. Biological Data Explosion: huge repositories of biological data are not directly translating into knowledge.
     [Plot: publicly available microarrays in GEO, # of measurements by year]
     [Plot: mouse genes with known process association, # of genes by year]

  7. Why is there a Data-Knowledge Gap?
     • Many datasets are analyzed only once
     • Initial publication looks for hypothesis
     • Need standards for naming, formats, collection
     • Data should be aggregated and integrated
     • Modestly significant clues seen repeatedly can become convincing
     • “a preponderance of circumstantial evidence”
     • Scale of this problem overwhelms traditional biology

  8. Scalable Artificial Intelligence. Computer science is really a study in scalability. Use machine learning and data mining techniques to quickly identify important patterns.

  9. Amazon Recommendations

  10. Amazon Recommendations
      • Compare your purchase history to all other customers
      • Find commonalities between profiles
      • Predict potential purchases
      [Diagram labels: Purchase History, Item Rankings; Machine Learning (Bayesian networks); Recommendations; Observe Browsing Patterns and Account Activity]

  11. Gene Function Prediction. [Diagram: the recommendation pipeline set beside its genomics counterpart. Purchase History and Item Rankings ≈ Genome Scale Data and MGI Annotations; Observe Browsing Patterns and Account Activity ≈ Laboratory Experiments; Recommendations ≈ Predictions; both pipelines use Machine Learning (Bayesian networks)]

  12. Challenges for AI from Biology
      • Input data is noisy, heterogeneous, constantly evolving
      • Current knowledge is incomplete and biased
      • Can be difficult to determine accuracy

  13. Promise of Computational Functional Genomics. [Diagram labels: Data & Existing Knowledge, Computational Approaches, Predictions, Laboratory Experiments]

  14. Reality of Computational Functional Genomics. [Diagram labels: Data & Existing Knowledge, Computational Approaches, Predictions, Laboratory Experiments]

  15. Computational Solutions
      • Machine learning & data mining
      • Use existing data to make new predictions: similarity search algorithms, Bayesian networks, support vector machines, etc.
      • Validate predictions with follow-up lab work
      • Visualization & exploratory analysis
      • Seeing and interacting with data important
      • Show data so that questions can be answered
      • Scalability, incorporate statistics, etc.

  17. Similarity Search Approach
      • Re-frame analysis as exploratory search
      [Diagram labels: Query Genes, Data Collection; Search Algorithm (SPELL); Relevant Datasets, Related Genes]

  18. Key Insights ($X = U \Sigma V^{T}$)
      • Context-Sensitive Search Process
      • Signal Balancing
      • Correlation Comparability

  20. Dataset relevance weighting. Query genes: Q = {YQG1, YQG2, YQG3}. Calculate a correlation measure among the query genes for each dataset; this is each dataset's weight. [Figure: four datasets with example weights 0.15, 0.82, 0.05, 0.55]
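The weighting step on this slide can be sketched in a few lines. This is a minimal illustration, not the published SPELL implementation: it assumes the "correlation measure" is the mean pairwise Pearson correlation among the query genes, floored at zero, and the names `pearson` and `dataset_weight` and the toy datasets are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def dataset_weight(dataset, query):
    """Weight one dataset by how coherently the query genes behave in it.

    dataset: dict mapping gene name -> list of expression values.
    Assumption: weight = mean pairwise Pearson correlation among the
    query genes present in the dataset, floored at zero.
    """
    present = [g for g in query if g in dataset]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    mean_r = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(0.0, mean_r)


# Hypothetical toy data: the query genes move together in one dataset only.
query = ["YQG1", "YQG2", "YQG3"]
coherent = {"YQG1": [1, 2, 3, 4], "YQG2": [2, 4, 6, 8], "YQG3": [1, 3, 5, 7]}
incoherent = {"YQG1": [1, 2, 3, 4], "YQG2": [4, 3, 2, 1], "YQG3": [1, -1, 1, -1]}
weights = [dataset_weight(d, query) for d in (coherent, incoherent)]
```

A dataset in which the query genes are co-expressed receives a high weight, while one in which they are unrelated contributes little or nothing to the later search.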

  21. Identify Novel Partners. Query genes: Q = {YQG1, YQG2, YQG3}. Calculate a weighted distance score for all other genes to the query set. [Figure: datasets weighted 0.15, 0.82, 0.05, 0.55; candidate genes geneA, geneB, geneC]

  22. Identify Novel Partners. Query genes: Q = {YQG1, YQG2, YQG3}. Calculate a weighted distance score for all other genes to the query set. [Figure: candidates ordered from best score to worst score: geneB, geneC, geneA]
      + Takes advantage of functional diversity
      + Addresses statistical concerns
      + Fast running times [O(GDQ^2)] (ms per query)
      + Top results are candidates for investigation
      + Search process is iterative to refine results
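The scoring step can be sketched as follows. This is an illustrative reading of the slide rather than the SPELL code itself: it assumes each candidate's score is the weight-averaged mean Pearson correlation to the query genes across datasets, and the function names and toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_genes(datasets, weights, query):
    """Rank every non-query gene by its weighted similarity to the query.

    datasets: list of dicts mapping gene name -> expression profile.
    weights:  one relevance weight per dataset (computed beforehand).
    Returns (gene, score) pairs, best score first.
    """
    query = list(query)
    candidates = set()
    for data in datasets:
        candidates.update(data)
    candidates -= set(query)

    scores = {}
    for gene in candidates:
        weighted_sum, weight_total = 0.0, 0.0
        for data, w in zip(datasets, weights):
            partners = [q for q in query if q in data]
            if gene not in data or not partners:
                continue  # this dataset says nothing about the gene
            mean_r = sum(pearson(data[gene], data[q]) for q in partners) / len(partners)
            weighted_sum += w * mean_r
            weight_total += w
        scores[gene] = weighted_sum / weight_total if weight_total > 0 else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: geneB follows the query in the heavily weighted dataset.
d1 = {"YQG1": [1, 2, 3, 4], "YQG2": [2, 4, 6, 9],
      "geneA": [4, 3, 2, 1], "geneB": [1, 2, 3, 5]}
d2 = {"YQG1": [1, 2, 1, 2], "YQG2": [2, 1, 2, 1],
      "geneA": [1, 2, 1, 2], "geneB": [2, 1, 2, 1]}
ranking = rank_genes([d1, d2], [0.82, 0.05], ["YQG1", "YQG2"])
```

The top of the returned list corresponds to the "best score" end of the slide's figure; feeding promising hits back into the query gives the iterative refinement the slide mentions.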

  23. Key Insights ($X = U \Sigma V^{T}$)
      • Context-Sensitive Search Process
      • Signal Balancing
      • Correlation Comparability

  24. Signal Balancing Data - SVD
      • Singular Value Decomposition (SVD)
      • Projects data into another orthonormal basis
      • Correlations in U (rather than X) often contain better biological signals

  25. Signal Balancing. [Figure: SVD, signal balancing]

  26. Signal Balancing
      • Use correlations among left singular vectors
      • Downweights dominant patterns, amplifies subtle patterns
      • Top eigengenes dominate data
      • Sometimes correspond to systematic bias
      • Often correspond to common biological processes
      • e.g., ribosome biogenesis, etc.
      • Accuracy of signal balancing improved over re-projection
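Why working in U rebalances the signal can be seen from the decomposition itself. The following is a sketch under stated assumptions: X is one dataset's genes × conditions matrix with a thin SVD, the centering and scaling that Pearson correlation adds are ignored, and the indices i, j (genes), k (patterns) and the singular values σ_k are notation introduced here.

```latex
X = U \Sigma V^{T}, \qquad
(X X^{T})_{ij} = \sum_{k} \sigma_{k}^{2}\, U_{ik} U_{jk}, \qquad
(U U^{T})_{ij} = \sum_{k} U_{ik} U_{jk}
```

In gene-gene similarity computed on X, pattern k contributes in proportion to σ_k², so the top few eigengenes dominate. Computed on U, every retained pattern contributes with equal weight, which is the downweighting of dominant patterns and amplification of subtle ones described on the slide.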

  27. Key Insights ($X = U \Sigma V^{T}$)
      • Context-Sensitive Search Process
      • Signal Balancing
      • Correlation Comparability

  28. Between-dataset normalization
      • Commonly used Pearson correlation yields greatly different distributions of correlation
      • These differences complicate comparisons
      [Figure: histograms of Pearson correlations between all pairs of genes; DeRisi et al., 97 and Primig et al., 00]

  29. Between-dataset normalization
      • Fisher Z-transform, Z-score equalizes distributions
      • Increases comparability between datasets
      [Figure: histograms of Z-scores between all pairs of genes]
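The normalization named on this slide can be sketched with the standard library alone. The Fisher transform is z = atanh(r); the clipping constant `eps` and the function names are assumptions made for this illustration, and the example correlations are invented.

```python
from math import atanh
from statistics import mean, pstdev


def fisher_z(r, eps=1e-6):
    """Fisher Z-transform of a Pearson correlation, z = atanh(r).

    r is clipped just inside (-1, 1) so perfect correlations stay finite.
    """
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return atanh(r)


def normalize_correlations(correlations):
    """Map one dataset's correlations to Z-scores (mean 0, std. dev. 1)."""
    z = [fisher_z(r) for r in correlations]
    mu, sigma = mean(z), pstdev(z)
    if sigma == 0:
        return [0.0] * len(z)
    return [(v - mu) / sigma for v in z]


# Hypothetical pairwise correlations from two datasets with different spreads.
narrow = [0.05, 0.10, -0.02, 0.08, 0.01]
broad = [0.90, 0.75, 0.95, 0.60, 0.85]
narrow_z = normalize_correlations(narrow)
broad_z = normalize_correlations(broad)
```

After this step both datasets share the same scale, so a Z-score of 2 means "unusually high for this dataset" regardless of how wide the raw correlation histogram was.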

  30. SPELL Algorithm Overview. Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics, 2007.

  31. Web Interface http://spell.princeton.edu

  32. Evaluation of Performance
      • Leave-k-in cross validation / bootstrapping
      • Results averaged across 125 diverse GO biological process terms (defined in the GRIFn system, Myers et al., 2006)
      • Many predictions also verified through experimental validations in other studies
      • Hibbs et al., Bioinf, 2007
      • Hess et al., PLoS Gen, 2009
      • Hibbs*, Myers*, Huttenhower*, et al., PLoS Comp Biol, 2009

  33. Search Accuracy
      • Perform “leave-k-in” cross-validation
      [Diagram labels: Genes with common function; For all pairs; Order Genome; Rank Average; Master List]
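One plausible reading of the diagram is: take the genes annotated to a common function, use every k-subset of them as a query, order the genome for each query, and average the ranks into a single master list. The sketch below follows that reading. The `search` callback, the function name, and the toy genome are hypothetical, and the original procedure may differ in its details.

```python
from itertools import combinations


def leave_k_in_master_list(genome, gene_set, search, k=2):
    """Build a rank-averaged master list from all k-gene queries.

    genome:   iterable of all gene names.
    gene_set: genes sharing a known function.
    search:   callable taking a tuple of query genes and returning the
              remaining genes ordered from best to worst match
              (hypothetical stand-in for a search algorithm).
    """
    rank_sums = {g: 0.0 for g in genome}
    counts = {g: 0 for g in genome}
    for query in combinations(sorted(gene_set), k):
        for rank, gene in enumerate(search(query), start=1):
            rank_sums[gene] += rank
            counts[gene] += 1
    # Genes are only ranked when they are held out of the query.
    averaged = [(rank_sums[g] / counts[g], g) for g in rank_sums if counts[g]]
    return [g for _, g in sorted(averaged)]


# Toy search: a fixed genome order with the query genes removed.
toy_genome = ["a", "b", "c", "x", "y"]


def toy_search(query):
    return [g for g in toy_genome if g not in query]


master = leave_k_in_master_list(toy_genome, {"a", "b", "c"}, toy_search, k=2)
```

Held-out members of the gene set that land near the top of the master list indicate that the search recovers known biology from only k example genes.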

  34. Search Accuracy
      • Precision-Recall Curve, computed from the master list
      • Precision = TP / (TP + FP)
      • Recall = TP / (TP + FN)
      [Plot: precision (0 to 1) against recall (0 to 1)]
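The two formulas translate directly into a walk down a ranked list. This is a minimal sketch in which the function name and the toy gene names are hypothetical; each list position is treated as a cutoff, with everything above it counted as predicted positive.

```python
def precision_recall_curve(ranked_genes, positives):
    """Return (recall, precision) at every cutoff of a ranked gene list.

    ranked_genes: genes ordered from best to worst score.
    positives:    genes known to share the function being evaluated.
    """
    positives = set(positives)
    curve = []
    tp = 0
    for cutoff, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            tp += 1
        fp = cutoff - tp          # ranked above the cutoff but not positive
        fn = len(positives) - tp  # positives not yet recovered
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if positives else 0.0
        curve.append((recall, precision))
    return curve


# Hypothetical master list with two known positives.
curve = precision_recall_curve(["g1", "g2", "g3", "g4"], {"g1", "g3"})
```

Plotting precision against recall for these points gives the curve shown on the slide; a method that ranks all positives first stays at precision 1 until recall reaches 1.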

  35. Accuracy of Context-Sensitive Search

  36. Sample & Query Size Effects
      • Even relatively small sample sizes produce similar results (1000 samples used for all other tests)
      • Significant performance gain between 2 and 3 query genes, little change beyond (5 query genes used for all other tests)

  37. Effect of Signal Balancing
      • Improvement is robust to missing value imputation method
      • Signal balancing further improves context-specific search performance

  38. Effects of Signal Balancing. [Figure legend: signal balanced, n% re-projection, n% balanced]

  39. Effects of Signal Balancing. [Figure legend: n% re-projection, n% balanced]

  40. Specific Performance

  41. Computational Solutions
      • Machine learning & data mining
      • Use existing data to make new predictions: similarity search algorithms, Bayesian networks, support vector machines, etc.
      • Validate predictions with follow-up lab work
      • Visualization & exploratory analysis
      • Seeing and interacting with data important
      • Show data so that questions can be answered
      • Scalability, incorporate statistics, etc.

  42. Function Prediction Evaluation. Huttenhower C*, Hibbs MA*, Myers CL* et al. The impact of incomplete knowledge on gene function prediction. Bioinformatics, 2009.
      • Cross-validation based on known biology
      • Most often used method in literature
      • Results are useful, but can be biased
      • Laboratory evaluation
      • More accurate, more difficult
      • Ultimate goal of functional genomics
      • Identify novel biology
      • Publish biological corpus

  43. Promise of Computational Functional Genomics. [Diagram labels: Data & Existing Knowledge, Computational Approaches, Predictions, Laboratory Experiments]

  44. Petite Frequency Assay

  45. Petite Frequency Phenotypes for Predictions

  46. Overall Result Summary

  47. Double mutant petite freq.

  48. Mitochondrial Motility

  49. Respiratory Growth Rate
