1 / 42

Systems analysis of innate immune mechanisms in infection – a role for HPC

Systems analysis of innate immune mechanisms in infection – a role for HPC. Peter Ghazal. What is Pathway Biology?. Pathway biology is…. A systems biology approach for understanding a biological process - empirically by functional association of

laddie
Download Presentation

Systems analysis of innate immune mechanisms in infection – a role for HPC

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Systems analysis of innate immune mechanisms in infection – a role for HPC Peter Ghazal

  2. What is Pathway Biology? Pathway biology is…. A systems biology approach for understanding a biological process - empirically by functional association of multiple gene products & metabolites - computationally by defining networks of cause-effect relationships.  Pathway Models link molecular; cellular; whole organism levels. FORMAL MODELS --- ALLOW PREDICTING the outcome of Costly or Intractable Experiments

  3. Focus and outline of talk • High through-put approaches to mapping and understanding host-response to infection. • Targeting the host NOT the “bug” as anti-infective strategy • Making HPC more accessible: SPRINT a new framework for high dimensional biostatistic computation Story starts at the bed side

  4. Differentially expressed genes in neonates control vs Infected (FDR p>1x10-5, FC±4) Sterol/ lipogenic

  5. Dealing with HTP data: Impact of data variability • Model for introducing biological and technical variation:

  6. Machine Learning methods: Random Forest (RF) Support Vector Machine (SVM) Linear Discriminant Analysis (LDA) K-Nearest Neighbour (K-NN) Modelling patient variability and biomarkers for classification • How different data characteristics affect the misclassification errors? • Factors investigated: • Data variability (biological and technical variations) • Training set size • Number of replications • Correlation between RNA biomarkers Mizanur Khondoker

  7. Error rate vs. (number of biomarkers, total variation) An example of a simulation model to quantify number of biomarkers and level of patient variability

  8. Conclusions from simulations • There is increased predictive value using multiple markers – although there is no magic number that can be recommended as optimal in all situations. • Optimal number greatly depends on the data under study. • The important determining factors of optimal number of biomarkers are: • The degree of differential expression (fold-change, p-values etc.) • Amount of biological and technical variation in the data. • The size of the training set upon which the classifier is to be built. • The number of replication for each biomarkers. • The degree of correlation between biomarkers. • Now possible to predict optimal number through simulation.

  9. Rule of five: Criteria for pathogenesis based biomarkers • Readily accessible • Multiple markers • Appropriately powered statistical association • Physiological relevance • Causally linked to phenotype Key challenge is mapping biomarkers into: biological context and understanding Requires an experimental model system

  10. Bone Marrow Blood Tissue Resident Macrophage (immature) ? Monocyte ? (Primary Signal) Inflammation IFN-gamma Lymphokines Activated T-Lymphocyte Myeloid Stem Cell Primed Macrophage (Secondary Signal) Endotoxin, IFN-gamma Pluripotent Stem Cell “Activated” Cytolytic Macrophage Promonocyte

  11. Transcriptional profile of MΦ activated by Ifng

  12. How do we tackle this? A sub-system study of cause effect relationships with a defined start (input) and end (output). Experimentation genetic screens microarrays Y2H mechanism based studies Literature Data-mining PATHWAY BIOLOGY Modelling Network analysis

  13. Mapping new nodes Literature Data-mining PATHWAY BIOLOGY Experimentation

  14. Transcriptional profile of MΦinfected with CMV

  15. Hypothesis generation • Blue zone vs red zone

  16. Down regulation of sterol pathway

  17. BUT… recorded changes are small – Do they have any effect?Next step modelling PATHWAY BIOLOGY Experimental data Pure and applied modelling Network inference analysis

  18. Workflow Literature derived model Known parameters Order of magnitude estimation ODE model Unknown parameters Vary parameters by an order of magnitude Ensemble average Ensemble of ODE models Results

  19. Cholesterol Synthesis Modelling • ODE model, Michaelis-Menten interactions • 57 Parameters • 25 Known Parameters • 32 Unknown Parameters • Algorithm • Using the first three time points, calculate an equilibrium state • Release model from equilibrium and simulate using enzyme data • For each unknown, consider this model across 3 orders of magnitude, • holding the other unknowns parameters fixed. Where available, parameters obtained from the Brenda enzyme database http://www.brenda-enzymes.info/

  20. Cholesterol (output of sterol pathway) results from simulation and expts Predictions: Experiments: Cholesterol rate/flux Cholesterol levels

  21. Lipidomic – mass spec results

  22. Infection down regulate cholesterol biosynthesis pathway and free intra-cellular cholesterol. Can now predict the behaviour of the pathway. But? Just as a good as UK (Met Office) weather predictions……because……

  23. Increasing complexity and size of biological data Solution: High Performance Computing (HPC)? Scalability issues related to increased complexity HPC for High Throughput Post-Genomic Data

  24. Volume of data Many research groups can now routinely generate high volumes of data Memory (RAM) handling: Input data size is too big Algorithms cause linear, exponential or other growth in data volume CPU performance: Routine analyses take too long Problems with large biological data sets

  25. Gene clustering using R on a high-spec workstation: 16,000 genes, k=12 gene clusters runs for ~30min 16,000 genes, k=40 gene clusters runs for ~10hrs Partitioning-Around-Medoids, n genes, k=12 clusters requested Memory fail limit Limitation examples: Clustering

  26. Outcome: Adverse effect on research • Arbitrary size reduction of input data • Batch processing of data • Analyses in smaller steps • Avoidance of some algorithms • Failure to analyse

  27. HPC takes many forms: clusters, networks, supercomputer, grid, GPUs, “cloud”, ... Provides more computational power HPC is technically accessible for most: Department own, Eddie, HECToR,... However! Solution: High Performance Computing

  28. HPC Access Hurdles • Cost of access • Time to adapt • Complex, require specialist skills • Consultancy (e.g. EPCC) only feasible on ad-hoc basis, not routinely

  29. HPC Access Hurdles HPC is (currently) optimal for: Specific problems that can be tackled as a project Individuals who are familiar with parallelisation and system architectures HPC is not optimal for: Routine/casual analyses of high-throughput data Ad-hoc and ever-changing analyses algorithms Data analysts without time or knowledge to sidestep into parallelisation software/hardware.

  30. Challenge two fold!! Provide a generic solution Easy to use Need a step change (up!) to broaden HPC access to all biologists

  31. Post Genomic Data R Biological Results A solution for analyses using R SPRINT (DPM & EPCC)) Very Large Post Genomic Data Very Large Post Genomic Data SPRINT HPC (Eddie) R R Biological Results

  32. SPRINT SPRINT has 2 components: HPC harness manages access to HPC Library of parallel R functions e.g. cor (correlation) pam (clustering) maxt (permutation Allows non-specialists to make use of HPC resources, with analysis functions parallelised by us or the R community.

  33. Code comparison data(golub) smallgd <- golub[1:100,] classlabel <- golub.cl resT <- mt.maxT(smallgd, classlabel, test="t", side="abs") quit(save="no") library("sprint") data(golub) smallgd <- golub[1:100,] classlabel <- golub.cl resT <- pmaxT(smallgd, classlabel, test="t", side="abs") pterminate() quit(save="no")

  34. Permutation Benchmark

  35. Correlation Benchmark

  36. Clustering Benchmarks

  37. Cloud (confidentiality issues) GPU (limitations is data size) Future

  38. Viral Interaction Networks Host Interaction Networks Bed-bench-models-almost back to bed Virus Antiviral Systemic Therapeutic Host New therapeutic and diagnostic opportunities

  39. THANK YOU& Acknowlegments to our sponsors

  40. Acknowledgement Mathieu Blanc Steven Watterson Mizanur Khondoker Paul Dickinson Thorsten Forster Muriel Mewissen Terry Sloan Jon Hill Michal Piotrowski Arthur Trew EPCC

More Related