
Bayesian Networks & Gene Expression



Presentation Transcript


  1. Bayesian Networks & Gene Expression

  2. The Problem. The data form a matrix of expression measurements, where Aij is the mRNA level of gene j in experiment i (rows: experiments i; columns: genes j). Goal: • Learn regulatory/metabolic networks • Identify causal sources of the biological phenomena of interest

  3. Analysis Approaches • Clustering of expression data • Groups together genes with similar expression patterns • Does not reveal structural relations between genes • Boolean networks • Deterministic models of the logical interactions between genes • Deterministic, static • Linear models • Deterministic fully-connected linear model • Under-constrained, assumes linearity of interactions

  4. Probabilistic Network Approach • Characterize stochastic (non-deterministic!) relationships between expression patterns of different genes • Beyond pair-wise interactions => structure! • Many interactions are explained by intermediate factors • Regulation involves combined effects of several gene-products • Flexible in terms of types of interactions (not necessarily linear or Boolean!)

  5. Probabilistic Network Structure. A noisy stochastic process of hair heredity (pedigree example, with nodes Al, Marge, Homer, Lisa, Maggie, Bart): • A node represents an individual's genotype • An edge represents direct influence/interaction • Note: Bart's genotype is dependent on Al's genotype, but independent of it given Marge's genotype

  6. A Model: Bayesian Network. (Figure: a DAG showing an ancestor, a parent, a child X, a descendant, and non-descendants Y1, Y2.) Structure: DAG • Meaning: a child is conditionally independent of its non-descendants, given the values of its parents • Does not impose causality, but suits modeling of causal processes

  7. Local Structures & Independencies • Common parent: A ← B → C • Cascade: A → B → C • V-structure: A → C ← B • The language is compact, the concepts are rich!

  8. Bayesian Network – CPDs. (Network: Burglary and Earthquake → Alarm; Earthquake → Radio; Alarm → Call.) Local Probabilities: CPD - conditional probability distribution P(Xi|Pai) • Discrete variables: Multinomial Distribution (can represent any kind of statistical dependency) • Example, P(A | E,B):

  E   B  | P(A=1)  P(A=0)
  e   b  |  0.9     0.1
  e  ¬b  |  0.2     0.8
  ¬e  b  |  0.9     0.1
  ¬e ¬b  |  0.01    0.99
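As a concrete rendering of such a discrete CPD (a minimal sketch for illustration, not from the presentation), the table P(A | E,B) above can be stored as a plain lookup keyed by parent assignments:

    # Minimal sketch (illustrative): the CPD P(A | E,B) from the alarm
    # example as a lookup table. Keys are (e, b); values are P(A=1 | e, b).
    cpd_alarm = {
        (1, 1): 0.9,   # earthquake and burglary
        (1, 0): 0.2,   # earthquake, no burglary
        (0, 1): 0.9,   # burglary, no earthquake
        (0, 0): 0.01,  # neither
    }

    def p_alarm(a, e, b):
        """Return P(A=a | E=e, B=b) for binary a, e, b."""
        p1 = cpd_alarm[(e, b)]
        return p1 if a == 1 else 1.0 - p1

    print(p_alarm(1, 0, 1))  # 0.9: alarm fires with prob 0.9 given burglary only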

  9. Bayesian Network – CPDs (cont.) • Continuous variables: e.g. linear Gaussian, X | Y=y ~ N(a·y + b, σ²)
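A linear Gaussian CPD models a continuous child as normally distributed around a linear function of its parent. A minimal sampling sketch (the coefficients a, b, sigma are assumptions for illustration):

    import random

    # Sketch of a linear Gaussian CPD: X | Y=y ~ Normal(a*y + b, sigma^2).
    # The coefficients are illustrative, not taken from the presentation.
    a, b, sigma = 0.8, 0.1, 0.25

    def sample_x_given_y(y):
        """Draw one sample of X given the parent value Y=y."""
        return random.gauss(a * y + b, sigma)

    print(sample_x_given_y(1.5))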

  10. Bayesian Network Semantics. Qualitative part (the DAG, specifying conditional independence statements) + quantitative part (local probability models) = a unique joint distribution over the domain. • Every node is dependent on many others, BUT: with at most k parents per node we need O(2^k · n) parameters vs. O(2^n) for an unrestricted joint => good for memory conservation & learning robustness • The joint distribution decomposes nicely: the generic chain rule P(C,A,R,E,B) = P(B) * P(E|B) * P(R|E,B) * P(A|R,B,E) * P(C|A,R,B,E) collapses, given the DAG's independencies, to P(C,A,R,E,B) = P(B) * P(E) * P(R|E) * P(A|B,E) * P(C|A) (a worked numeric sketch follows)
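The factored joint can be evaluated by multiplying the local CPDs along the DAG. A sketch for the alarm network; P(A|E,B) uses the table from slide 8, while the remaining probabilities are illustrative assumptions, since the slides do not give them:

    p_b1, p_e1 = 0.01, 0.02                       # P(B=1), P(E=1): assumed
    p_r1_given_e = {1: 0.95, 0: 0.001}            # P(R=1 | E=e): assumed
    p_a1_given_eb = {(1, 1): 0.9, (1, 0): 0.2,
                     (0, 1): 0.9, (0, 0): 0.01}   # P(A=1 | E,B): from slide 8
    p_c1_given_a = {1: 0.7, 0: 0.05}              # P(C=1 | A=a): assumed

    def bern(p1, x):
        """P(X=x) for a binary variable with P(X=1) = p1."""
        return p1 if x == 1 else 1.0 - p1

    def joint(c, a, r, e, b):
        """P(C,A,R,E,B) = P(B) * P(E) * P(R|E) * P(A|E,B) * P(C|A)."""
        return (bern(p_b1, b) * bern(p_e1, e) * bern(p_r1_given_e[e], r)
                * bern(p_a1_given_eb[(e, b)], a) * bern(p_c1_given_a[a], c))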

  11. So, Why Bayesian Networks? • Flexible representation of the (in)dependency structure of multivariate distributions and interactions • Natural for modeling global processes with local interactions => good for biology • Clear probabilistic semantics • Natural for statistical confidence analysis of results and answering of queries • Stochastic in nature: models stochastic processes & deals with ("sums out") noise in measurements

  12. Inference • Supports answering valuable queries, e.g.: • Is node X independent of node Y given nodes Z,W? • What is the probability of X=true if (Y=false and Z=true)? • What is the joint distribution of (X,Y) if R=false? • What is the likelihood of some full assignment? • What is the most likely assignment of values to all the nodes of the network? • These queries can be answered relatively efficiently (a brute-force illustration follows)
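For intuition only, here is brute-force enumeration answering conditional queries over the five-variable alarm network, reusing joint() from the sketch under slide 10. Real inference engines use smarter algorithms (e.g. variable elimination); this is just the simplest correct method:

    from itertools import product

    def prob_query(target, evidence):
        """P(target | evidence) by summing the joint over all assignments.
        target and evidence map names in {'C','A','R','E','B'} to 0/1."""
        names = ['C', 'A', 'R', 'E', 'B']
        num = den = 0.0
        for values in product([0, 1], repeat=len(names)):
            assign = dict(zip(names, values))
            if any(assign[v] != x for v, x in evidence.items()):
                continue                      # inconsistent with evidence
            p = joint(*values)
            den += p
            if all(assign[v] == x for v, x in target.items()):
                num += p
        return num / den

    # e.g. probability of a burglary given a call but no radio report:
    print(prob_query({'B': 1}, {'C': 1, 'R': 0}))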

  13. Learning Bayesian Network. The goal: • Given a set of independent samples (assignments to the random variables), find the best (the most likely?) Bayesian Network, i.e. both the DAG and the CPDs • Example samples: { (B,E,A,C,R)=(T,F,F,T,F), (B,E,A,C,R)=(T,F,T,T,F), …, (B,E,A,C,R)=(F,T,T,T,F) }

  14. Learning Bayesian Network • Learning the best CPDs given a DAG is easy (collect statistics of the values of each node given each specific assignment to its parents). But… • The structure (G) learning problem is NP-hard => a heuristic search for the best model must be applied, generally yielding a locally optimal network • It turns out that richer structures give higher likelihood P(D|G) to the data (adding an edge is always preferable), because…

  15. Learning Bayesian Network (cont.) • If we add B to Pa(C) (e.g., turning A → C into A → C ← B), we have more parameters to fit => more freedom => we can always optimize CPD(C) so that the likelihood does not decrease: max P(D | G with B→C) ≥ max P(D | G) • But we prefer simpler (more explanatory) networks (Occam's razor!) • Therefore, practical scores for Bayesian Networks offset the likelihood improvement with a "fine" on complex networks (one standard such score is given below).
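One widely used penalized score of exactly this kind (the slide stops before naming one, so this is added context) is the BIC score:

  score_BIC(G : D) = log P(D | G, θ̂_G) − (log N / 2) · dim(G)

where N is the number of samples, θ̂_G the maximum-likelihood parameters, and dim(G) the number of free parameters of G. The first term rewards fit; the second is the "fine" that grows with network complexity, so an added edge must buy enough likelihood to pay for its extra parameters.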

  16. Modeling Biological Regulation Variables of interest: • Expression levels of genes • Concentration levels of proteins • Exogenous variables: Nutrient levels, Metabolite Levels, Temperature • Phenotype information • … Bayesian Network Structure: • Capture dependencies among these variables

  17. Possible Biological Interpretation. Interactions are represented by a graph: • Each gene is represented by a node in the graph • Edges between the nodes represent direct dependency • Random variables correspond to the measured expression level of each gene • A probabilistic dependency A → B may reflect a gene interaction, possibly mediated by an intermediate gene X (A → X → B)

  18. More Local Structures • Dependencies can be mediated through other nodes: an intermediate gene (cascade A → B → C) or a common cause (A ← B → C) • Common/combinatorial effects: V-structure (A → C ← B)

  19. The Approach of Friedman et al. Expression data => Bayesian Network Learning Algorithm => learned network. Use the learned network to make predictions about the structure of the interactions between genes. No prior biological knowledge is used!

  20. The Discretization Problem • The expression measurements are real numbers • => We need to discretize them in order to learn general CPDs => we lose information • => If we don't, we must assume some specific type of CPD (like "linear Gaussian") => we lose generality (a simple discretization sketch follows)
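A minimal sketch of the kind of discretization meant here, assuming log-ratio expression values and a fold-change threshold (the threshold and the three-level scheme are assumptions, not taken from the presentation):

    import numpy as np

    def discretize(log_ratios, threshold=0.5):
        """Map each log-ratio to -1 (under-expressed), 0 (baseline),
        or +1 (over-expressed), relative to a fold-change threshold."""
        x = np.asarray(log_ratios)
        return np.where(x > threshold, 1, np.where(x < -threshold, -1, 0))

    print(discretize([0.9, -0.1, -1.3]))  # -> [ 1  0 -1]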

  21. Problem of Sparse Data • There are many more genes than experiments (= samples) => many different networks fit the data well => we can't rely on the usual learning algorithm alone • Shrink the network search space. E.g., we can use the observation that in biological systems each gene is regulated directly by only a few regulators • Don't take the resulting network for granted; instead, extract from it the pieces of reliable information

  22. Learning With Many Variables. Sparse Candidate algorithm - an efficient heuristic search that relies on the sparseness of regulation networks: • For each gene, choose a promising "candidate parent set" of likely direct influences • Find a (locally) optimal BN constrained so that each gene's parents come from its candidate set • Iteratively improve the candidate sets (a simplified sketch of the candidate-selection step follows)
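A simplified, runnable sketch of the candidate-selection step only (an assumption for illustration: candidates ranked by absolute correlation; the actual Sparse Candidate algorithm uses information-theoretic measures and refines the sets against the current network across iterations):

    import numpy as np

    def candidate_parents(expr, k):
        """expr: (experiments x genes) matrix. For each gene, return the
        indices of the k most strongly correlated other genes as its
        candidate parent set."""
        corr = np.corrcoef(expr, rowvar=False)   # gene-by-gene correlations
        np.fill_diagonal(corr, 0.0)              # a gene is not its own parent
        return {g: list(np.argsort(-np.abs(corr[g]))[:k])
                for g in range(expr.shape[1])}

    expr = np.random.rand(76, 50)                # e.g. 76 experiments, 50 genes
    cands = candidate_parents(expr, k=5)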

  23. Experiment. Data from Spellman et al. (Mol. Bio. of the Cell, 1998): • Contains 76 genome-wide expression samples of yeast • Different methods were used for synchronizing the yeast cell cycle • Time series sampled at intervals of a few minutes (5-20 min) • Spellman et al. identified 800 cell-cycle-regulated genes

  24. Network Learned

  25. Challenge: Statistical Significance. Sparse data: • Small number of samples • "Flat posterior" -- many networks fit the data. Solution: estimate confidence in network features. E.g., two types of features: • Markov neighbors: X directly interacts with Y (they share an edge or a common child) • Order relations: X is an ancestor of Y

  26. Confidence Estimates. Bootstrap approach [FGW, UAI99]: resample the data D (with replacement) into datasets D1, D2, …, Dm; learn a network Gi from each Di; estimate the "confidence level" of a feature f as the fraction of learned networks in which it appears: conf(f) = (1/m) · Σi f(Gi) (a sketch of the loop follows)
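A sketch of the bootstrap loop (the learn_network argument is a stand-in for the structure-learning step, and feature is any predicate on a learned network, e.g. "edge X→Y present"):

    import numpy as np

    def bootstrap_confidence(data, m, feature, learn_network):
        """Confidence of `feature`: the fraction of m networks, each learned
        from a dataset resampled with replacement, in which it appears.
        data: (samples x variables) array."""
        n = data.shape[0]
        hits = 0
        for _ in range(m):
            idx = np.random.randint(0, n, size=n)  # resample rows
            if feature(learn_network(data[idx])):
                hits += 1
        return hits / m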

  27. In summary… The pipeline: • Preprocess: normalization and discretization of the expression data • Learn model: Bayesian Network learning algorithm, mutation modeling + bootstrap • Feature extraction: edges, separators, Markov and ancestor relations. Result: a list of features with high confidence. They can be biologically interpreted.

  28. Resulting Features: Markov Relations. Question: Do X and Y directly interact? • Parent-Child (one gene regulating the other): SST2 → STE6 (confidence 0.91); SST2 is a regulator in the mating pathway, STE6 an exporter of mating factor • Hidden Parent (two genes co-regulated by a hidden factor): ARG5 – ARG3 (0.84), both in arginine biosynthesis, co-regulated by the transcription factor GCN4

  29. Resulting Features: Separators • Question: Given that X and Y are indirectly dependent, who mediates this dependence? • Separator relations: • X affects Z, which in turn affects Y • Z regulates both X and Y • Example: KAR4 (mating transcriptional regulator of nuclear fusion) separates AGA1 and FUS1, both cell fusion genes

  30. Separators: Intra-cluster Context. Genes: SLT2 (MAPK of the cell wall integrity pathway), CRH1, YPS3, YPS1 (cell wall proteins), SLR3 (protein of unknown function) • All pairs have high correlation • Clustered together

  31. Separators: Intra-cluster Context (cont.) • SLT2, the pathway regulator, explains the dependence among the cell wall genes (CRH1, YPS3, YPS1, SLR3) • Many signaling and regulatory proteins were identified as direct and indirect separators

  32. Next Step: Sub-Networks • Automatic reconstruction • Goal: Dense sub-network with highly confident pair-wise features • Score: Statistical significance • Search: High scoring sub-networks • Advantages • Global picture • Structured context for interactions • Incorporate mid-confidence features

  33. Reconstruct Sub-Networks. The pipeline, extended with a final step: • Preprocess: normalization and discretization of the expression data • Learn model: Bayesian Network learning algorithm, mutation modeling + bootstrap • Feature extraction: edges, separators, Markov and ancestor relations • Feature assembly: local features => sub-networks => global network

  34. Results • 6 well-structured sub-networks representing coherent molecular responses: • Mating • Iron metabolism • Low-osmolarity cell wall integrity pathway • Stationary phase and stress response • Amino acid metabolism, mitochondrial function and sulfate assimilation • Citrate metabolism • Uncovered regulatory, signaling and metabolic interactions

  35. "Mating Response" Substructure. Two branches: • Cell fusion • Outgoing mating signal. Sub-network genes: KAR4 (transcriptional regulator of nuclear fusion), SST2 (signaling pathway regulator), KSS1, NDJ1, FUS1, PRM1, AGA1, TEC1, YLR343W, TOM6, FIG1, FUS3, AGA2, YLR334C, MFA1, STE6, YEL059W - many of these are genes that participate in cell fusion. What we missed: STE12 (the main TF); FUS3 (the main MAPK) is only marginal.

  36. More Insights to Incorporate in the BN Approach. Incorporating prior knowledge: • Sequence data (mainly promoters) • Protein-protein interaction measurements • Cluster analysis of genes and/or experiments • The large mass of biological knowledge, and insight from sequence/structure databases
