
Sampling Bayesian Networks


Presentation Transcript


  1. Sampling Bayesian Networks ICS 275b 2005

  2. Approximation Algorithms • Structural Approximations: eliminate some dependencies, remove edges (Mini-Bucket Approach) • Search: approach for optimization tasks (MPE, MAP) • Sampling: generate random samples and compute the values of interest from the samples, not from the original network

  3. Algorithm Tree

  4. Sampling • Input: Bayesian network with set of nodes X • Sample = a tuple with assigned values s=(X1=x1,X2=x2,… ,Xk=xk) • Tuple may include all variables (except evidence) or a subset • Sampling schemas dictate how to generate samples (tuples) • Ideally, samples are distributed according to P(X|E)

  5. Sampling • Idea: generate a set of T samples • Estimate P(Xi|E) from the samples • Need to know: • How to generate a new sample? • How many samples T do we need? • How to estimate P(Xi|E)?

  6. Sampling Algorithms • Forward Sampling • Likelihood Weighting • Gibbs Sampling (MCMC) • Blocking • Rao-Blackwellised • Importance Sampling • Sequential Monte Carlo (Particle Filtering) in Dynamic Bayesian Networks

  7. Forward Sampling • Forward Sampling • Case with No evidence • Case with Evidence • N and Error Bounds

  8. Forward Sampling No Evidence (Henrion, 1988) Input: Bayesian network X = {X1,…,XN}, N = #nodes, T = #samples Output: T samples Process the nodes in topological order: first process the ancestors of a node, then the node itself: • For t = 1 to T • For i = 1 to N • xi^t ← a sample drawn from P(xi | pai) (A Python sketch follows the next slide.)

  9. Sampling a Value What does it mean to sample xi^t from P(Xi | pai)? • Assume D(Xi) = {0,1} • Assume P(Xi | pai) = (0.3, 0.7) • Draw a random number r from [0,1] • If r falls in [0, 0.3], set Xi = 0; if r falls in (0.3, 1], set Xi = 1 [Figure: the unit interval [0,1] split at 0.3.]
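A minimal Python sketch of slides 8-9. It assumes two hypothetical helpers that are not in the slides: `nodes`, a list of variable names in topological order, and `cpt(node, assignment)`, which returns P(node | its parents' values in `assignment`) as a list of (value, probability) pairs:

```python
import random

def sample_value(dist):
    """Inverse-CDF draw from a discrete distribution given as
    (value, probability) pairs: draw r in [0, 1) and return the first
    value whose cumulative mass exceeds r. For (0.3, 0.7):
    r < 0.3 -> first value, otherwise second value."""
    r = random.random()
    cum = 0.0
    for value, prob in dist:
        cum += prob
        if r < cum:
            return value
    return dist[-1][0]  # guard against floating-point round-off

def forward_sample(nodes, cpt):
    """One forward sample: process nodes in topological order, so every
    parent is already assigned when its child is sampled."""
    assignment = {}
    for node in nodes:
        assignment[node] = sample_value(cpt(node, assignment))
    return assignment
```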

  10. Sampling a Value • When we sample xi^t from P(Xi | pai), most of the time we will pick the most likely value of Xi; occasionally, we will pick an unlikely value • We want to find high-probability tuples But!!!… • Choosing an unlikely value allows us to “cross” the low-probability tuples to reach other high-probability tuples!

  11. Forward sampling (example)

  12. Forward Sampling: Answering Queries Task: given n samples {S1, S2, …, Sn}, estimate P(Xi = xi) by counting the proportion of samples where Xi = xi: P̂(Xi = xi) = #{t : Xi = xi in S^t} / n. (A code sketch follows.)
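As code, the histogram estimate is one line on top of the sketch above (`samples` is a list of assignment dicts):

```python
def estimate_marginal(samples, var, value):
    """Histogram estimate of P(var = value): the fraction of the
    samples in which `var` was assigned `value`."""
    return sum(1 for s in samples if s[var] == value) / len(samples)
```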

  13. Forward Sampling w/ Evidence Input: Bayesian network X = {X1,…,XN}, N = #nodes, E = evidence, T = #samples Output: T samples consistent with E • For t = 1 to T • For i = 1 to N • xi^t ← a sample drawn from P(xi | pai) • If Xi ∈ E and xi^t ≠ ei, reject the sample: • set i = 1 and go to step 2 (start this sample over)
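A sketch of the rejection scheme, reusing the hypothetical `forward_sample` from above; `evidence` maps evidence variables to their observed values. (The slide aborts a sample at the first mismatch; checking the completed sample, as here, yields the same distribution.)

```python
def rejection_sample(nodes, cpt, evidence, T):
    """Forward sampling with evidence: generate samples and discard
    any whose evidence variables disagree with the observed values,
    until T consistent samples have been accepted."""
    accepted = []
    while len(accepted) < T:
        s = forward_sample(nodes, cpt)
        if all(s[e] == v for e, v in evidence.items()):
            accepted.append(s)
    return accepted
```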

  14. Forward Sampling: Illustration Let Y ⊆ E be a subset of the evidence nodes with observed assignment Y = u.

  15. Forward Sampling: How Many Samples? Theorem: Let ŝ(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1 − δ, it is enough to have: T ≥ (1 − P(y)) / (P(y) · ε² · δ). Derived from Chebyshev’s bound.

  16. Forward Sampling: How Many Samples? Theorem: Let ŝ(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1 − δ, it is enough to have: T ≥ ln(2/δ) / (2 · ε² · P(y)²). Derived from Hoeffding’s bound (the full proof is given in Koller).
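For a concrete sense of scale, plugging numbers into the Hoeffding form above: with ε = 0.1, δ = 0.05, and P(y) = 0.2 we need T ≥ ln(40) / (2 · 0.01 · 0.04) ≈ 4612 samples, and the requirement grows quadratically as P(y) shrinks, which is exactly the rare-evidence problem flagged on the next slide.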

  17. Forward Sampling: Performance Advantages: • P(xi | pa(xi)) is readily available • Samples are independent! Drawbacks: • If evidence E is rare (P(e) is low), then we will reject most of the samples! • Since P(y) appears in the bound on T and is unknown, we must estimate P(y) from the samples themselves! • If P(e) is small, T becomes very big!

  18. Problem: Evidence • Forward sampling: high rejection rate • Fix the evidence values instead: • Gibbs sampling (MCMC) • Likelihood Weighting • Importance Sampling

  19. Forward Sampling Bibliography • [henrion88] M. Henrion, “Propagating uncertainty in Bayesian networks by probabilistic logic sampling”, Uncertainty in AI, pp. 149–163, 1988.

  20. Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990) “Clamping” evidence + forward sampling + weighting samples by evidence likelihood. Works well for likely evidence!

  21. Likelihood Weighting For each sample, process the nodes in topological order: if Xi ∈ E, clamp it to its observed value ei; otherwise sample xi^t from P(xi | pai).

  22. Likelihood Weighting Estimate: P̂(xi | e) = Σ_t w^t · δ(xi, x^t) / Σ_t w^t, where w^t = Π_{Xj ∈ E} P(ej | pa_j^t) is the likelihood of the evidence in sample t, and δ(xi, x^t) = 1 if Xi = xi in x^t, 0 otherwise.
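A sketch of the scheme in Python, reusing the hypothetical `nodes`, `cpt`, and `sample_value` from the forward-sampling sketch:

```python
def likelihood_weighting(nodes, cpt, evidence, T):
    """Likelihood weighting: evidence nodes are clamped to their
    observed values, and each sample is weighted by the likelihood
    of the evidence given its sampled parents."""
    samples = []
    for _ in range(T):
        assignment, weight = {}, 1.0
        for node in nodes:  # topological order, as in forward sampling
            dist = cpt(node, assignment)
            if node in evidence:
                assignment[node] = evidence[node]
                weight *= dict(dist)[evidence[node]]  # P(e_j | pa_j)
            else:
                assignment[node] = sample_value(dist)
        samples.append((assignment, weight))
    return samples

def lw_estimate(samples, var, value):
    """Normalized weighted estimate of P(var = value | e)."""
    total = sum(w for _, w in samples)
    hits = sum(w for s, w in samples if s[var] == value)
    return hits / total
```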

  23. Likelihood Convergence (Chebyshev’s Inequality) • Assume P(X=x|e) has mean μ and variance σ² • Chebyshev: P(|ŝ − μ| ≥ εμ) ≤ σ² / (T · ε² · μ²) • μ = P(x|e) is unknown ⇒ obtain it from samples!

  24. Error Bound Derivation K is a Bernoulli random variable: K^t = 1 if X = x in sample t, 0 otherwise, so E[K] = P(x|e) = μ and Var(K) = μ(1 − μ). The estimate ŝ = (1/T) Σ_t K^t then has Var(ŝ) = μ(1 − μ)/T, and plugging this into Chebyshev’s inequality gives the bound on the previous slide.

  25. Likelihood Convergence 2 • Assume P(X=x|e) has mean μ and variance σ² • Zero-One Estimation Theory (Karp et al., 1989): T ≥ 4 ln(2/δ) / (ε² · μ) samples suffice for relative error at most ε with probability at least 1 − δ • μ = P(x|e) is unknown ⇒ obtain it from samples!

  26. Local Variance Bound (LVB) (Dagum &amp; Luby, 1994) • Let λ be the LVB of a binary-valued network: every conditional probability in the network satisfies 0 &lt; λ ≤ P(xi | pai) ≤ 1 − λ.

  27. LVB Estimate (Pradhan &amp; Dagum, 1996) • Using the LVB, the Zero-One Estimator bound can be rewritten with the unknown μ = P(x|e) replaced by a worst-case lower bound expressed through λ, so the required number of samples no longer depends on the unknown posterior.

  28. Importance Sampling Idea • In general, it is hard to sample from target distribution P(X|E) • Generate samples from sampling (proposal) distribution Q(X) • Weigh each sample against P(X|E)
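Written out (this is the generic importance-sampling identity, not anything specific to these slides): each sample x^t drawn from Q receives weight w^t = P(x^t, e) / Q(x^t), and the posterior is estimated as P̂(xi | e) = Σ_t w^t · δ(xi, x^t) / Σ_t w^t. Likelihood weighting (slides 20-22) is the special case where Q is the prior with the evidence clamped.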

  29. Importance Sampling Variants Importance sampling: forward, non-adaptive • Nodes sampled in topological order • Sampling distribution (for non-instantiated nodes) equal to the prior conditionals Importance sampling: forward, adaptive • Nodes sampled in topological order • Sampling distribution adapted according to the average importance weights obtained in previous samples [Cheng &amp; Druzdzel, 2000]

  30. AIS-BN • The most efficient variant of importance sampling to date is AIS-BN: Adaptive Importance Sampling for Bayesian networks. • Jian Cheng and Marek J. Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research (JAIR), 13:155–188, 2000.

  31. Gibbs Sampling • Markov Chain Monte Carlo method (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994) • Samples are dependent and form a Markov chain • In the limit, samples are drawn directly from P(X|e) • Guaranteed to converge when all P &gt; 0 • Methods to improve convergence: • Blocking • Rao-Blackwellisation • Error bounds: • Lag-t autocovariance • Multiple chains, Chebyshev’s inequality

  32. MCMC Sampling Fundamentals Given a set of variables X = {X1, X2, …, Xn} that represents a joint probability distribution Π(X) and some function g(X), we can compute the expected value of g(X): E_Π[g(X)] = Σ_x g(x) · Π(x).

  33. MCMC Sampling From Π(X) A sample S^t is an instantiation: S^t = (x1^t, x2^t, …, xn^t). Given independent, identically distributed (iid) samples S^1, S^2, …, S^T from Π(X), it follows from the Strong Law of Large Numbers that ĝ = (1/T) Σ_{t=1}^T g(S^t) converges to E_Π[g(X)].
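As code, the estimator is a one-line sample average (`samples` is a list of instantiations, `g` any function of an instantiation; with g an indicator, this reduces to the counting estimators used throughout):

```python
def mc_expectation(samples, g):
    """Monte Carlo estimate of E[g(X)]: the sample average of g,
    which converges to the true expectation by the SLLN."""
    return sum(g(s) for s in samples) / len(samples)
```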

  34. Gibbs Sampling (Pearl, 1988) • A sample x^t, t ∈ {1, 2, …}, is an instantiation of all variables in the network: x^t = (x1^t, x2^t, …, xN^t) • Sampling process: • Fix the values of the observed variables e • Instantiate the node values in sample x^0 at random • Generate samples x^1, x^2, …, x^T from P(x|e) • Compute the posteriors from the samples

  35. Ordered Gibbs Sampler Generate sample x^{t+1} from x^t: for i = 1 to N, sample xi^{t+1} from P(Xi | x1^{t+1}, …, x_{i−1}^{t+1}, x_{i+1}^t, …, xN^t, e). In short: process all variables in some fixed order, each conditioned on the current values of all the others.

  36. Gibbs Sampling (cont’d) (Pearl, 1988) It suffices to condition on the Markov blanket (parents, children, and children’s parents): P(xi | all other variables) = P(xi | markov_i) ∝ P(xi | pai) · Π_{Xj ∈ ch(Xi)} P(xj | paj).

  37. Ordered Gibbs Sampling Algorithm Input: X, E Output: T samples {x^t} • Fix evidence E • Generate samples from P(X | E): • For t = 1 to T (compute samples) • For i = 1 to N (loop through variables) • xi^t ← a sample from P(Xi | markov_i^t), the conditional given the current values of Xi’s Markov blanket (A Python sketch follows.)
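A Python sketch of the ordered Gibbs sampler. It assumes a hypothetical helper `blanket_conditional(var, state)` returning P(var | Markov blanket of var) under the current state as (value, probability) pairs (by slide 36, proportional to P(var | pa_var) times the CPT entries of var's children), plus `domains` mapping each variable to its value list; `sample_value` is reused from the forward-sampling sketch:

```python
import random

def gibbs_chain(variables, domains, evidence, blanket_conditional, T):
    """Ordered Gibbs sampler: clamp the evidence, instantiate the rest
    at random (x^0), then repeatedly resample each non-evidence
    variable from its conditional given the current blanket values."""
    state = dict(evidence)                    # evidence stays fixed
    for var in variables:
        if var not in evidence:               # random instantiation x^0
            state[var] = random.choice(domains[var])
    chain = []
    for _ in range(T):
        for var in variables:                 # loop through variables
            if var in evidence:
                continue
            state[var] = sample_value(blanket_conditional(var, state))
        chain.append(dict(state))
    return chain
```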

  38. Answering Queries • Query: P(xi | e) = ? • Method 1 (histogram): count the fraction of samples where Xi = xi: P̂(xi | e) = (1/T) Σ_t δ(xi, x^t) • Method 2 (mixture estimator): average the conditional probabilities: P̂(xi | e) = (1/T) Σ_t P(xi | markov_i^t)
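Both estimators in code, on top of the Gibbs sketch above:

```python
def histogram_estimate(chain, var, value):
    """Method 1: fraction of samples in which var == value."""
    return sum(1 for s in chain if s[var] == value) / len(chain)

def mixture_estimate(chain, var, value, blanket_conditional):
    """Method 2: average P(var = value | Markov blanket) over the
    chain; it typically has lower variance than simple counting."""
    total = sum(dict(blanket_conditional(var, s))[value] for s in chain)
    return total / len(chain)
```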

  39. Importance vs. Gibbs Importance sampling draws independent samples x^t from a proposal Q(x) and corrects with weights w^t = P(x^t, e) / Q(x^t); Gibbs sampling draws dependent, unweighted samples from a chain whose stationary distribution is P(x | e).

  40. Gibbs Sampling Example - BN X = {X1, X2, …, X9}, E = {X9} [Figure: nine-node Bayesian network over X1, …, X9 with X9 observed.]

  41. Gibbs Sampling Example - BN Random initial instantiation of the non-evidence nodes: X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0 [Same network figure.]

  42. Gibbs Sampling Example - BN Sample X1: x1^1 ← P(X1 | x2^0, …, x8^0, x9), E = {X9} [Same network figure.]

  43. Gibbs Sampling Example - BN Sample X2: x2^1 ← P(X2 | x1^1, x3^0, …, x8^0, x9), E = {X9} [Same network figure.]

  44. Gibbs Sampling: Illustration

  45. Gibbs Sampling: Burn-In • We want to sample from P(X | E) • But… the starting point is random • Solution: throw away the first K samples • Known as “burn-in” • What is K? Hard to tell. Use intuition. • Alternative: draw the first sample’s values from an approximation of P(x|e) (for example, run IBP first)

  46. Gibbs Sampling: Convergence • The chain converges to a stationary distribution π*: π* = π* · P, where P is the transition kernel with entries pij = P(Xi → Xj) • Guaranteed to converge if the chain is: • irreducible • aperiodic • ergodic (∀ i, j: pij &gt; 0)
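A toy numeric illustration of the fixed-point equation (the two-state kernel below is made up, not from the slides):

```python
import numpy as np

# For this kernel, pi* = pi* P has the solution pi* = (0.6, 0.4).
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
pi = np.array([0.5, 0.5])
for _ in range(100):
    pi = pi @ P   # power iteration converges: this chain is
                  # irreducible and aperiodic
print(pi)         # -> approximately [0.6 0.4]
```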

  47. Irreducible • A Markov chain (or its probability transition matrix) is said to be irreducible if it is possible to reach every state from every other state (not necessarily in one step). • In other words, ∀ i, j ∃ k: P^(k)_ij &gt; 0, where k is the number of steps taken to get from state i to state j.

  48. Aperiodic • Define d(i) = g.c.d.{n &gt; 0 : it is possible to go from i to i in n steps}, where g.c.d. is the greatest common divisor of the integers in the set. If d(i) = 1 for all i, then the chain is aperiodic.

  49. Ergodicity • A recurrent state is a state to which the chain returns with probability 1: Σ_n P^(n)_ii = ∞ • Recurrent, aperiodic states are ergodic. Note: an extra condition for ergodicity is that the expected recurrence time is finite; this holds for recurrent states in a finite-state chain.

  50. Gibbs Convergence • Gibbs convergence is generally guaranteed as long as all probabilities are positive! • Intuition for the ergodicity requirement: if nodes X and Y are correlated s.t. X = 0 ⇔ Y = 0, then: • once we sample and assign X = 0, we are forced to assign Y = 0; • once we sample and assign Y = 0, we are forced to assign X = 0; ⇒ we will never be able to change their values again! • Another problem: it can take a very long time to converge!
