
Part IV: Inference algorithms

This part of the series focuses on solving computational problems in probabilistic models, including estimating parameters in models with latent variables and computing posterior distributions involving a large number of variables.

Presentation Transcript


  1. Part IV: Inference algorithms

  2. Estimation and inference • Actually working with probabilistic models requires solving some difficult computational problems… • Two key problems: • estimating parameters in models with latent variables • computing posterior distributions involving large numbers of variables

  3. Part IV: Inference algorithms • The EM algorithm • for estimation in models with latent variables • Markov chain Monte Carlo • for sampling from posterior distributions involving large numbers of variables

  4. Part IV: Inference algorithms (next: the EM algorithm) • The EM algorithm • for estimation in models with latent variables • Markov chain Monte Carlo • for sampling from posterior distributions involving large numbers of variables

  5. SUPERVISED [figure: observations individually labeled “dog” or “cat”]

  6. Supervised learning [figure: labeled observations from Category A and Category B] What characterizes the categories? How should we categorize a new observation?

  7. Parametric density estimation • Assume that p(x|c) has a simple form, characterized by parameters θ • Given stimuli X = x1, x2, …, xn from category c, find θ by maximum-likelihood estimation or some form of Bayesian estimation

  8. Spatial representations • Assume a simple parametric form for p(x|c): a Gaussian • For each category, estimate parameters • mean μ • variance σ² [graphical model: c → x, with P(c) and p(x|c); the mean and variance together form the parameters θ]

  9. The Gaussian distribution [figure: probability density p(x) plotted against (x−μ)/σ, showing the mean μ, the standard deviation σ, and the variance σ²]

  10. Estimating a Gaussian X = {x1, x2, …, xn} independently sampled from a Gaussian

  11. Estimating a Gaussian X = {x1, x2, …, xn} independently sampled from a Gaussian maximum likelihood parameter estimates: mean μ = (1/n) Σi xi, variance σ² = (1/n) Σi (xi − μ)²

  12. Multivariate Gaussians p(x) = (2π)^(−d/2) |Σ|^(−1/2) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)) • mean vector μ • variance/covariance matrix Σ • a quadratic form in the exponent

  13. Estimating a Gaussian X = {x1, x2, …, xn} independently sampled from a Gaussian maximum likelihood parameter estimates: the mean vector is the sample mean (1/n) Σi xi and the covariance matrix is (1/n) Σi (xi − μ)(xi − μ)ᵀ (see the sketch below)
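The following is a minimal NumPy sketch of these maximum-likelihood estimates, assuming the data are stored as an n × d array; the function name fit_gaussian_ml and the simulated example are illustrative additions, not from the slides.

```python
import numpy as np

def fit_gaussian_ml(X):
    """Maximum-likelihood estimates for a multivariate Gaussian from an (n, d) array X."""
    n = X.shape[0]
    mu = X.mean(axis=0)                 # sample mean
    centered = X - mu
    cov = centered.T @ centered / n     # ML covariance (divides by n, not n - 1)
    return mu, cov

# Example: recover the parameters from simulated data.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 2.0], [[1.0, 0.3], [0.3, 0.5]], size=1000)
mu_hat, cov_hat = fit_gaussian_ml(X)
print(mu_hat)
print(cov_hat)
```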

  14. Bayesian inference [figure: probability density over x]

  15. UNSUPERVISED

  16. Unsupervised learning What latent structure is present? What are the properties of a new observation?

  17. An example: Clustering Assume each observed xi is from a cluster ci, where ci is unknown What characterizes the clusters? What cluster does a new x come from?

  18. Density estimation • We need to estimate some probability distributions • what is P(c)? • what is p(x|c)? • But… c is unknown, so we only know the value of x [graphical model: c → x, with P(c) and p(x|c)]

  19. Supervised and unsupervised Supervised learning: categorization • Given x = {x1, …, xn} and c = {c1, …, cn} • Estimate parameters of p(x|c) and P(c) Unsupervised learning: clustering • Given x = {x1, …, xn} • Estimate parameters of p(x|c) and P(c)

  20. Mixture distributions p(x) = Σc P(c) p(x|c) • mixture distribution: p(x) • mixture weights: P(c) • mixture components: p(x|c) [figure: a mixture density over x] (a sketch of evaluating a mixture follows below)
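As a small illustration of the mixture formula above, here is a NumPy sketch that evaluates a two-component Gaussian mixture on a grid; the weights, means, and standard deviations are made-up values chosen only for the example.

```python
import numpy as np

# Two-component Gaussian mixture p(x) = sum_c P(c) p(x|c); all numbers are illustrative.
weights = np.array([0.3, 0.7])     # mixture weights P(c)
means = np.array([-2.0, 1.0])      # component means
sds = np.array([0.5, 1.0])         # component standard deviations

x = np.linspace(-5, 5, 201)
# p(x|c) for every grid point and every component, shape (len(x), 2)
components = np.exp(-0.5 * ((x[:, None] - means) / sds) ** 2) / (np.sqrt(2 * np.pi) * sds)
p_x = components @ weights         # mixture distribution p(x) at each grid point
```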

  21. More generally… Unsupervised learning is density estimation using distributions with latent variables: z is latent (unobserved), with P(z); x is observed, with P(x|z); marginalize out (i.e. sum over) the latent structure: P(x) = Σz P(z) P(x|z)

  22. A chicken and egg problem • If we knew which cluster the observations were from we could find the distributions • this is just density estimation • If we knew the distributions, we could infer which cluster each observation came from • this is just categorization

  23. Alternating optimization algorithm 0. Guess initial parameter values 1. Given parameter estimates, solve for maximum a posteriori assignments ci: ci = argmaxc P(c|xi, θ) 2. Given assignments ci, solve for maximum likelihood parameter estimates: θ = argmaxθ p(x, c|θ) 3. Go to step 1

  24. Alternating optimization algorithm x: observations; c: assignments to clusters; μ, σ, P(c): parameters. For simplicity, assume σ and P(c) fixed: the “k-means” algorithm (a sketch follows below)
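A compact NumPy sketch of this hard-assignment alternating scheme (steps 0–2 from slide 23), under the stated simplification that the variance and P(c) are fixed, so the MAP assignment is just “closest mean”; the function name k_means and its defaults are illustrative.

```python
import numpy as np

def k_means(X, k, n_iters=20, seed=0):
    """Alternate between hard assignments and mean updates on an (n, d) data array X."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # step 0: initial parameters
    for _ in range(n_iters):
        # Step 1: assign each point to the closest mean (the MAP assignment here).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Step 2: re-estimate each mean from the points currently assigned to it.
        means = np.array([X[c == j].mean(axis=0) if np.any(c == j) else means[j]
                          for j in range(k)])
    return means, c
```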

  25. Alternating optimization algorithm Step 0: initial parameter values

  26. Alternating optimization algorithm Step 1: update assignments

  27. Alternating optimization algorithm Step 2: update parameters

  28. Alternating optimization algorithm Step 1: update assignments

  29. Alternating optimization algorithm Step 2: update parameters

  30. Alternating optimization algorithm 0. Guess initial parameter values 1. Given parameter estimates, solve for maximum a posteriori assignments ci: ci = argmaxc P(c|xi, θ) 2. Given assignments ci, solve for maximum likelihood parameter estimates: θ = argmaxθ p(x, c|θ) 3. Go to step 1 why “hard” assignments?

  31. Estimating a Gaussian (with hard assignments) X = {x1, x2, …, xn} independently sampled from a Gaussian maximum likelihood parameter estimates: each point counts only toward its assigned cluster, so each cluster’s mean and variance are computed from exactly the points assigned to it

  32. Estimating a Gaussian (with soft assignments) the “weight” of each point is the probability of being in the cluster maximum likelihood parameter estimates: μc = Σi P(c|xi) xi / Σi P(c|xi), and similarly for the variance, with each term weighted by P(c|xi)

  33. The Expectation-Maximization algorithm (clustering version) 0. Guess initial parameter values 1. Given parameter estimates, compute posterior distribution over assignments ci: P(ci = c | xi, θ) ∝ p(xi | c, θ) P(c) 2. Solve for maximum likelihood parameter estimates, weighting each observation by the probability it came from that cluster 3. Go to step 1 (see the sketch below)
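A minimal sketch of this EM loop for a one-dimensional mixture of Gaussians, written in NumPy; the function name, the fixed iteration count, and the lack of convergence checks or numerical safeguards are simplifications for illustration.

```python
import numpy as np

def em_mixture_of_gaussians(x, k, n_iters=50, seed=0):
    """EM for a 1-D Gaussian mixture; x is a 1-D NumPy array of observations."""
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(k, 1.0 / k)                   # mixture weights P(c)
    mu = rng.choice(x, size=k, replace=False)  # initial means: k distinct data points
    var = np.full(k, x.var())                  # initial variances
    for _ in range(n_iters):
        # E step: posterior probability of each cluster for each point (soft assignments).
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: maximum-likelihood updates, weighting each point by its responsibility.
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```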

  34. The Expectation-Maximization algorithm (more general version) 0. Guess initial parameter values 1. Given parameter estimates, compute posterior distribution over latent variables z: P(z | x, θ) 2. Find parameter estimates θ maximizing the expectation of log p(x, z | θ) under that posterior 3. Go to step 1

  35. A note on expectations • For a function f(x) and distribution P(x), the expectation of f with respect to P is E_P[f(x)] = Σx f(x) P(x) (an integral for continuous x) • The expectation is the average of f, when x is drawn from the probability distribution P

  36. Good features of EM • Convergence • guaranteed to converge to at least a local maximum of the likelihood (or other extremum) • likelihood is non-decreasing across iterations • Efficiency • big steps initially (other algorithms better later) • Generality • can be defined for many probabilistic models • can be combined with a prior for MAP estimation

  37. Limitations of EM • Local maxima • e.g., one component poorly fits two clusters, while two components split up a single cluster • Degeneracies • e.g., two components may merge, or a component may lock onto one data point, with variance going to zero • May be intractable for complex models • dealing with this is an active research topic

  38. EM and cognitive science • The EM algorithm seems like it might be a good way to describe some “bootstrapping” • anywhere there’s a “chicken and egg” problem • a prime example: language learning

  39. Probabilistic context free grammars S → NP VP 1.0, NP → T N 0.7, NP → N 0.3, VP → V NP 1.0, T → the 0.8, T → a 0.2, N → man 0.5, N → ball 0.5, V → hit 0.6, V → took 0.4 [figure: parse tree for “the man hit the ball”, each node labeled with the probability of the rule that expands it] P(tree) = 1.0 × 0.7 × 1.0 × 0.8 × 0.5 × 0.6 × 0.7 × 0.8 × 0.5
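The product on the slide can be checked directly; the sketch below just multiplies the probabilities of the rules used in the parse (the order of the factors does not matter).

```python
# Rule probabilities used in the parse of "the man hit the ball" (slide 39):
# S->NP VP, NP->T N, T->the, N->man, VP->V NP, V->hit, NP->T N, T->the, N->ball.
rule_probs = [1.0, 0.7, 0.8, 0.5, 1.0, 0.6, 0.7, 0.8, 0.5]

p_tree = 1.0
for p in rule_probs:
    p_tree *= p
print(p_tree)  # ≈ 0.047
```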

  40. EM and cognitive science • The EM algorithm seems like it might be a good way to describe some “bootstrapping” • anywhere there’s a “chicken and egg” problem • a prime example: language learning • Fried and Holyoak (1984) explicitly tested a model of human categorization that was almost exactly a version of the EM algorithm for a mixture of Gaussians

  41. Part IV: Inference algorithms (next: Markov chain Monte Carlo) • The EM algorithm • for estimation in models with latent variables • Markov chain Monte Carlo • for sampling from posterior distributions involving large numbers of variables

  42. The Monte Carlo principle • The expectation of f with respect to P can be approximated by E_P[f(x)] ≈ (1/n) Σi f(xi), where the xi are sampled from P(x) • Example: the average # of spots on a die roll (see the sketch below)
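A tiny NumPy sketch of the die example: draw rolls from P(x) and average f(x) = x.

```python
import numpy as np

# Monte Carlo estimate of the expected number of spots on a fair die (true value 3.5).
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=10_000)   # samples from P(x) = 1/6 for x in {1, ..., 6}
print(rolls.mean())                       # approaches 3.5 as the number of rolls grows
```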

  43. The Monte Carlo principle The law of large numbers [figure: average number of spots plotted against number of rolls, converging to 3.5]

  44. Markov chain Monte Carlo • Sometimes it isn’t possible to sample directly from a distribution • Sometimes, you can only compute something proportional to the distribution • Markov chain Monte Carlo: construct a Markov chain that will converge to the target distribution, and draw samples from that chain • just uses something proportional to the target

  45. Markov chains Variables x(t+1) independent of all previous variables given immediate predecessor x(t) [figure: chain x(1) → x(2) → … → x(t) → x(t+1)] Transition matrix T = P(x(t+1)|x(t))

  46. An example: card shuffling • Each state x(t) is a permutation of a deck of cards (there are 52! permutations) • Transition matrix T indicates how likely one permutation will become another • The transition probabilities are determined by the shuffling procedure • riffle shuffle • overhand • one card

  47. Convergence of Markov chains • Why do we shuffle cards? • Convergence to a uniform distribution takes only 7 riffle shuffles… • Other Markov chains will also converge to a stationary distribution, if certain simple conditions are satisfied (called “ergodicity”) • e.g. every state can be reached in some number of steps from every other state
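A small sketch of this convergence idea, using a made-up 3-state transition matrix rather than the card-shuffling chain: repeatedly applying T drives any starting distribution toward the chain’s stationary distribution.

```python
import numpy as np

# Hypothetical 3-state chain; row i of T is P(x(t+1) | x(t) = i).
T = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

dist = np.array([1.0, 0.0, 0.0])   # start with all probability mass on state 0
for _ in range(50):
    dist = dist @ T                # one step of the chain
print(dist)                        # ≈ the stationary distribution, which satisfies pi @ T = pi
```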

  48. Markov chain Monte Carlo • States of chain are variables of interest • Transition matrix chosen to give target distribution as stationary distribution [figure: chain x(1) → x(2) → … → x(t) → x(t+1)] Transition matrix T = P(x(t+1)|x(t))

  49. Metropolis-Hastings algorithm • Transitions have two parts: • proposal distribution: Q(x(t+1)|x(t)) • acceptance: take proposals with probability A(x(t), x(t+1)) = min(1, [P(x(t+1)) Q(x(t)|x(t+1))] / [P(x(t)) Q(x(t+1)|x(t))])
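A minimal random-walk Metropolis-Hastings sketch in NumPy. The Gaussian proposal is symmetric, so the Q terms in the acceptance ratio cancel, and only something proportional to P (here, its log) is needed; the function name and step size are illustrative.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=1.0, seed=0):
    """Sample from a 1-D target known only up to a constant, via its log density log_p."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + step * rng.normal()                          # draw from Q(x(t+1) | x(t))
        accept_prob = np.exp(min(0.0, log_p(proposal) - log_p(x)))  # A(x(t), x(t+1))
        if rng.random() < accept_prob:
            x = proposal                                            # accept the proposal
        samples.append(x)                                           # otherwise keep x(t)
    return np.array(samples)

# Example: sample from an unnormalized standard Gaussian.
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_samples=5000)
print(samples.mean(), samples.std())   # roughly 0 and 1
```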

  50. Metropolis-Hastings algorithm [figure: samples from the chain tracing out a target density p(x)]
