
The EM algorithm Lecture #11


1. The EM algorithm Lecture #11
Acknowledgement: Some slides of this lecture are due to Nir Friedman.

2. Expectation Maximization (EM) for Bayesian networks
Intuition (as before):
• When we have access to all counts, we can find the ML estimate of all parameters in all local tables directly by counting.
• However, missing values do not allow us to perform such counts.
• So instead, we compute the expected counts using the current parameter assignment, and then use them to compute the maximum likelihood estimate.
[Figure: a Bayesian network over A, B, C, D with local tables P(A=a|θ), P(B=b|A=a,θ), P(C=c|A=a,θ), P(D=d|b,c,θ).]

3. Expectation Maximization (EM)
[Figure: data over variables X, Y, Z with some values missing (X: H T H H T; Y: ? ? H T T; Z: T ? ? T H). Under the current parameters, which include P(Y=H|X=H,Z=T,θ)=0.3 and P(Y=H|X=T,θ)=0.4, the E-step turns the data into expected counts, e.g. N(X,Y) = 1.3, 0.4, 1.7, 1.6 and N(X,Z) = 0.1, 0.2, 2.9, 1.8.]
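The fractional-count computation behind this slide can be sketched in a few lines of Python. This is a simplified sketch, not the lecture's exact model: the posteriors below condition on X only (the slide's network conditions on Z as well), and they are chosen so that the N(X,Y) expected counts come out as on the slide.

```python
from collections import defaultdict

# Current-parameter posteriors P(Y=H | X=x, theta'), simplified to condition
# on X only (the slide's model conditions on Z as well).
posterior_y_h = {'H': 0.3, 'T': 0.4}

# The (X, Y) columns of the slide's data; '?' marks a missing value of Y.
data = [('H', '?'), ('T', '?'), ('H', 'H'), ('H', 'T'), ('T', 'T')]

expected = defaultdict(float)
for x, y in data:
    if y == '?':
        # Missing Y: distribute one count over both values of Y,
        # weighted by the posterior under the current parameters.
        expected[(x, 'H')] += posterior_y_h[x]
        expected[(x, 'T')] += 1 - posterior_y_h[x]
    else:
        # Observed Y: a whole count.
        expected[(x, y)] += 1.0

print(dict(expected))  # N(H,H)=1.3, N(H,T)=1.7, N(T,H)=0.4, N(T,T)=1.6
```

The expected counts always sum to the number of data points, just as ordinary counts would; the M-step then treats them exactly like observed counts.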

4. EM (cont.)
[Figure: the EM iteration. Starting from an initial network (G,θ'), the E-step computes expected counts N(X1), N(X2), N(X3), N(Y,X1,X2,X3), N(Z1,Y), N(Z2,Y), N(Z3,Y) from the training data; the M-step reparameterizes, yielding an updated network (G,θ); then reiterate.]
Note: this EM iteration corresponds to the non-homogeneous HMM iteration. When parameters are shared across local probability tables or are functions of each other, changes are needed.

5. EM in Practice
Initial parameters:
• Random parameter setting
• "Best" guess from another source
Stopping criteria:
• Small change in the likelihood of the data
• Small change in parameter values
Avoiding bad local maxima:
• Multiple restarts
• Early "pruning" of unpromising ones

6. Relative Entropy – a measure of difference between distributions
We define the relative entropy H(P||Q) for two probability distributions P and Q of a variable X (with x ranging over the values of X) as follows:
H(P||Q) = Σx P(x) log2( P(x) / Q(x) )
This is a measure of the difference between P(x) and Q(x). It is not a symmetric function. The distribution P(x) is assumed to be the "true" distribution, and it is used for taking the expectation of the log of the ratio. The following properties hold:
H(P||Q) ≥ 0, with equality if and only if P(x) = Q(x) for all x.
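The definition and its properties are easy to check numerically; here is a minimal sketch (the function name and the example distributions are illustrative, not from the lecture):

```python
import math

def relative_entropy(p, q):
    """H(P||Q) = sum_x P(x) * log2(P(x)/Q(x)).

    p, q are lists of probabilities over the same values of X;
    assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(relative_entropy(p, q))  # positive: P differs from Q
print(relative_entropy(p, p))  # zero: equality iff P == Q
# Not symmetric: H(P||Q) and H(Q||P) generally differ.
print(relative_entropy(q, p))
```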

7. Average score for sequence comparisons
Recall that we have defined the scoring function via
s(a,b) = log2( P(a,b) / (Q(a)Q(b)) )
Note that the average score Σa,b P(a,b) s(a,b) is the relative entropy H(P||Q), where Q(a,b) = Q(a)Q(b). Relative entropy also arises when choosing among competing models.

8. The setup of the EM algorithm
We start with a likelihood function parameterized by θ. The observed quantity is denoted X=x. It is often a vector x1,…,xL of observations (e.g., evidence for some nodes in a Bayesian network). The hidden quantity is a vector Y=y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x,y|θ) would be easy to maximize.
The log-likelihood of an observation x has the form:
log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)
(because P(x,y|θ) = P(x|θ) P(y|x,θ)).

9. The goal of the EM algorithm
The log-likelihood of ONE observation x has the form:
log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)
The goal: starting with a current parameter vector θ', EM's goal is to find a new vector θ such that P(x|θ) > P(x|θ') with the highest possible difference.
The result: after enough iterations, EM reaches a local maximum of the likelihood P(x|θ).

10. The Expectation Operator
Recall that the expectation of a random variable Y with a pdf P(y) is given by E[Y] = Σy y p(y). The expectation of a function L(Y) is given by E[L(Y)] = Σy p(y) L(y).
The expectation operator E is linear: for two random variables X, Y and constants a, b, we have E[aX + bY] = a E[X] + b E[Y].
An example used by the EM algorithm:
Q(θ|θ') = Eθ'[log P(x,y|θ)] = Σy P(y|x,θ') log P(x,y|θ)

11. Improving the likelihood
Starting with log P(x|θ) = log P(x,y|θ) – log P(y|x,θ), multiplying both sides by P(y|x,θ'), and summing over y, yields
log P(x|θ) = Σy P(y|x,θ') log P(x,y|θ) – Σy P(y|x,θ') log P(y|x,θ)
where the first term is Eθ'[log P(x,y|θ)] = Q(θ|θ').
We now observe that
Δ = log P(x|θ) – log P(x|θ')
  = Q(θ|θ') – Q(θ'|θ') + Σy P(y|x,θ') log [ P(y|x,θ') / P(y|x,θ) ]
The last sum is a relative entropy, hence ≥ 0. So choosing θ* = argmaxθ Q(θ|θ') maximizes the difference Δ, and repeating this process leads to a local maximum of log P(x|θ).

12. The EM algorithm
Input: a likelihood function P(x,y|θ) parameterized by θ.
Initialization: fix an arbitrary starting value θ'.
Repeat:
  E-step: compute Q(θ|θ') = Eθ'[log P(x,y|θ)]
  M-step: θ' ← argmaxθ Q(θ|θ')
Until Δ = log P(x|θ) – log P(x|θ') < ε.
Comment: at the M-step one can actually choose any θ' as long as Δ > 0. This change yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute.
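To make the loop concrete, here is a minimal EM sketch for a toy model that is not from the lecture: each sample is the number of heads in 10 tosses of one of two biased coins, chosen uniformly at random, and which coin was tossed is the hidden y. The E-step computes the posteriors P(y|x,θ'), the M-step re-estimates θ from the resulting expected counts, and the loop stops on a small change in parameter values (one of the stopping criteria from slide 5).

```python
def em_two_coins(head_counts, n_flips, theta=(0.6, 0.5), tol=1e-6, max_iter=500):
    """EM for a two-coin mixture: hidden y = which coin produced each sample.

    head_counts: number of heads observed in each sample of n_flips tosses.
    theta: initial guesses for the two coins' heads probabilities.
    """
    th_a, th_b = theta
    for _ in range(max_iter):
        heads_a = tails_a = heads_b = tails_b = 0.0
        # E-step: responsibility of coin A for each sample, P(y=A | x, theta').
        for h in head_counts:
            like_a = th_a ** h * (1 - th_a) ** (n_flips - h)
            like_b = th_b ** h * (1 - th_b) ** (n_flips - h)
            w = like_a / (like_a + like_b)
            heads_a += w * h
            tails_a += w * (n_flips - h)
            heads_b += (1 - w) * h
            tails_b += (1 - w) * (n_flips - h)
        # M-step: ML estimates from the expected counts.
        new_a = heads_a / (heads_a + tails_a)
        new_b = heads_b / (heads_b + tails_b)
        # Stopping criterion: small change in parameter values.
        if abs(new_a - th_a) + abs(new_b - th_b) < tol:
            return new_a, new_b
        th_a, th_b = new_a, new_b
    return th_a, th_b

# Five samples of 10 flips each; EM separates a "mostly heads" coin
# from a "mostly tails" coin without ever observing which coin was used.
theta_a, theta_b = em_two_coins([9, 8, 2, 9, 1], 10)
print(theta_a, theta_b)
```

Note that the likelihoods are computed up to the binomial coefficient, which cancels in the responsibility ratio.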

13. The EM algorithm (with multiple independent samples)
Recall that the log-likelihood of an observation x has the form:
log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)
For independent samples (xi, yi), i=1,…,m, we can write:
Σi log P(xi|θ) = Σi log P(xi,yi|θ) – Σi log P(yi|xi,θ)
Each sample is completed separately.
E-step: compute Q(θ|θ') = Eθ'[Σi log P(xi,yi|θ)] = Σi Eθ'[log P(xi,yi|θ)]
M-step: θ' ← argmaxθ Q(θ|θ')

14. MLE from Incomplete Data
Finding MLE parameters is a nonlinear optimization problem.
Expectation Maximization (EM): use the "current point" θ' to construct an alternative function Eθ'[log P(x,y|θ)] (which is "nice").
Guarantee: the maximum of the new function has a higher likelihood than the current point.
[Figure: the log-likelihood log P(x|θ) and the surrogate Eθ'[log P(x,y|θ)] plotted against θ.]

15. Gene Counting Revisited (as EM)
The observations: the variables X = (NA, NB, NAB, NO) with a specific assignment x = (nA, nB, nAB, nO).
The hidden quantity: the variables Y = (Na/a, Na/o, Nb/b, Nb/o) with a specific assignment y = (na/a, na/o, nb/b, nb/o).
The parameters: θ = {θa, θb, θo}.
The likelihood of the completed data of n points:
P(x,y|θ) = P(nAB, nO, na/a, na/o, nb/b, nb/o | θ)
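The slide's explicit formula did not survive extraction; assuming the standard Hardy–Weinberg genotype probabilities (θa² for genotype a/a, 2θaθo for a/o, and so on), the completed-data likelihood would have the usual multinomial form:

```latex
P(x,y\mid\theta) =
\frac{n!}{n_{a/a}!\,n_{a/o}!\,n_{b/b}!\,n_{b/o}!\,n_{AB}!\,n_{O}!}\,
(\theta_a^2)^{n_{a/a}}\,(2\theta_a\theta_o)^{n_{a/o}}\,
(\theta_b^2)^{n_{b/b}}\,(2\theta_b\theta_o)^{n_{b/o}}\,
(2\theta_a\theta_b)^{n_{AB}}\,(\theta_o^2)^{n_{O}}
```

This is easy to maximize because the log turns the products into sums of counts times log-parameters, which is why y was chosen this way.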

16. The E-step of Gene Counting
The likelihood of the hidden data given the observed data of n points:
P(y|x,θ') = P(na/a, na/o, nb/b, nb/o | nA, nB, nAB, nO, θ')
          = P(na/a, na/o | nA, θ'a, θ'o) · P(nb/b, nb/o | nB, θ'b, θ'o)
This is exactly the E-step we used earlier!

17. The M-step of Gene Counting
Start from the log-likelihood of the completed data of n points, log P(x,y|θ). Taking the expectation with respect to Y = (Na/a, Na/o, Nb/b, Nb/o) and using the linearity of E yields the function Q(θ|θ') which we need to maximize.

18. The M-step of Gene Counting (cont.)
We need to maximize the function Q(θ|θ') under the constraint θa + θb + θo = 1. The solution, obtained using Lagrange multipliers, matches the M-step we used earlier!
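Putting the E- and M-steps together, gene counting can be sketched in Python. This is a hedged reconstruction of the standard textbook algorithm rather than a transcription of the slides' lost formulas, and the phenotype counts in the example are hypothetical.

```python
def gene_counting_em(n_a, n_b, n_ab, n_o, theta=(1 / 3, 1 / 3, 1 / 3),
                     tol=1e-9, max_iter=1000):
    """EM for ABO allele frequencies (gene counting).

    Observed: phenotype counts n_a, n_b, n_ab, n_o.
    Hidden:   the genotype splits a/a vs a/o and b/b vs b/o.
    """
    ta, tb, to = theta
    n = n_a + n_b + n_ab + n_o
    for _ in range(max_iter):
        # E-step: expected genotype counts given current allele frequencies.
        # P(a/a | phenotype A) = ta^2 / (ta^2 + 2*ta*to), similarly for B.
        e_aa = n_a * ta * ta / (ta * ta + 2 * ta * to)
        e_ao = n_a - e_aa
        e_bb = n_b * tb * tb / (tb * tb + 2 * tb * to)
        e_bo = n_b - e_bb
        # M-step: count alleles among the 2n genes (each person carries two).
        new_a = (2 * e_aa + e_ao + n_ab) / (2 * n)
        new_b = (2 * e_bb + e_bo + n_ab) / (2 * n)
        new_o = (e_ao + e_bo + 2 * n_o) / (2 * n)
        if abs(new_a - ta) + abs(new_b - tb) + abs(new_o - to) < tol:
            return new_a, new_b, new_o
        ta, tb, to = new_a, new_b, new_o
    return ta, tb, to

# Hypothetical phenotype counts for a sample of 521 people.
theta_a, theta_b, theta_o = gene_counting_em(n_a=186, n_b=38, n_ab=13, n_o=284)
print(theta_a, theta_b, theta_o)
```

The M-step is literal allele counting over the expected genotype counts, which is exactly the closed form the Lagrange-multiplier maximization of Q(θ|θ') produces; the estimates sum to 1 at every iteration by construction.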

19. Outline for a different derivation of Gene Counting as an EM algorithm
Define a variable X with values xA, xB, xAB, xO. Define a variable Y with values ya/a, ya/o, yb/b, yb/o, ya/b, yo/o. Examine the Bayesian network Y → X.
The local probability table for Y is P(ya/a|θ) = θaθa, P(ya/o|θ) = 2θaθo, etc. The local probability table for X given Y is P(xA | ya/a, θ) = 1, P(xA | ya/o, θ) = 1, P(xA | yb/o, θ) = 0, etc.; it contains only 0's and 1's.
Homework: write down for yourself the likelihood function for n independent points (xi, yi), and check that the EM equations match the gene counting equations.
