Information Bottleneck versus Maximum Likelihood


Presentation Transcript


  1. Information Bottleneck versus Maximum Likelihood Felix Polyakov

  2. Overview of the talk • Brief review of the Information Bottleneck • Maximum Likelihood • Information Bottleneck and Maximum Likelihood • Example from Image Segmentation

  3. A Simple Example...

  4. Simple Example

  5. A new compact representation • The document clusters preserve the relevant information between the documents and words

  6. Feature Selection? • NO ASSUMPTIONS about the source of the data • Extracting relevant structure from data • functions of the data (statistics) that preserve information • Information about what? • Need a principle that is both general and precise.

  7. Documents Words

  8. The information bottleneck, or relevance through distortion (N. Tishby, F. Pereira, and W. Bialek) • We would like the relevant partitioning T to compress X as much as possible, and to capture as much information about Y as possible

  9. Goal: find q(T | X) • Note the Markovian independence relation T ↔ X ↔ Y

  10. Variational problem • Iterative algorithm
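
The iterative algorithm can be sketched with the standard self-consistent IB updates: q(t|x) ∝ q(t) exp(−β KL[p(y|x) || q(y|t)]), with q(t) and q(y|t) recomputed from q(t|x). The code below is a minimal sketch, not the speaker's implementation; the function names, the random initialization, and the assumptions p(x) > 0 and β > 0 are illustrative choices.

```python
import math
import random


def kl(p, q):
    """D(p || q) in nats; returns inf if q has no mass where p does."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi <= 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def cluster_stats(p_x, p_y_given_x, q_t_given_x):
    """Return q(t) and q(y|t) induced by the soft assignment q(t|x)."""
    n_x, n_t, n_y = len(p_x), len(q_t_given_x[0]), len(p_y_given_x[0])
    q_t = [sum(p_x[x] * q_t_given_x[x][t] for x in range(n_x)) for t in range(n_t)]
    q_y_given_t = []
    for t in range(n_t):
        if q_t[t] <= 0:  # empty cluster: its conditional is undefined
            q_y_given_t.append(None)
            continue
        q_y_given_t.append([
            sum(p_x[x] * q_t_given_x[x][t] * p_y_given_x[x][y] for x in range(n_x)) / q_t[t]
            for y in range(n_y)
        ])
    return q_t, q_y_given_t


def iterative_ib(p_x, p_y_given_x, n_clusters, beta, n_iter=50, seed=0):
    """Alternate the IB updates starting from a random soft assignment."""
    rng = random.Random(seed)
    q = []
    for _ in p_x:
        row = [rng.random() + 0.1 for _ in range(n_clusters)]
        norm = sum(row)
        q.append([v / norm for v in row])
    for _ in range(n_iter):
        q_t, q_y_given_t = cluster_stats(p_x, p_y_given_x, q)
        new_q = []
        for x in range(len(p_x)):
            scores = []
            for t in range(n_clusters):
                if q_y_given_t[t] is None:
                    scores.append(-math.inf)
                else:
                    scores.append(math.log(q_t[t]) - beta * kl(p_y_given_x[x], q_y_given_t[t]))
            top = max(scores)  # subtract the max before exponentiating
            weights = [math.exp(s - top) for s in scores]
            norm = sum(weights)
            new_q.append([w / norm for w in weights])
        q = new_q
    return q
```

Calling `iterative_ib(p_x, p_y_given_x, n_clusters=2, beta=5.0)` returns one row of cluster probabilities per x; documents with identical p(y|x) receive identical rows.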

  11. Overview of the talk • Short review of the Information Bottleneck • Maximum Likelihood • Information Bottleneck and Maximum Likelihood • Example from Image Segmentation

  12. A simple example... • A coin is known to be biased • The coin is tossed three times – two heads and one tail • Use ML to estimate the probability of throwing a head • Model: p(head) = P, p(tail) = 1 - P • Try P = 0.2: L(O) = 0.2 * 0.2 * 0.8 = 0.032 • Try P = 0.4: L(O) = 0.4 * 0.4 * 0.6 = 0.096 • Try P = 0.6: L(O) = 0.6 * 0.6 * 0.4 = 0.144 • Try P = 0.8: L(O) = 0.8 * 0.8 * 0.2 = 0.128 • Plot: likelihood of the data vs. probability of a head
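
The trial values on this slide can be reproduced, and the maximum located, with a few lines of Python. This is an illustrative sketch; the grid search and the names `likelihood`, `p_grid`, and `p_closed_form` are not from the talk. For h heads and t tails the likelihood P^h (1 - P)^t is maximized at P = h / (h + t), here 2/3.

```python
def likelihood(p, heads=2, tails=1):
    """Likelihood of one fixed toss sequence with the given head/tail counts."""
    return p ** heads * (1 - p) ** tails


# The four trial values from the slide.
for p in (0.2, 0.4, 0.6, 0.8):
    print(f"P = {p}: L(O) = {likelihood(p):.3f}")

# Grid search over [0, 1] compared with the closed-form maximizer.
grid = [i / 1000 for i in range(1001)]
p_grid = max(grid, key=likelihood)
p_closed_form = 2 / (2 + 1)
print(f"grid maximum: {p_grid:.3f}, closed form: {p_closed_form:.3f}")
```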

  13. A bit more complicated example: Mixture Model • Three baskets B1, B2, B3 with white (O = 1), grey (O = 2), and black (O = 3) balls • 15 balls were drawn as follows: • Choose a basket i according to p(i) = b_i • Draw ball j from basket i according to that basket's ball probabilities • Use ML to estimate the model parameters given the observations: the sequence of the balls' colors

  14. Likelihood of observations • Log Likelihood of observations • Maximal Likelihood of observations
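
For a mixture such as the basket example, these three quantities are conventionally written as follows. The notation is assumed here rather than taken from the slide: b_i is the basket prior, p(o | i) the probability of color o in basket i, o_1, …, o_N the observed colors, and θ the collection of all parameters.

```latex
\begin{aligned}
L(\theta) &= \prod_{n=1}^{N} \sum_{i=1}^{3} b_i \, p(o_n \mid i) \\
\log L(\theta) &= \sum_{n=1}^{N} \log \sum_{i=1}^{3} b_i \, p(o_n \mid i) \\
\theta_{\mathrm{ML}} &= \arg\max_{\theta} \, \log L(\theta)
\end{aligned}
```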

  15. Likelihood of the observed data • x – hidden random variables [e.g. basket] • y – observed random variables [e.g. color] • θ – model parameters [e.g. they define p(y|x)] • θ⁰ – current estimate of model parameters

  16. Expectation-maximization algorithm (I) • 1. Expectation • Compute • Get • 2. Maximization • The EM algorithm converges to a local maximum

  17. Log-likelihood is non-decreasing, examples
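
A minimal EM sketch for the basket mixture illustrates the two steps and the non-decreasing log-likelihood. The observed color sequence in the usage note, the random initialization, and all function names are hypothetical; colors are coded 0, 1, 2 rather than 1, 2, 3.

```python
import math
import random


def log_likelihood(obs, mix, emit):
    """Sum over draws of log sum_i mix[i] * emit[i][color]."""
    return sum(math.log(sum(m * e[o] for m, e in zip(mix, emit))) for o in obs)


def em_step(obs, mix, emit):
    k, n_colors = len(mix), len(emit[0])
    # E-step: posterior probability of each basket for each observed ball.
    resp = []
    for o in obs:
        joint = [mix[i] * emit[i][o] for i in range(k)]
        norm = sum(joint)
        resp.append([j / norm for j in joint])
    # M-step: re-estimate basket priors and per-basket color probabilities.
    new_mix = [sum(r[i] for r in resp) / len(obs) for i in range(k)]
    new_emit = []
    for i in range(k):
        total = sum(r[i] for r in resp)
        new_emit.append([
            sum(r[i] for r, o in zip(resp, obs) if o == c) / total
            for c in range(n_colors)
        ])
    return new_mix, new_emit


def random_distribution(rng, size):
    """Strictly positive random probabilities that sum to one."""
    raw = [rng.random() + 0.1 for _ in range(size)]
    norm = sum(raw)
    return [v / norm for v in raw]


def run_em(obs, n_baskets=3, n_colors=3, n_iter=30, seed=0):
    rng = random.Random(seed)
    mix = random_distribution(rng, n_baskets)
    emit = [random_distribution(rng, n_colors) for _ in range(n_baskets)]
    history = [log_likelihood(obs, mix, emit)]
    for _ in range(n_iter):
        mix, emit = em_step(obs, mix, emit)
        history.append(log_likelihood(obs, mix, emit))
    return mix, emit, history
```

With a made-up sequence of 15 draws such as `obs = [0, 0, 1, 2, 2, 1, 0, 2, 2, 1, 0, 0, 2, 1, 2]`, the values in `history` returned by `run_em(obs)` never decrease from one iteration to the next.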

  18. EM – another approach • Goal: • Jensen’s inequality for a concave function
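
Because the logarithm is concave, Jensen's inequality gives the usual lower bound on the log-likelihood for any distribution q(x) over the hidden variables. The symbols θ and q(x) and the name F(q, θ) for the bound are notation assumed here.

```latex
\begin{aligned}
\log p(y \mid \theta)
  &= \log \sum_{x} q(x) \, \frac{p(x, y \mid \theta)}{q(x)} \\
  &\ge \sum_{x} q(x) \, \log \frac{p(x, y \mid \theta)}{q(x)}
   \;=\; F(q, \theta)
\end{aligned}
```

Equality holds when q(x) = p(x | y, θ). In this view the expectation step maximizes F over q with θ fixed, and the maximization step maximizes F over θ with q fixed.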

  19. Expectation-maximization algorithm (II) • 1. Expectation • 2. Maximization • (I) and (II) are equivalent

  20. Scheme of the approach Expectation Maximization

  21. Overview of the talk • Short review of the Information Bottleneck • Maximum Likelihood • Information Bottleneck and Maximum Likelihood for a toy problem • Example from Image Segmentation

  22. Words – Y • Documents – X • Topics – t • t ~ π(t) • x ~ π(x) • y|t ~ β(y|t)

  23. Model parameters • Sampling algorithm • For i = 1:N • choose x_i by sampling from π(x) • choose y_i by sampling from β(y|t(x_i)) • increase n(x_i, y_i) by one • Example • x_i = 9 • t(9) = 2 • sample from β(y|2) • get y_i = “Drug” • set n(9, “Drug”) = n(9, “Drug”) + 1
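
The sampling loop on this slide can be sketched directly in Python. The parameter tables in the usage note (document ids, topic labels, and word probabilities) are made-up values for illustration, and `sample_counts` is a hypothetical name.

```python
import random
from collections import Counter


def sample_counts(pi_x, t_of_x, beta_y_given_t, n_samples, seed=0):
    """Draw N (document, word) pairs and return the count table n(x, y)."""
    rng = random.Random(seed)
    docs = list(pi_x)
    doc_weights = [pi_x[d] for d in docs]
    counts = Counter()
    for _ in range(n_samples):
        # choose x_i by sampling from the document prior
        x = rng.choices(docs, weights=doc_weights)[0]
        # look up the topic t(x_i) and sample y_i from that topic's word distribution
        word_probs = beta_y_given_t[t_of_x[x]]
        words = list(word_probs)
        y = rng.choices(words, weights=[word_probs[w] for w in words])[0]
        # increase n(x_i, y_i) by one
        counts[(x, y)] += 1
    return counts
```

For example, with `pi_x = {7: 0.5, 9: 0.5}`, `t_of_x = {7: 1, 9: 2}` and `beta_y_given_t = {1: {"Gene": 0.7, "Cell": 0.3}, 2: {"Drug": 0.6, "Dose": 0.4}}`, every draw with x = 9 produces a word from topic 2, such as "Drug".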

  24. β(y|t=1) • β(y|t=2) • β(y|t=3) • X • t(X)

  25. Toy problem: which parameters maximize the likelihood? • T = topics • X = documents • Y = words • t(x) • β(y|t(x))

  26. EM approach • E-step • M-step • Normalization factor
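
For a model in which each document has one hidden topic, the textbook EM updates for a mixture of multinomials are: E-step q(t|x) ∝ π(t) Π_y β(y|t)^n(x,y), and M-step π(t) ∝ Σ_x q(t|x), β(y|t) ∝ Σ_x q(t|x) n(x,y). The sketch below implements these standard updates under that assumption; it is not claimed to match the slide's formulas term by term, and the names and the `TINY` floor are illustrative.

```python
import math
import random

TINY = 1e-300  # floor that keeps log() finite if a probability reaches zero


def doc_log_scores(row, pi_t, beta):
    """log pi(t) + sum_y n(x,y) log beta(y|t) for one document, for every t."""
    return [
        math.log(max(pi_t[t], TINY))
        + sum(n * math.log(max(beta[t][y], TINY)) for y, n in enumerate(row) if n > 0)
        for t in range(len(pi_t))
    ]


def e_step(counts, pi_t, beta):
    """Posterior q(t|x); the division by `norm` is the normalization factor."""
    q = []
    for row in counts:
        scores = doc_log_scores(row, pi_t, beta)
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        q.append([w / norm for w in weights])
    return q


def m_step(counts, q):
    n_x, n_y, n_t = len(counts), len(counts[0]), len(q[0])
    pi_t = [sum(q[x][t] for x in range(n_x)) / n_x for t in range(n_t)]
    beta = []
    for t in range(n_t):
        weighted = [sum(q[x][t] * counts[x][y] for x in range(n_x)) for y in range(n_y)]
        total = sum(weighted)
        beta.append([w / total for w in weighted] if total > 0 else [1.0 / n_y] * n_y)
    return pi_t, beta


def log_likelihood(counts, pi_t, beta):
    """Log-likelihood up to the constant multinomial coefficients."""
    total = 0.0
    for row in counts:
        scores = doc_log_scores(row, pi_t, beta)
        top = max(scores)
        total += top + math.log(sum(math.exp(s - top) for s in scores))
    return total


def run_em(counts, n_topics, n_iter=20, seed=0):
    rng = random.Random(seed)
    q = []
    for _ in counts:
        row = [rng.random() + 0.1 for _ in range(n_topics)]
        norm = sum(row)
        q.append([v / norm for v in row])
    history = []
    for _ in range(n_iter):
        pi_t, beta = m_step(counts, q)
        history.append(log_likelihood(counts, pi_t, beta))
        q = e_step(counts, pi_t, beta)
    return q, pi_t, beta, history
```

Here `counts[x][y]` plays the role of n(x, y), for instance the table produced by the sampling algorithm, and `history` records the log-likelihood after each M-step.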

  27. IB approach Normalization factor

  28. ML ↔ IB • r is a scaling constant

  29. IB ↔ ML mapping • X is uniformly distributed • r = |X| • The EM algorithm is equivalent to the IB iterative algorithm

  30. IB ↔ ML mapping • X is uniformly distributed • β = n(x) • All the fixed points of the likelihood L are mapped to all the fixed points of the IB functional L_IB = I(T;X) − β I(T;Y) • At the fixed points, −log L ∝ L_IB + const
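
The IB functional I(T;X) − β I(T;Y) can be evaluated for any soft assignment q(t|x) with a short helper. This is a sketch under assumed inputs: `p_x` is p(x), `p_y_given_x[x][y]` is p(y|x), `q_t_given_x[x][t]` is q(t|x), and the function names are illustrative.

```python
import math


def mutual_information(joint):
    """Mutual information (nats) of a joint distribution given as a 2-D list."""
    row_marginals = [sum(row) for row in joint]
    col_marginals = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (row_marginals[i] * col_marginals[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


def ib_functional(p_x, p_y_given_x, q_t_given_x, beta):
    """I(T;X) - beta * I(T;Y) for the Markov chain T <-> X <-> Y."""
    n_x, n_t, n_y = len(p_x), len(q_t_given_x[0]), len(p_y_given_x[0])
    # joint q(t, x) = p(x) q(t|x)
    joint_tx = [[p_x[x] * q_t_given_x[x][t] for x in range(n_x)] for t in range(n_t)]
    # joint q(t, y) = sum_x p(x) q(t|x) p(y|x)
    joint_ty = [
        [sum(p_x[x] * q_t_given_x[x][t] * p_y_given_x[x][y] for x in range(n_x))
         for y in range(n_y)]
        for t in range(n_t)
    ]
    return mutual_information(joint_tx) - beta * mutual_information(joint_ty)
```

Putting all of X into a single cluster gives a value of 0, since T then carries no information about X or Y.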

  31. X is uniformly distributed • β = n(x) • −(1/r) F − β H(Y) = L_IB • −F ∝ L_IB + const • Every algorithm increases F iff it decreases L_IB

  32. Deterministic case (ML ↔ IB) • N → ∞ (or β → ∞) • EM: • IB:

  33. N (or  ) • Do not speak about uniformity of X here • All the fixed points of L are mapped to all the fixed points of L • -F  L+ const • Every algorithm which finds a fixed point of L, induces a fixed point of L and vice versa • In case of several different f.p., the solution that maximized L is mapped to the solution that minimizes L.

  34. Example • N = , β = • This does not mean that q(t) = π(t)

  35. When N → ∞, every algorithm increases F iff it decreases L_IB with β → ∞ • How large must N (or β) be? • How is it related to the “amount of uniformity” in n(x)?

  36. Simulations for iIB

  37. Simulations for EM

  38. Simulations • 200 runs = 100 (small N) + 100 (large N) • In 58 runs, iIB converged to a smaller value of (−F) than EM • In 46 runs, EM converged to a value of (−F) corresponding to a smaller value of L_IB

  39. IB ↔ ML: quality estimation for the EM solution • The quality of an IB solution is measured through the theoretical upper bound • Using the mapping, one can adapt this measure to the ML estimation problem, for large enough N

  40. Summary: IB versus ML • ML and IB approaches are equivalent under certain conditions • Model comparison • The mixture model assumes that Y is independent of X given T(X): X ↔ T ↔ Y • In the IB framework, T is defined through the IB Markovian independence relation: T ↔ X ↔ Y • Can adapt the quality estimation measure from IB to the ML estimation problem, for large N

  41. Overview of the talk • Brief review of the Information Bottleneck • Maximum Likelihood • Information Bottleneck and Maximum Likelihood • Example from Image Segmentation (L. Hermes et al.)

  42. The clustering model • Pixels o_i, i = 1, …, n • Deterministic clusters c_ν, ν = 1, …, k • Boolean assignment matrix M ∈ {0, 1}^(n × k), Σ_ν M_iν = 1 • Observations

  43. oi q r • Observations

  44. Likelihood • Discretization of the color space into intervals I_j • Set • Data likelihood

  45. Relation to the IB

  46. Log-Likelihood • Assume that n_i = const, set β = n_i; then L_IB = −log L, where L_IB is the IB functional

  47. Images generated from the learned statistics
