
Real-time on-line learning of transformed hidden Markov models






Presentation Transcript


  1. Real-time on-line learning of transformed hidden Markov models
Nemanja Petrovic, Nebojsa Jojic, Brendan Frey and Thomas Huang
Microsoft, University of Toronto, University of Illinois

  2. Six break points vs. six things in video
• Traditional video segmentation: find breakpoints. Example: MovieMaker (cut and paste).
• Our goal: find possibly recurring scenes or objects.
[Figure: a timeline whose segments are labeled 1 2 3 2 4 1 4 3 2 3 2 3 5 6, with representative frames for the recurring scenes]

  3. Transformed hidden Markov model
• Class c with prior P(c = k) = π_k
• Translation T with a uniform prior
• Latent image z with P(z|c) = N(z; μ_c, Φ_c)
• Observed frame x = Tz, so E[x] = Tμ_c and Var[x] = TΦ_c T^T
• p(x|c,T) = N(x; Tμ_c, TΦ_c T^T)
Generation is repeated for each frame of the sequence, with the pair (T, c) being the state of a Markov chain.
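To make the generative process concrete, here is a minimal sketch of sampling one frame from this model, assuming a diagonal Φ_c and circular shifts for T; the sizes and parameter values are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy sizes: HxW frames, K classes (not the paper's values).
H, W, K = 8, 8, 3

pi = np.full(K, 1.0 / K)            # class prior P(c = k) = pi_k
mu = rng.normal(size=(K, H, W))     # class means mu_c
phi = np.full((K, H, W), 0.1)       # diagonal class variances Phi_c

def sample_frame(c, shift):
    """Draw z ~ N(mu_c, Phi_c), then return x = T z (a circular shift)."""
    z = mu[c] + np.sqrt(phi[c]) * rng.normal(size=(H, W))
    return np.roll(z, shift, axis=(0, 1))

# One generation step: sample (c, T) from the priors, then the frame.
# In the full model, (T, c) evolves as a Markov chain across frames.
c = rng.choice(K, p=pi)
shift = (int(rng.integers(H)), int(rng.integers(W)))  # uniform over shifts
x = sample_frame(c, shift)
```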

  4. Goal: maximize total likelihood of a dataset
log p(X) = log Σ_{T,c} Σ_Z p(X, {T,c}, Z)
= log Σ_{T,c} Σ_Z q({T,c}, Z) p(X, {T,c}, Z) / q({T,c}, Z)
≥ Σ_{T,c} Σ_Z q({T,c}, Z) log p(X, {T,c}, Z) - Σ_{T,c} Σ_Z q({T,c}, Z) log q({T,c}, Z) = B
We factor q({T,c}, Z) = q({T,c}) q(Z | {T,c}). Here {T,c} denotes the values of transformation and class for all frames, i.e., the path the video sequence takes through the state space of the model. Instead of the likelihood, we optimize the bound B, which is tight for q = p({T,c}, Z | X).
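A toy numerical check of the tightness claim, with a single discrete hidden state standing in for the whole path {T,c} and made-up distributions: B equals log p(X) exactly when q is the posterior, and lower-bounds it for any other q.

```python
import numpy as np

rng = np.random.default_rng(1)

S = 4                                       # hidden states (toy stand-in for {T,c})
prior = rng.dirichlet(np.ones(S))           # p(s)
lik = rng.dirichlet(np.ones(3), size=S)     # p(x | s) over 3 observation symbols
x = 1                                       # one observed symbol

joint = prior * lik[:, x]                   # p(x, s)
log_px = np.log(joint.sum())                # exact log-likelihood log p(x)

def bound(q):
    """B = sum_s q(s) log p(x, s) - sum_s q(s) log q(s)."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

q_exact = joint / joint.sum()               # posterior p(s | x)
q_other = np.full(S, 1.0 / S)               # some other distribution

assert np.isclose(bound(q_exact), log_px)   # tight at the posterior
assert bound(q_other) <= log_px + 1e-12     # lower bound otherwise
```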

  5. Posterior approximation
We allow q({T,c}) to have non-zero probability only on the M most probable paths:
q({T,c}) = Σ_{m=1..M} r_m δ({T,c} - {T,c}*_m)   (Viterbi 1982)
This avoids a number of problems with adaptive scaling in the exact forward-backward inference.
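One concrete way to realize this M-best restriction is a beam search over state paths. The sketch below is an illustrative stand-in, not the authors' implementation: it enumerates the (T, c) pairs as flat states and keeps the M best partial paths at each frame.

```python
import numpy as np

def m_best_paths(log_trans, log_obs, M):
    """Approximate q({T,c}) by the M most probable state paths.

    log_trans[i, j] = log p(state j at t+1 | state i at t)
    log_obs[t, j]   = log p(frame t | state j), states flattening (T, c);
    the initial-state prior is assumed folded into log_obs[0] for brevity.
    Returns the normalized path weights r_m and the paths themselves.
    """
    n_frames, n_states = log_obs.shape
    # Each beam entry: (log joint probability of the path, the path).
    beams = sorted(((log_obs[0, s], [s]) for s in range(n_states)),
                   key=lambda b: b[0], reverse=True)[:M]
    for t in range(1, n_frames):
        candidates = [
            (lp + log_trans[path[-1], s] + log_obs[t, s], path + [s])
            for lp, path in beams
            for s in range(n_states)
        ]
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:M]
    log_ps = np.array([lp for lp, _ in beams])
    r = np.exp(log_ps - log_ps.max())        # unnormalized weights r_m
    return r / r.sum(), [path for _, path in beams]
```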

  6. Expensive part of the E step
Find a quick way to calculate
log p(x|c,T) = -(N/2) log(2π) - ½ log|TΦ_c T^T| - ½ (x - Tμ_c)^T (TΦ_c T^T)^{-1} (x - Tμ_c)
for all possible shifts T in the E step of the EM algorithm.
[Figure: frame x correlated against the shifted cluster mean Tμ, producing log p as a function of the shift T]

  7. Computing the Mahalanobis distance using FFTs
All terms that have to be evaluated for all T can be expressed as correlations, e.g.:
x^T (TΦT^T)^{-1} Tμ = x^T T (Φ^{-1}μ) = Σ_pixels x .* T(diag(Φ)^{-1} .* μ) = IFFT( FFT(x) .* conj(FFT(Φ^{-1}μ)) )
N log N versus N^2 !
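A sketch of the whole per-class computation, assuming a diagonal Φ_c and circular shifts (so |TΦ_c T^T| = |Φ_c| is constant in T); the correlation convention here is one possible choice and may differ from the authors' code.

```python
import numpy as np

def loglik_all_shifts(x, mu, phi):
    """log p(x | c, T) for every circular shift T at once, via FFTs.

    With diagonal phi, the quadratic form for shift k is
        sum_j (x[j+k] - mu[j])**2 / phi[j],
    which splits into two correlations plus a constant.
    """
    n = x.size
    const = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(phi).sum()

    def corr(a, m):
        # result[k] = sum_j a[j + k] * m[j]  (circular cross-correlation)
        return np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(m))).real

    quad = (corr(x ** 2, 1.0 / phi)          # sum_j x[j+k]^2 / phi[j]
            - 2.0 * corr(x, mu / phi)        # -2 sum_j x[j+k] mu[j] / phi[j]
            + np.sum(mu ** 2 / phi))         # constant in k
    return const - 0.5 * quad                # one log-likelihood per shift k
```

For the 120x160 frames mentioned in the summary slide, N = 19200, so evaluating all shifts directly would cost on the order of N^2 ≈ 3.7e8 multiply-adds per class, versus a handful of N log N FFTs here; the sign convention is worth sanity-checking against a brute-force np.roll loop on a tiny image.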

  8. Parameter optimization
F = Σ_{T,c} Σ_Z q({T,c}, Z) log p(X, {T,c}, Z)
= Σ_{T,c} Σ_Z q({T,c}) q(Z | {T,c}) ( log π_{T_1,c_1} + Σ_t log p(x_t, z_t | T_t, c_t) + Σ_t log p(c_{t+1} | c_t) + Σ_t log p(T_{t+1} | T_t, c_t) )
Solve ∂F/∂θ = 0 for each model parameter θ, using the estimated q.

  9. On-line vs. batch EM
Example: update equation for the class mean:
Σ_t Σ_T q(T_t, c_t) E[z | x_t, c_t, T_t] = Σ_t q(c_t) μ_{c_t}
Batch EM:
• Solve for μ using all frames.
• Inference and parameter optimization are iterated.
On-line EM:
• Rewrite the equation for one extra frame.
• Establish the relationship between μ(t+1) and μ(t), as sketched below.
• Parameters are updated after each frame; no need for iteration.
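A minimal sketch of the resulting recursive update for one class mean, assuming the E step has already produced the shift-averaged expectation E[z | x_t, c_t]; the names are illustrative.

```python
import numpy as np

class OnlineClassMean:
    """Running estimate of a class mean mu_c, updated one frame at a time."""

    def __init__(self, shape):
        self.mu = np.zeros(shape)    # current estimate mu(t)
        self.weight = 1e-8           # accumulated responsibility sum_t q(c_t)

    def update(self, q_c, expected_z):
        """Fold in one frame; q_c = q(c_t), expected_z = E[z | x_t, c_t]."""
        self.weight += q_c
        # Incremental form of the batch solution
        #   mu = sum_t q(c_t) E[z_t] / sum_t q(c_t):
        # mu(t+1) = mu(t) + (q(c_t) / weight) * (E[z_t] - mu(t)).
        self.mu += (q_c / self.weight) * (expected_z - self.mu)
```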

  10. Reducing the complexity of the M step
Σ_T q(T_t, c_t) E[z | x_t, c_t, T_t] can be expressed as a sum of convolutions. For example, when there is no observation noise, E[z | x_t, c_t, T_t] = T_t^T x_t, and
Σ_T q(T_t, c_t) E[z | x_t, c_t, T_t] = IFFT( FFT(q) .* FFT(x) )
(a similar trick applies to the variance estimates)
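In code this aggregation over shifts is again a single FFT-domain product. Whether it comes out as the slide's convolution FFT(q) .* FFT(x) or as a correlation with a conjugate depends on the sign convention chosen for shifts; the sketch below follows the convention of the sketch after slide 7.

```python
import numpy as np

def expected_latent_image(q_shift, x):
    """sum_T q(T) T^T x in one FFT-domain product (noise-free case).

    q_shift is the posterior over 2-D shifts, laid out as an image the
    same shape as x. The conjugate matches T^T undoing a shift under
    the convention used earlier; with the opposite sign convention the
    conjugate drops out and this becomes IFFT(FFT(q) .* FFT(x)).
    """
    return np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(q_shift))).real
```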

  11. How to deal with scale and rotation?
Represent pixels on a polar grid! Shifts in log-polar coordinates correspond to scale and rotation changes in the Cartesian coordinate system.
[Figure: log-polar grid, with the two shift axes labeled scale and rotation]
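A sketch of the resampling, using SciPy's map_coordinates for the interpolation; the output resolution and the centering are illustrative choices.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def to_log_polar(img, out_shape=(64, 64)):
    """Resample an image onto a log-polar grid around its center.

    Rows index log-radius and columns index angle, so a shift along
    rows is a scale change and a shift along columns is a rotation.
    """
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    n_r, n_a = out_shape
    log_r = np.linspace(0.0, np.log(min(cy, cx)), n_r)
    theta = np.linspace(0.0, 2.0 * np.pi, n_a, endpoint=False)
    r = np.exp(log_r)
    ys = cy + r[:, None] * np.sin(theta)[None, :]
    xs = cx + r[:, None] * np.cos(theta)[None, :]
    return map_coordinates(img, [ys, xs], order=1, mode='nearest')
```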

  12. Estimating the number of classes
• The algorithm is initialized with a single class.
• A new class is introduced whenever the frame likelihood drops below a threshold (sketched below).
• Classes can be merged at the end to achieve a more compact representation.
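A hypothetical sketch of the growing rule from the bullets above; the threshold value and the initialization of a new class are illustrative guesses, not taken from the paper.

```python
import numpy as np

LOG_LIK_THRESHOLD = -5000.0   # illustrative value, not from the paper

def maybe_add_class(means, variances, frame, frame_loglik):
    """Start a new class from the current frame if it is poorly explained."""
    if frame_loglik < LOG_LIK_THRESHOLD:
        means.append(frame.copy())              # new mean: the frame itself
        variances.append(np.ones_like(frame))   # broad initial variance
    return means, variances
```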

  13. Clustering a 20-minute whale watching video

  14. Clustering a 20-minute beach video

  15. Shots from the first class
[Representative frames from the first class, spanning 0 min to 9 min]

  16. Discovering objects using motion priors
[Figures: three characteristic frames from a 240x320 input sequence; a different motion prior predefined for each class; the learned means and variances]

  17. Tracking results

  18. Summary
Now:
• 120x160 images, full color
• 5-10 frames/sec
• On-line EM, approximate inference
• Variable number of clusters
• All possible translations
• Memory efficient
Before (CVPR 99/00):
• 28x44 images, grayscale
• 1 day of computation for 15 sec of video
• Batch EM, exact inference
• Fixed number of clusters
• Limited number of translations
• Memory inefficient

  19. Sneak preview: Panoramic THMMs
P(c=k) = π_k
P(z|c) = N(z; μ_c, Φ_c)
x = WTz, so E[x] = WTμ_c and Var[x] = WTΦ_c T^T W^T
p(x|c,T) = N(x; WTμ_c, WTΦ_c T^T W^T)
(here W maps the transformed latent image to the observed frame)

  20. Video clustering - model
• Appearance (mean and variance)
• Camera/object motion
• Temporal constraints
• Unsupervised learning: the only input is the video

  21. Current implementation
• DirectShow filter for frame clustering (5-15 frames/sec!)
• Translation invariance
• On-line learning algorithm
• Classes repeating across video
Potential applications:
• Video segmentation
• Content-based search/retrieval
• Short video summary creation
• DVD chapter creation

  22. Comparing with layered sprites
Layered sprites (Jojic, CVPR 01) give "perfect" segmentation, but THMM is hundreds to thousands of times faster!

  23. Example with more content
