
Machine Learning for Signal Processing Linear Gaussian Models


Presentation Transcript


  1. Machine Learning for Signal Processing: Linear Gaussian Models. Class 17, 30 Oct 2014. Instructor: Bhiksha Raj

  2. Recap: MAP Estimators • MAP (Maximum A Posteriori): find a statistically “best guess” for y, given known x: y = argmax_Y P(Y|x)

  3. Recap: MAP estimation • x and y are jointly Gaussian • The stacked vector z = [x; y] is therefore also Gaussian

  4. MAP estimation: Gaussian PDF [Figure: joint Gaussian density over the X and Y axes]

  5. MAP estimation: The Gaussian at a particular value of X [Figure: slice of the joint density at X = x0]

  6. Conditional Probability of y|x • The conditional probability of y given x is also Gaussian • The slice in the figure is Gaussian • The mean of this Gaussian is a function of x • The variance of y reduces if x is known • Uncertainty is reduced
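As a concrete illustration of this point (my own sketch, not from the lecture; the numbers are hypothetical), the conditional mean and variance follow the standard jointly-Gaussian formulas μ_y|x = μ_y + Σ_yx Σ_xx⁻¹ (x − μ_x) and σ²_y|x = Σ_yy − Σ_yx Σ_xx⁻¹ Σ_xy:

```python
import numpy as np

# Hypothetical joint Gaussian over (x, y); scalar case for clarity
mu_x, mu_y = 1.0, 2.0
S_xx, S_yy, S_xy = 2.0, 1.5, 1.0

def conditional_gaussian(x0):
    """Mean and variance of P(y | x = x0) when (x, y) are jointly Gaussian."""
    mean = mu_y + (S_xy / S_xx) * (x0 - mu_x)   # mean is a linear function of x0
    var = S_yy - (S_xy / S_xx) * S_xy           # variance shrinks: knowing x reduces uncertainty
    return mean, var

print(conditional_gaussian(x0=0.5))             # (1.75, 1.0); conditional variance < S_yy = 1.5
```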

  7. MAP estimation: The Gaussian at a particular value of X [Figure: the conditional slice at X = x0; its peak is the most likely value of Y]

  8. MAP Estimation of a Gaussian RV [Figure: the MAP estimate of Y at X = x0]

  9. It's also a minimum mean squared error estimate • Minimize the expected squared error • Differentiating and equating to 0: the MMSE estimate is the mean of the distribution
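A compact version of this argument (standard result, sketched here rather than copied from the slide):

```latex
E(\hat{y}) = \mathbb{E}\big[(y-\hat{y})^2 \mid x\big]
\;\;\Rightarrow\;\;
\frac{dE}{d\hat{y}} = 2\hat{y} - 2\,\mathbb{E}[y \mid x] = 0
\;\;\Rightarrow\;\;
\hat{y}_{\mathrm{MMSE}} = \mathbb{E}[y \mid x]
```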

  10. For the Gaussian: MAP = MMSE • The most likely value is also the mean value • Would be true of any symmetric distribution

  11. A Likelihood Perspective • y is a noisy reading of aᵀX • Error e is Gaussian • Estimate a from y = aᵀX + e

  12. The Likelihood of the data • Probability of observing a specific y, given x, for a particular a • Probability of the collection: the product of the individual likelihoods • Assuming IID for convenience (not necessary)

  13. A Maximum Likelihood Estimate • Maximizing the log probability is identical to minimizing the squared error
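A minimal numerical sketch of this equivalence (my illustration with made-up data, not the instructor's code): under Gaussian noise, the maximum-likelihood estimate of a is exactly the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 5, 200
a_true = rng.normal(size=D)
X = rng.normal(size=(D, N))                    # columns are the input vectors x_i
y = a_true @ X + 0.1 * rng.normal(size=N)      # y_i = a^T x_i + Gaussian error e_i

# Maximizing the Gaussian log-likelihood of y is the same as minimizing
# sum_i (y_i - a^T x_i)^2, whose minimizer is the pseudo-inverse solution:
a_ml = y @ np.linalg.pinv(X)
print(np.round(a_ml - a_true, 2))              # close to zero for this much data
```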

  14. A problem with regressions • ML fit is sensitive • Error is squared • Small variations in data → large variations in weights • Outliers affect it adversely • Unstable • If the dimension of X ≥ the no. of instances, (XXᵀ) is not invertible

  15. MAP estimation of weights: y = aᵀX + e • Assume weights drawn from a Gaussian: P(a) = N(0, σ²I) • Maximum likelihood estimate • Maximum a posteriori estimate [Figure: regression model with inputs X, weights a, and error e]

  16. MAP estimation of weights • Similar to ML estimate with an additional term • P(a) = N(0, σ²I) • log P(a) = C − log σ − 0.5 σ⁻² ‖a‖²

  17. MAP estimate of weights • Equivalent to diagonal loading of the correlation matrix • Improves the condition number of the correlation matrix • Can be inverted with greater stability • Will not affect the estimation from well-conditioned data • Also called Tikhonov regularization • Dual form: ridge regression • This is a MAP estimate of the weights • Not to be confused with the MAP estimate of Y
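A hedged sketch of the diagonal-loading view (illustrative Python with an assumed penalty value, not code from the course): the MAP/ridge weights are obtained by adding λI to the correlation matrix XXᵀ before inverting.

```python
import numpy as np

def ridge_weights(X, y, lam=1e-2):
    """MAP / Tikhonov-regularized weights for y ≈ a^T X.

    X: (D, N) data matrix, y: (N,) targets; lam plays the role of the
    noise-to-prior variance ratio in the MAP derivation.
    """
    D = X.shape[0]
    # Diagonal loading: XX^T + lam*I has a better condition number than XX^T,
    # so it can be inverted stably even when D >= N.
    return np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y)
```

As λ → 0 this approaches the plain least-squares solution on well-conditioned data, which is the "will not affect well-conditioned data" point above.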

  18. MAP estimate priors [Figure] • Left: Gaussian prior on w • Right: Laplacian prior

  19. MAP estimation of weights with Laplacian prior • Assume weights drawn from a Laplacian: P(a) = λ⁻¹ exp(−λ⁻¹‖a‖₁) • Maximum a posteriori estimate • No closed-form solution • Quadratic programming solution required • Non-trivial

  20. MAP estimation of weights with Laplacian prior • Assume weights drawn from a Laplacian: P(a) = λ⁻¹ exp(−λ⁻¹‖a‖₁) • Maximum a posteriori estimate • … • Identical to L1-regularized least-squares estimation

  21. L1-regularized LSE • No closed-form solution • Quadratic programming solutions required • Dual formulation: “LASSO” (Least Absolute Shrinkage and Selection Operator) minimizes the squared error subject to a constraint on ‖a‖₁

  22. LASSO Algorithms • Various convex optimization algorithms • LARS: least angle regression • Pathwise coordinate descent, … • Matlab code available on the web
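For reference, a small Python alternative to the Matlab code the slide mentions (my own sketch with synthetic data), using scikit-learn's Lasso, which implements pathwise coordinate descent:

```python
import numpy as np
from sklearn.linear_model import Lasso        # coordinate-descent L1-regularized LSE

rng = np.random.default_rng(0)
N, D = 100, 20
a_true = np.zeros(D)
a_true[:3] = [2.0, -1.5, 1.0]                 # sparse ground-truth weights
X = rng.normal(size=(N, D))                   # rows are instances (sklearn convention)
y = X @ a_true + 0.1 * rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, y)            # alpha is the L1 penalty weight
print(np.flatnonzero(lasso.coef_))            # most coefficients are driven exactly to zero
```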

  23. Regularized least squares • Regularization results in selection of a suboptimal (in the least-squares sense) solution • One of the loci outside the center • Tikhonov regularization selects the shortest solution • L1 regularization selects the sparsest solution (Image credit: Tibshirani)

  24. LASSO and Compressive Sensing • Given Y and X, estimate sparse a • LASSO: X = explanatory variable, Y = dependent variable, a = regression weights • CS: X = measurement matrix, Y = measurement, a = data [Figure: Y = Xa]

  25. MAP / ML / MMSE • General statistical estimators • All used to predict a variable, based on other parameters related to it • Most common assumption: data are Gaussian; all RVs are Gaussian • Other probability densities may also be used • For Gaussians the relationships are linear, as we saw

  26. Gaussians and more Gaussians… • Linear Gaussian Models • But first, a recap

  27. A Brief Recap • Principal component analysis: find the K bases that best explain the given data • Find B and C such that the difference between D and BC is minimized • While constraining the columns of B to be orthonormal
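A brief sketch of this decomposition (my illustration, reusing the slide's B, C, D names and assuming the data in D are already centered): a rank-K truncated SVD yields orthonormal bases B and weights C minimizing ‖D − BC‖².

```python
import numpy as np

def pca_bases(D, K):
    """Return B (d x K, orthonormal columns) and C (K x N) minimizing ||D - B C||_F."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    B = U[:, :K]                     # the K best bases
    C = np.diag(s[:K]) @ Vt[:K]      # weights of every data point on those bases
    return B, C

D = np.random.default_rng(0).normal(size=(10, 50))   # toy data: 10 dimensions, 50 points
B, C = pca_bases(D, K=3)
print(np.allclose(B.T @ B, np.eye(3)))               # columns of B are orthonormal
```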

  28. Remember Eigenfaces • Approximate every face f as f = wf,1V1 + wf,2V2 + wf,3V3 + … + wf,kVk • Estimate V to minimize the squared error • The error is the part unexplained by V1 … Vk • The error is orthogonal to the Eigenfaces

  29. Karhunen-Loève vs. PCA • Eigenvectors of the correlation matrix: principal directions of the tightest ellipse centered on the origin; directions that retain maximum energy

  30. Karhunen-Loève vs. PCA • Eigenvectors of the correlation matrix: principal directions of the tightest ellipse centered on the origin; directions that retain maximum energy • Eigenvectors of the covariance matrix: principal directions of the tightest ellipse centered on the data; directions that retain maximum variance

  31. Karhunen-Loève vs. PCA • Eigenvectors of the correlation matrix (the Karhunen-Loève Transform): principal directions of the tightest ellipse centered on the origin; directions that retain maximum energy • Eigenvectors of the covariance matrix: principal directions of the tightest ellipse centered on the data; directions that retain maximum variance

  32. Karhunen-Loève vs. PCA • Eigenvectors of the correlation matrix (the Karhunen-Loève Transform): principal directions of the tightest ellipse centered on the origin; directions that retain maximum energy • Eigenvectors of the covariance matrix (Principal Component Analysis): principal directions of the tightest ellipse centered on the data; directions that retain maximum variance
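A small numerical illustration of this distinction (my own example, not from the lecture): for data that is not centered at the origin, the leading eigenvector of the correlation (second-moment) matrix differs from that of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-D data: elongated along the first axis, shifted away from the origin
X = rng.normal(size=(2, 500)) * np.array([[3.0], [0.5]]) + np.array([[5.0], [2.0]])

R = X @ X.T / X.shape[1]          # correlation (second-moment) matrix, about the origin -> KLT
C = np.cov(X)                     # covariance matrix, about the data mean -> PCA

_, klt_vecs = np.linalg.eigh(R)
_, pca_vecs = np.linalg.eigh(C)
print(klt_vecs[:, -1])            # leading KLT direction is pulled toward the data mean
print(pca_vecs[:, -1])            # leading PCA direction follows the spread about the mean
```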

  33. Karhunen-Loève vs. PCA • If the data are naturally centered at the origin, KLT == PCA • The following slides refer to PCA! • Assume data centered at the origin for simplicity • Not essential, as we will see

  34. Remember Eigenfaces • Approximate every face f as f = wf,1V1 + wf,2V2 + wf,3V3 + … + wf,kVk • Estimate V to minimize the squared error • The error is the part unexplained by V1 … Vk • The error is orthogonal to the Eigenfaces

  35. Eigen Representation: x1 = w11V1 + e1 • K-dimensional representation • Error is orthogonal to the representation • Weight and error are specific to the data instance [Figure: illustration assuming 3D space]

  36. Representation: x2 = w12V1 + e2 • Error is at 90° to the eigenface • K-dimensional representation • Error is orthogonal to the representation • Weight and error are specific to the data instance [Figure: illustration assuming 3D space]

  37. Representation • K-dimensional representation • Error is orthogonal to the representation • All data with the same representation wV1 lie on a plane orthogonal to wV1

  38. With 2 bases: x1 = w11V1 + w21V2 + e1 • Error is at 90° to the eigenfaces • K-dimensional representation • Error is orthogonal to the representation • Weight and error are specific to the data instance [Figure: illustration assuming 3D space]

  39. With 2 bases: x2 = w12V1 + w22V2 + e2 • Error is at 90° to the eigenfaces • K-dimensional representation • Error is orthogonal to the representation • Weight and error are specific to the data instance [Figure: illustration assuming 3D space]

  40. In Vector Form: Xi = w1iV1 + w2iV2 + ei • Error is at 90° to the eigenfaces • K-dimensional representation • Error is orthogonal to the representation • Weight and error are specific to the data instance

  41. In Vector Form: Xi = w1iV1 + w2iV2 + ei • Error is at 90° to the eigenface • K-dimensional representation • x is a D-dimensional vector • V is a D × K matrix • w is a K-dimensional vector • e is a D-dimensional vector

  42. Learning PCA • For the given data: find the K-dimensional subspace that captures most of the variance in the data • The variance in the remaining subspace is minimal

  43. Constraints • VᵀV = I: eigenvectors are orthogonal to each other • For every vector, the error is orthogonal to the eigenvectors: eᵀV = 0 • Over the collection of data: average wwᵀ = diagonal (eigen representations are uncorrelated) • eᵀe = minimum: the error variance is minimum • The mean of the error is 0
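These constraints are easy to verify numerically; a self-contained sketch (illustrative only, on random centered data) follows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 200))
X -= X.mean(axis=1, keepdims=True)        # center the data (columns are instances)
K = 3

U, _, _ = np.linalg.svd(X, full_matrices=False)
V = U[:, :K]                              # eigenvector bases (D x K)
W = V.T @ X                               # K-dimensional representations
E = X - V @ W                             # residual error

print(np.allclose(V.T @ V, np.eye(K)))                      # V^T V = I
print(np.allclose(E.T @ V, 0, atol=1e-10))                  # e^T V = 0: error orthogonal to bases
offdiag = np.cov(W) - np.diag(np.diag(np.cov(W)))
print(np.allclose(offdiag, 0, atol=1e-10))                  # average w w^T is diagonal
```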

  44. A Statistical Formulation of PCA • x is a random variable generated according to a linear relation: x = Vw + e • w is drawn from a K-dimensional Gaussian with diagonal covariance • e is drawn from a 0-mean, (D−K)-rank, D-dimensional Gaussian • Estimate V (and B) given examples of x

  45. Linear Gaussian Models!! • x is a random variable generated according to a linear relation • w is drawn from a Gaussian • e is drawn from a 0-mean Gaussian • Estimate V given examples of x • In the process also estimate B and E

  46. Linear Gaussian Models!! • x is a random variable generated according to a linear relation • w is drawn from a Gaussian • e is drawn from a 0-mean Gaussian • Estimate V given examples of x • In the process also estimate B and E • PCA is a specific instance of a linear Gaussian model with particular constraints: B = diagonal, VᵀV = I, E is low rank

  47. Linear Gaussian Models • Observations are linear functions of two uncorrelated Gaussian random variables • A “weight” variable w • An “error” variable e • Error not correlated to weight: E[eᵀw] = 0 • Learning LGMs: estimate the parameters of the model given instances of x • The problem of learning the distribution of a Gaussian RV

  48. LGMs: Probability Density • The mean of x • The covariance of x (see the expressions below)
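Sketched here under the model x = μ + Vw + e, with w ~ N(0, B) and e ~ N(0, E) uncorrelated, the standard expressions are:

```latex
\mathbb{E}[x] = \mu + V\,\mathbb{E}[w] + \mathbb{E}[e] = \mu,
\qquad
\operatorname{Cov}(x) = V B V^{\top} + E
```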

  49. The probability of x • x is a linear function of Gaussians: x is also Gaussian • Its mean and variance are as given above

  50. Estimating the variables of the model • Estimating the variables of the LGM is equivalent to estimating P(x) • The variables are μ, V, B, and E
