Machine Learning for Signal Processing (11-755/18-797): Principal Component Analysis & Independent Component Analysis. Class 8, 24 Feb 2015. Instructor: Bhiksha Raj
Recall: Representing images • The most common element in the image: background • Or rather, large regions of relatively featureless shading • Uniform sequences of numbers
“Bases” • “Bases” are the “standard” units such that all instances can be expressed as weighted combinations of these units • Ideal requirement: bases must be orthogonal • Checkerboards are one choice of bases • Orthogonal • But not “smooth” • Other choices of bases: complex exponentials, wavelets, etc. • (Figure: checkerboard basis images B1 through B6)
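A small sketch of this idea in NumPy: with orthogonal, unit-length bases the weights are just inner products, and the weighted combination reconstructs the signal exactly. The Haar-like vectors below merely stand in for checkerboard patterns; the numbers are illustrative.

```python
import numpy as np

# A toy 4-sample "signal" and an orthonormal basis (Haar-like vectors
# standing in for checkerboard patterns; the numbers are illustrative).
x = np.array([3.0, 1.0, 2.0, 4.0])
B = np.array([[1.0,  1.0,  1.0,  1.0],
              [1.0,  1.0, -1.0, -1.0],
              [1.0, -1.0,  0.0,  0.0],
              [0.0,  0.0,  1.0, -1.0]])
B = B / np.linalg.norm(B, axis=1, keepdims=True)  # make each basis vector unit length

w = B @ x          # weights: inner product of the signal with each basis vector
x_hat = B.T @ w    # weighted combination of the bases
print(np.allclose(x, x_hat))  # True: orthogonal bases reconstruct the signal exactly
```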
Recap: Data-specific bases? • Checkerboards, complex exponentials, and wavelets are data-agnostic • We use the same bases regardless of the data we analyze • Image of a face vs. image of a forest • Segment of speech vs. seismic rumble • How about data-specific bases? • Bases that consider the underlying data • E.g., is there something better than checkerboards to describe faces? • Something better than complex exponentials to describe music?
Recap: The Energy Compaction Property • Define “better”? • The description: a weighted combination of bases, truncated after some number of terms • The ideal: if the description is terminated at any point, we should still get most of the information about the data • The error should be small
A collection of least squares typical faces • Assumption: there is a set of K “typical” faces that captures most of every face • Approximate every face f as f = wf,1 V1 + wf,2 V2 + wf,3 V3 + ... + wf,K VK • V = [V1 V2 ... VK] • Estimate V to minimize the squared error • How? What is V?
Abstracting the problem: Finding the FIRST typical face • The problem of finding the first typical face V1: find the V for which the total projection error is minimum! • This “minimum squared error” V is our “best” first typical face V1 • (Figure: data points in the Pixel 1 vs. Pixel 2 plane and a candidate direction)
Formalizing the Problem: Error from approximating a single vector • Projection of a vector x onto a vector v (assuming v is of unit length): vvTx • Approximating: x ≈ wv, with w = vTx • Error vector: x − vvTx • Error length: ||x − vvTx||²
With multiple bases • V = [v1 v2 ... vK] represents a K-dimensional subspace • Projection of a vector x onto the subspace: VVTx • Error vector: x − VVTx • Error length: ||x − VVTx||²
With multiple bases • Error for one vector: e(x) = ||x − VVTx||² • Error for many vectors x1, x2, ..., xN: E = Σi ||xi − VVTxi||² • Goal: estimate V to minimize this error!
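A small NumPy sketch of these error terms, with random vectors standing in for the data and a random orthonormal V (both placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 100))                 # 100 data vectors of dimension 10 (columns)
V, _ = np.linalg.qr(rng.normal(size=(10, 3)))  # 3 random orthonormal bases (columns)

proj = V @ (V.T @ X)                # projections V V^T x_i for every vector
residual = X - proj                 # error vectors x_i - V V^T x_i
total_error = np.sum(residual**2)   # E = sum_i ||x_i - V V^T x_i||^2
print(total_error)
```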
The correlation matrix • Expanding the total error brings in the encircled term Σi xi xiT = XXT, where X is the data matrix (one data vector per column) and XT is its transpose • This term, R = XXT, is the correlation matrix
The best “basis” • The minimum-error basis v is found by solving Rv = λv • v is an eigenvector of the correlation matrix R • λ is the corresponding eigenvalue
Minimizing error • With the constraint VTV = I, minimize the total error using a Lagrangian with multiplier matrix Λ • Λ is a diagonal matrix • The constraint simply ensures that each vTv = 1 • Differentiating w.r.t. V and equating to 0 gives RV = VΛ
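As a numerical sanity check of this derivation (a sketch with random placeholder data), the eigenvectors of R satisfy both the constraint and the stationarity condition:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 50))        # placeholder data matrix
R = X @ X.T                         # correlation matrix

eigvals, V = np.linalg.eigh(R)      # columns of V are the eigenvectors of R
Lam = np.diag(eigvals)              # diagonal matrix of eigenvalues

print(np.allclose(V.T @ V, np.eye(8)))  # the constraint V^T V = I holds
print(np.allclose(R @ V, V @ Lam))      # the stationarity condition R V = V Lambda holds
```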
Finding the optimal K bases • Total error = the sum of the eigenvalues of the discarded directions, λK+1 + λK+2 + ... + λN • Select the K eigenvectors corresponding to the K largest eigenvalues
Eigen Faces! • Arrange your input data into a matrix X • Compute the correlation R = XXT • Solve the eigendecomposition RV = VΛ • The eigenvectors corresponding to the K largest eigenvalues are our optimal bases • We will refer to these as eigen faces.
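A minimal NumPy sketch of the recipe above, using a random matrix as a stand-in for vectorized face images (the data, the image size, and the choice K = 16 are placeholders): build R = XXT, keep the eigenvectors with the largest eigenvalues, and check that the residual error equals the sum of the discarded eigenvalues, as claimed on the previous slide.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 200))         # placeholder: 200 vectorized 8x8 "faces", one per column

R = X @ X.T                            # correlation matrix R = X X^T
eigvals, eigvecs = np.linalg.eigh(R)   # R is symmetric, so use eigh
order = np.argsort(eigvals)[::-1]      # sort eigenvalues, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

K = 16                                 # hypothetical number of eigen faces to keep
V = eigvecs[:, :K]                     # the K optimal bases ("eigen faces")
W = V.T @ X                            # weights of every face in this basis
X_hat = V @ W                          # rank-K approximation of the data

# Total squared error equals the sum of the discarded eigenvalues
print(np.allclose(np.sum((X - X_hat)**2), np.sum(eigvals[K:])))
```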
Energy Compaction: Principle • Find the directions that capture the most energy
What about Prediction? • Does X predict Y? • Is the relationship linear or affine? • (Figure: scatter of Y against X, with the predicted value queried at X = X1)
Linear vs. Affine • The model we saw: approximate every face f as f = wf,1 V1 + wf,2 V2 + ... + wf,K VK • A linear combination of bases • If you add a constant: f = wf,1 V1 + wf,2 V2 + ... + wf,K VK + m • An affine combination of bases
Estimation with the Constant • Estimate all parameters of f = wf,1 V1 + wf,2 V2 + ... + wf,K VK + m • Parameters: wf,1, V1, wf,2, V2, ..., wf,K, VK, and m
Problem • f = wf,1 V1 + wf,2 V2 + ... + wf,K VK + m • Find the slope of the line • Find the projection of the points on the line • Find the intercept m • Problem: any “m” on the line will work (the w’s vary with m)
Proof by assertion • Estimate all parameters of f = wf,1 V1 + wf,2 V2 + ... + wf,K VK + m • The mean of all the vectors “f” will lie on the plane!
Estimating the remaining parameters • Set m to the mean of the data (which lies on the plane), subtract it from every vector, and estimate V and the weights exactly as before
Estimating the Affine model • The bases V are the eigenvectors of the covariance matrix of the mean-subtracted data, corresponding to the K largest eigenvalues
Properties of the affine model • For any order K, this affine model retains the maximum variance about the mean
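A sketch of the affine estimation sequence above (random placeholder data, arbitrary choice K = 16): set m to the mean, center the data, and run the same eigen analysis on the covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 200)) + 5.0    # placeholder data with a nonzero mean

m = X.mean(axis=1, keepdims=True)        # the constant term: the mean of the data
Xc = X - m                               # centered (zero-mean) data
C = (Xc @ Xc.T) / X.shape[1]             # covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
V = eigvecs[:, order[:16]]               # top-16 PCA bases (K = 16 is arbitrary here)

W = V.T @ Xc                             # weights of the affine model
X_hat = m + V @ W                        # f ~ m + w_1 V_1 + ... + w_K V_K
print(np.sum((X - X_hat)**2))            # residual error of the affine approximation
```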
Linear vs. Affine • The model we saw: approximate every face f as f = wf,1 V1 + wf,2 V2 + ... + wf,K VK • The Karhunen-Loève Transform • Retains maximum energy for any order K • If you add a constant: f = wf,1 V1 + wf,2 V2 + ... + wf,K VK + m • Principal Component Analysis • Retains maximum variance for any order K
How do they relate? • Relationship between the correlation matrix and the covariance matrix: R = C + mmT • Karhunen-Loève bases are eigenvectors of R • PCA bases are eigenvectors of C • How do they relate? • Not easy to say.
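A quick numerical check of the stated relationship, using the averaged forms R = E[xxT] and C = E[(x − m)(x − m)T] on random placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1000)) + 2.0   # placeholder data with a nonzero mean
N = X.shape[1]

m = X.mean(axis=1, keepdims=True)
R = (X @ X.T) / N                      # correlation matrix (averaged form)
C = ((X - m) @ (X - m).T) / N          # covariance matrix (averaged form)

print(np.allclose(R, C + m @ m.T))     # True: R = C + m m^T
```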
The Eigen vectors • The eigenvectors of C are the major axes of the ellipsoid traced out by Cv, where v ranges over the unit sphere
The Eigen vectors • The eigenvectors of R are the major axes of the ellipsoid traced out by Cv + mmTv • Note that mmT has rank 1, so mmTv sweeps out a line
The Eigen vectors • The principal eigenvector of R lies between the principal eigenvector of C and m • Similarly for the principal eigenvalue • Similar logic is not easily extendable to the other eigenvectors, however
Eigenvectors • Turns out: the eigenvectors of the correlation matrix represent the major and minor axes of an ellipse, centered at the origin, that encloses the data most compactly • The SVD of the data matrix X uncovers these vectors • This is the KLT • (Figure: data in the Pixel 1 vs. Pixel 2 plane with an origin-centered ellipse)
Eigenvectors • Turns out: the eigenvectors of the covariance matrix represent the major and minor axes of an ellipse, centered at the mean, that encloses the data most compactly • PCA uncovers these vectors • In practice, “eigen faces” refers to PCA faces, not KLT faces • (Figure: data in the Pixel 1 vs. Pixel 2 plane with a mean-centered ellipse)
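A sketch of the claim that the SVD of the (here, mean-centered) data matrix uncovers the same directions: the left singular vectors match the eigenvectors of the covariance up to a sign flip. The data is a random placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 500))
Xc = X - X.mean(axis=1, keepdims=True)             # center, so we get PCA rather than KLT

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # left singular vectors of the data
C = (Xc @ Xc.T) / Xc.shape[1]
eigvals, eigvecs = np.linalg.eigh(C)
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]    # eigenvectors of the covariance, sorted

# Same directions up to a sign flip: every column-wise dot product has magnitude 1
print(np.allclose(np.abs(np.sum(U * eigvecs, axis=0)), 1.0))
```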
What about sound? • Finding eigen bases for speech signals: they look like DFT/DCT bases, or wavelets • DFTs are pretty good most of the time
Eigen Analysis • Can often find surprising features in your data • Trends, relationships, more • Commonly used in recommender systems • An interesting example:
Eigen Analysis • Cheng Liu’s research on pipes • SVD automatically separates useful and uninformative features
Correlation vs. Causation • The consumption of burgers has gone up steadily in the past decade • In the same period, the penguin population of Antarctica has gone down • Correlation, not causation (unless McDonald’s has a top-secret Antarctica division)
The concept of correlation • Two variables are correlated if knowing the value of one gives you information about the expected value of the other • (Figure: penguin population and burger consumption plotted against time)
The statistical concept of correlatedness • Two variables X and Y are correlated if knowing X gives you an expected value of Y • X and Y are uncorrelated if knowing X tells you nothing about the expected value of Y • Although it could give you other information • How?
A brief review of basic probability • Uncorrelated: two random variables X and Y are uncorrelated iff the average value of their product equals the product of their individual averages • Setup: each draw produces one instance of X and one instance of Y, i.e. one instance of (X, Y) • E[XY] = E[X]E[Y] • The average value of Y is the same regardless of the value of X
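A small sketch of this definition with synthetic data: for independent draws, the sample average of the product matches the product of the sample averages, while for a correlated pair it does not.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

x = rng.normal(size=N)                  # independent draws: uncorrelated
y = rng.normal(size=N)
print(np.mean(x * y), np.mean(x) * np.mean(y))    # both near 0

y2 = 0.8 * x + 0.2 * rng.normal(size=N)           # y2 depends on x: correlated
print(np.mean(x * y2), np.mean(x) * np.mean(y2))  # ~0.8 vs. ~0: E[XY] != E[X]E[Y]
```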
Correlated Variables • Expected value of Y given X: find the average of the Y values of all samples at (or close to) the given X • If this is a function of X, then X and Y are correlated • (Figure: penguin populations P1, P2 at burger-consumption levels b1, b2)
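A sketch of the binning recipe above on synthetic data (the burger/penguin numbers are made up for illustration): average the Y values of the samples whose X lands in each bin; if that average varies with X, the variables are correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=5000)                  # e.g. burger consumption (synthetic)
y = -2.0 * x + rng.normal(scale=3.0, size=5000)    # e.g. penguin population (synthetic)

edges = np.linspace(0, 10, 11)                     # bins of samples "at or close to" each X
which = np.digitize(x, edges)
cond_mean = [y[which == b].mean() for b in range(1, len(edges))]
print(np.round(cond_mean, 1))   # the conditional mean falls with X, so X and Y are correlated
```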
Uncorrelatedness • Knowing X does not tell you what the average value of Y is • And vice versa • (Figure: average income against burger consumption at levels b1, b2)
Uncorrelated Variables • The average value of Y is the same regardless of the value of X, and vice versa • (Figure: average income vs. burger consumption, shown both as X as a function of Y and as Y as a function of X)
Uncorrelatedness • Which of the above represent uncorrelated RVs?
The notion of decorrelation • So how does one transform the correlated variables (X, Y) into the uncorrelated (X’, Y’)? • (Figure: original X and Y axes and rotated X’ and Y’ axes)
What does “uncorrelated” mean? • Assuming zero mean: E[X’] = constant (0) and E[Y’] = constant (0) • E[X’|Y’] = 0 • E[X’Y’] = EY’[Y’ E[X’|Y’]] = 0 • If Y is a matrix of vectors, YYT = diagonal
Decorrelation • Let X be the matrix of correlated data vectors • Each component of X informs us of the mean trend of the other components • We need a transform M such that, for Y = MX, the covariance of Y is diagonal • YYT is the covariance if Y is zero mean • YYT = diagonal • MXXTMT = diagonal • M·Cov(X)·MT = diagonal
Decorrelation • Easy solution: eigendecomposition of Cov(X): Cov(X) = EΛET, with EET = I • Let M = ET • M·Cov(X)·MT = ETEΛETE = Λ = diagonal • PCA: Y = MX • Diagonalizes the covariance matrix • “Decorrelates” the data
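A sketch of this decorrelating transform on synthetic correlated data: take the eigendecomposition of Cov(X), set M = ET, and verify that the covariance of Y = MX is (numerically) diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))                   # mixing matrix creates correlated components
X = A @ rng.normal(size=(3, 10_000))
X = X - X.mean(axis=1, keepdims=True)         # zero-mean data

C = (X @ X.T) / X.shape[1]                    # Cov(X)
L, E = np.linalg.eigh(C)                      # Cov(X) = E L E^T, with E E^T = I

M = E.T                                       # the decorrelating transform
Y = M @ X                                     # PCA: Y = M X
CY = (Y @ Y.T) / Y.shape[1]                   # covariance of the transformed data

off_diag = CY - np.diag(np.diag(CY))
print(np.max(np.abs(off_diag)))               # ~0: the covariance of Y is diagonal
```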