
Computational Statistics with Application to Bioinformatics


Presentation Transcript


  1. Computational Statistics with Application to Bioinformatics
Prof. William H. Press, Spring Term, 2008, The University of Texas at Austin
Unit 14: SVD, PCA, and the Linear Perspective
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press

  2. Unit 14: SVD, PCA, and the Linear Perspective (Summary)
- Consider a big data or design matrix as a set of points in M-space
  - standardize each dimension to zero mean and unit variance, and limit outliers, so as not to skew an L2 view of the world
- Singular Value Decomposition (SVD) gives orthogonal row and column bases and a diagonal mapping between them
  - SVD is also the best "greedy" low-rank approximation scheme
  - there exist efficient algorithms for the decomposition
- Principal Component Analysis (PCA): project the data along the (largest) eigendirections of its covariance matrix
  - SVD makes this easy
  - the problem is that "mathematically orthogonal" is only partly related to "distinct main effects": sometimes true of the largest SV, rarely true after that
  - K main effects often do lie in the subspace of K SVs, just all mixed up!
- Not always easy to tell when decreasing SVs are just noise
  - even a Gaussian random matrix has structure in its SV distribution: they're sorted, hence have a nontrivial order statistic
  - be suspicious of "X number of SVs explain Y fraction of the variance"
- A good use for PCA (or SVD) is dimensional reduction
  - illustrate on yeast gene expression data: reduce from 300 to 20, or even 5, dimensions
- Eigengenes and eigenarrays: mathematically meaningful, but hard to interpret
  - illustrate on an example constructed with two clusters of co-expressed genes
- Non-Negative Matrix Factorization: a good idea, but…

  3. Linear "Stuff": SVD, PCA, Eigengenes, and all that!
We start with a "data matrix" or "design matrix": X = {Xij}. Let's use gene expression data as the example. (I've already somewhat clustered this matrix by permuting rows and cols – doesn't matter how. That's so that you don't just see a sea of random red and blue dots. Nothing we do in this unit will care about the row or col ordering.)
The N rows are data points, here genes 1–500. The M columns are the responses; view each row as a vector in M (here 300) dimensional space. For gene expression data, each column is a separate microarray experiment under a different condition.

  4. In this unit, we always assume that the individual experiments (columns of X) have zero mean. While we're at it, we might as well also scale to unit standard deviation. And, it's a good idea to eliminate outliers. Matlab for this (and plotting the previous picture) is:

    load yeastarray_t2.txt;
    size(yeastarray_t2)
    ans =
       500   300
    yclip = prctile(yeastarray_t2(:),[1,99])   % clip outliers by percentile (bottom and top 1%)
    yclip =
      -204   244
    data = max(yclip(1),min(yclip(2),yeastarray_t2));
    dmean = mean(data,1);
    dstd = std(data,1);
    data = (data - repmat(dmean,[size(data,1),1]))./repmat(dstd,[size(data,1),1]);
    % arcane Matlab colormap stuff for making a blue-to-white-to-red plot
    genecolormap = [min(1,(1:64)/32); 1-abs(1-(1:64)/32); min(1,(64-(1:64))/32)]';
    colormap(genecolormap);
    image(20*data+32)   % saturates to red or blue at ~1.5 sigma

  5. Singular Value Decomposition (SVD)
Any matrix X (needn't be square) can be decomposed, more-or-less uniquely, as

    X = U S V^T

where the columns of U are an orthonormal basis for an M dimensional subspace of N-space, S = diag(s1, s2, ..., sM) is a diagonal matrix of positive "singular values" arranged from largest to smallest, and the rows of V^T (columns of V) are a complete orthonormal basis for M dimensional space.
Degree of freedom (DOF) counting:

    MN = [MN - M(M+1)/2]  +  M  +  [M^2 - M(M+1)/2]
            (U)              (S)        (V)

The decomposition has an efficient algorithm (of order the same workload as inverting a matrix). Matlab and NR3 have ready-to-use implementations.
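
A quick sanity check of these claims in Matlab (not on the slide; it assumes data is the standardized 500 x 300 matrix built on the previous slide):

    [U,S,V] = svd(data,0);            % "economy" SVD: U is 500x300, S and V are 300x300
    norm(data - U*S*V','fro')         % reconstruction error is at machine-precision level
    norm(U'*U - eye(300),'fro')       % columns of U are orthonormal
    norm(V'*V - eye(300),'fro')       % V is orthogonal
    issorted(flipud(diag(S)))         % singular values are sorted from largest to smallest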

  6. We can write out the (middle) sum over the singular values explicitly:

    X = sum_{i=1..M} s_i U_{.i} ⊗ V_{.i}     (a sum of rank-1 outer products s_i U_{.i} V_{.i}^T)

Each column of U gets paired with the corresponding row of V^T (or column of V). This turns out to be the optimal decomposition of X into rank-1 matrices, optimal in the sense that the partial sums converge in the "greediest" way in the L2 norm. (I.e., at each stage in the sum, there is no better decomposition.)
If the data actually lie on a lower dimensional (than M) hyperplane that goes through the origin, then only that many s_i's will be nonzero. That is why we subtracted the means! Or, in the general case, we can just truncate the sum to get the best lower-rank approximation to X. This can be useful for filtering out noise (we will see).
Notice that this captures only a "linear" (hyperplane) view of the world. More complicated functional relationships that might decrease dimensionality are not, in general, identified by SVD or PCA.
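
The truncation claim, as a minimal Matlab sketch (not on the slide; U, S, V as in the check above, and the cutoff k = 20 is just an illustrative choice):

    k = 20;                                    % keep only the k largest singular values
    Xk = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';    % best rank-k approximation to data in the L2 sense
    norm(data - Xk,'fro')                      % truncation error ...
    s = diag(S);
    sqrt(sum(s(k+1:end).^2))                   % ... equals the root-sum-square of the discarded SVs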

  7. Principal Component Analysis (PCA)
[Figure: the data cloud with its principal directions V.1, V.2, V.3 drawn in.]
Note that the (sample) covariance of the data is (up to the factor 1/N):

    X^T X = (U S V^T)^T (U S V^T) = V S U^T U S V^T = V S^2 V^T     (S^2 diagonal)

using the fact that we subtracted the means, so that X^T X is proportional to the covariance <x x^T>.
So V is a rotation matrix that diagonalizes the covariance matrix. It follows that the data points in X have their largest variance in the V.1 direction. Then, in the orthogonal hyperplane, the 2nd largest variance is in the V.2 direction. And so forth.
So we might usefully coordinatize the data points by their M projections along the V.i directions (instead of their M raw components). These projections are a matrix the same shape as X; since the directions are the orthonormal columns of V, it is simply

    X V = U S V^T V = U S

whose rows are the data points and whose column components are the principal coordinates of that row.
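
A minimal Matlab check of both statements (not on the slide; data, U, S, V as above):

    N = size(data,1);
    C = (data'*data)/N;                  % sample covariance (columns already have zero mean)
    D = V'*C*V;                          % = S^2/N, so it should be diagonal
    max(max(abs(D - diag(diag(D)))))     % off-diagonal entries are at machine-precision level
    norm(data*V - U*S,'fro')             % projections onto the V directions are exactly U*S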

  8. Also, it's easy to see that (by construction) the principal components of the points are uncorrelated, that is, have a diagonal correlation matrix:

    (X V)^T (X V) = (U S)^T (U S) = S U^T U S = S^2     (diagonal)

Let's plot our expression data in the plane of the top 2 principal components:

    [U S V] = svd(data, 0);
    pcacoords = U*S;
    plot(pcacoords(:,1),pcacoords(:,2),'r.')
    axis equal

Direction 1 has larger variance than direction 2, and there is no correlation between the two, all as advertised.

  9. [Figure: the PCA scatter plot, with hoped-for labels on the axes: "nucleus" vs. "cytoplasm", "early in cell cycle" vs. "late in cell cycle".]
People who love PCA (I call them "linear thinkers") always hope that the principal coordinates will magically correspond to distinct, real effects ("main effects"). Not! This is sometimes true for the 1st principal component, and rarely true after that. I think the reason is that orthogonality (in the mathematical sense of SVD) is rarely a useful decomposition into "distinct main effects", which tend to be highly correlated mathematically even when they are "verbally orthogonal".
However, it is often true that ~K main effects are captured (somewhere) in the subspace of the first ~K principal components. So PCA is a useful technique for dimensional reduction. Just don't try to [over]interpret the meaning of individual coordinates! (Let's see examples.)

  10. The squares of the SVs are proportional to the portion of the total variance (L2 norm of X) that each accounts for:

    ssq = diag(S).^2;
    plot(ssq)
    semilogy(ssq,'.b')

Where do these values start to be explainable simply as noise? Here? Or here? Or here? (Of course, we'll never really know from a single data set, since, in the limit of fine-grain effects, signal is just repeatable noise!) But we can often make a guess after examining the data.

  11. One way to inform our guess is to compare to a matrix of Gaussian random deviates:

    fakedata = randn(500,300);
    [Uf Sf Vf] = svd(fakedata,0);
    sfsq = diag(Sf).^2;
    semilogy(sfsq,'.r')

Why does the fake data show a trend at all? Because even random numbers are monotonic if you sort them! We are seeing the "order statistics" for SVs from a Gaussian random matrix.
The fake curve has to be higher than the real one here [in the tail of the plot], because the areas under the two curves (if they were plotted on a linear scale) have to be the same (same total variance, or L2 norm).
I'd say it's questionable that there's much information in our real data set beyond around here [an arrow on the plot].

  12. Sometimes people plot the fractional variance explained as a function of the number of SVs, which also shows how we converge to the exact SV decomposition:

    ssqnorm = cumsum(ssq)/sum(ssq);
    sfsqnorm = cumsum(sfsq)/sum(sfsq);
    plot(ssqnorm,'b')
    hold on
    plot(sfsqnorm,'r')
    hold off

[Figure: cumulative fraction of variance for the yeast data (blue) and the Gaussian random matrix (red), with a straight line for reference.]
You might have expected the Gaussian random curve to be close to the straight line, since each random SV should explain about the same amount of variance. But, as before, we're seeing the (sorted) order-statistics effect. So it is actually rather hard to interpret from this plot (if you had only the blue curve) what is real versus noise, and how impressed you should be by a rapid initial rise.

  13. For the data in this example, a sensible use of PCA (i.e., SVD) would be to project the data into the subspace of the first ~20 SVs, where we can be sure that it is not noise:

    strunc = diag(S);
    strunc(21:end) = 0;
    filtdata = U*diag(strunc)*V';
    colormap(genecolormap);
    image(20*filtdata+32)

  14. [Figure: the original data set, shown for comparison with the filtered version on the previous slide.]

  15. Or, keeping just 5 SVs:

    strunc(6:end) = 0;
    filtdata = U*diag(strunc)*V';
    colormap(genecolormap);
    image(20*filtdata+32)

  16. Eigengenes and Eigenarrays
Thus far, we haven't actually "looked at" the largest-SV orthogonal basis vectors, namely the first few columns of U and V.

    plot(U(:,1),'b')
    hold on
    plot(U(:,2),'r')
    plot(U(:,3),'g')
    hold off

These are "eigenarrays": the linear combinations of experiments that explain the most data. (The slide's note "times SV" reminds us that each basis vector enters the data only after being scaled by its singular value.)

    plot(V(:,1),'b')
    hold on
    plot(V(:,2),'r')
    plot(V(:,3),'g')
    hold off

These are "eigengenes": the linear combinations of genes that explain the most data.
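
A small check (not on the slide) of how these basis vectors relate to the data: each eigengene is the data projected onto the corresponding eigenarray, rescaled by its SV, and vice versa.

    i = 1;                                % any index from 1 to 300
    norm(V(:,i) - data'*U(:,i)/S(i,i))    % eigengene i  =  X^T * (eigenarray i) / s_i
    norm(U(:,i) - data*V(:,i)/S(i,i))     % eigenarray i =  X * (eigengene i) / s_i

Both norms come out at machine-precision level.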

  17. However, except in special cases, eigengenes and eigenarrays are not easily interpreted. Since we can permute the order of experiments and/or genes in the data, the "shape" of the eigenfunctions has no particular meaning here. Also, as discussed before, main effects generally don't correspond 1-to-1 to eigenanythings. At best, ~K main effects are in ~K eigenthingies.
Let's construct a toy gene expression example with 2 main effects, and see how they show up in eigengenes and eigenarrays.

  18. Make a toy example with (what we would call) 2 main effects:

    pdata = randn(500,300);
    pdata(101:200,51:100) = pdata(101:200,51:100) + 1;     % these genes overexpress in these arrays
    pdata(301:400,201:250) = pdata(301:400,201:250) - 1;   % these genes underexpress in these arrays
    pmean = mean(pdata,1);
    pstd = std(pdata,1);
    pdata = (pdata - repmat(pmean,[size(pdata,1),1]))./repmat(pstd,[size(pdata,1),1]);
    colormap(genecolormap)
    image(20*pdata+32)

Note the slight vertical banding in the image, which comes from subtracting the (column) means.

  19. As (naively?) expected, there are exactly two large principal components (or SVs):

    [Up Sp Vp] = svd(pdata,0);
    spsq = diag(Sp).^2;
    semilogy(spsq(1:50),'.b')   % the plot continues on to 300

So should we expect the eigengenes/eigenarrays to show the separate main effects?

  20. Plot the first three eigengenes and eigenarrays:

    plot(Up(:,1),'b')
    hold on
    plot(Up(:,2),'r')
    plot(Up(:,3),'g')
    hold off
    plot(Vp(:,1),'b')
    hold on
    plot(Vp(:,2),'r')
    plot(Vp(:,3),'g')
    hold off

The first two contain an (orthogonalized) mixture of the two main effects. The third one is, as we expect, "random". If we had 20 main effects, the first 20 eigengenes/arrays would be mixtures of them.

  21. Non-Negative Matrix Factorization (NMF)
There exist methods of Non-negative Matrix Factorization (NMF) whose purpose is to stop main effects from mixing when positivity matters. Example: cluster text documents by word counts. Here X is the positive matrix of word counts, with rows indexed by documents and columns by words, and we seek

    X ≈ F G^T = sum_{i=1..k} F_{.i} ⊗ G_{.i}

where F and G are also positive matrices: column F_{.i} is a vector of 0-or-positive weights of the documents in the i-th cluster, and column G_{.i} is a vector of 0-or-positive weights of the words in the i-th cluster.
For our problem, since genes can also be under-expressed, we would need X ≈ F S G^T with S a diagonal matrix of ±1's.
The problem with these methods is:
1. The factorizations are (far from) unique!
2. The computational algorithms are little better than brute-force minimization of ||X − F G^T||. How to find the global minimum??
3. There are other good clustering algorithms (GMMs, hierarchical, etc.)
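
For concreteness, a minimal sketch of one standard NMF algorithm, the Lee–Seung multiplicative updates, which locally minimize ||X - F G^T|| while keeping F and G nonnegative. (Not on the slide; it assumes a nonnegative X, as in the word-count example, so it would not apply directly to our signed expression data, and the number of clusters k is the user's guess.)

    k = 2;                                     % number of clusters (user's choice)
    X = rand(500,300);                         % stand-in nonnegative data matrix
    F = rand(size(X,1),k);                     % random nonnegative initializations
    G = rand(size(X,2),k);
    for iter = 1:500
        % multiplicative updates: preserve nonnegativity and decrease ||X - F*G'||
        G = G .* ((X'*F) ./ (G*(F'*F) + eps));
        F = F .* ((X*G) ./ (F*(G'*G) + eps));
    end
    norm(X - F*G','fro')                       % converges only to a local minimum, and the
                                               % factorization is far from unique (point 1 above)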
