Canonical correlations
• Purpose
• Calculation of canonical correlations
• Relevance to prediction
• Relevance to correspondence analysis
• R commands
Purpose
Canonical correlation analysis describes the relation between two sets of variables. Given variable vectors x and y, it finds linear combinations $a^T x$ and $b^T y$ whose correlation is as large as possible. This can give some insight into the structure of the relation between the two sets. PCA looks for structure within a single set of variables, whereas canonical correlation analysis looks for structure between two sets. In some sense canonical correlation analysis is an extension of correspondence analysis: as in correspondence analysis, we want to find the relation between two sets of variables. It is also a dimension reduction technique. If there are many x variables and many y variables, we want as few new variables as possible, each a linear combination of the original variables, that describe the relation between the two sets.
Population canonical correlation
Let us assume that we have two vectors of variables, x and y, with dimensions p and q. Define the covariance matrix of the vector (x, y) as:

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \qquad \Sigma_{21} = \Sigma_{12}^T$$

where $\Sigma_{11}$ is the covariance matrix of x, $\Sigma_{22}$ is the covariance matrix of y and $\Sigma_{12}$ is the covariance matrix between x and y. The dimensions of these matrices are p×p, q×q and p×q respectively. We want to find linear combinations of the x variables and of the y variables so that the correlation between them is maximum. Denote the coefficients by a and b and the corresponding linear combinations by $\eta = a^T x$ and $\varphi = b^T y$. The correlation between these variables can then be written as:

$$\rho = \frac{a^T \Sigma_{12}\, b}{\sqrt{a^T \Sigma_{11}\, a \;\; b^T \Sigma_{22}\, b}}$$

The purpose of canonical correlation analysis is to maximise this correlation and to find the coefficients a and b together with the corresponding correlation.
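The correlation formula can be checked numerically. The following is a minimal R sketch, not part of the original slides: it assumes the built-in iris measurements as illustrative data (sepal columns as x, petal columns as y), uses the sample covariance matrix in place of the population one, and picks the coefficient vectors a and b arbitrarily.

```r
# Illustrative data (an assumption): two blocks of the built-in iris data
x <- as.matrix(iris[, 1:2])   # p = 2
y <- as.matrix(iris[, 3:4])   # q = 2

S   <- cov(cbind(x, y))       # joint (p+q) x (p+q) covariance matrix
S11 <- S[1:2, 1:2]            # covariance of x
S22 <- S[3:4, 3:4]            # covariance of y
S12 <- S[1:2, 3:4]            # covariance between x and y

a <- c(1, -1)                 # arbitrary coefficient vectors, not optimal ones
b <- c(0.5, 2)

# Correlation of a'x and b'y computed from the covariance blocks
rho_formula <- drop(t(a) %*% S12 %*% b) /
  sqrt(drop(t(a) %*% S11 %*% a) * drop(t(b) %*% S22 %*% b))

# The same correlation computed directly from the two linear combinations
rho_direct <- cor(drop(x %*% a), drop(y %*% b))
```

Both numbers agree up to rounding. Canonical correlation analysis then searches over a and b for the pair that makes this value as large as possible.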
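Population canonical correlation (cont.)
The correlation does not change when a or b is rescaled, so the variances of the two linear combinations can be fixed at 1. The problem of finding the maximum correlation and the corresponding coefficients is then equivalent to the constrained maximisation:

$$\max_{a,\,b}\; a^T \Sigma_{12}\, b \qquad \text{subject to} \qquad a^T \Sigma_{11}\, a = 1, \quad b^T \Sigma_{22}\, b = 1$$

There are several ways of solving this problem. Here is one of them. Using the Lagrange multipliers technique, with multipliers $\lambda$ and $\mu$:

$$L = a^T \Sigma_{12}\, b - \frac{\lambda}{2}\left(a^T \Sigma_{11}\, a - 1\right) - \frac{\mu}{2}\left(b^T \Sigma_{22}\, b - 1\right)$$

$$\frac{\partial L}{\partial a} = \Sigma_{12}\, b - \lambda\, \Sigma_{11}\, a = 0, \qquad \frac{\partial L}{\partial b} = \Sigma_{21}\, a - \mu\, \Sigma_{22}\, b = 0$$

Eliminating b gives $\Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}\, a = \lambda\mu\, \Sigma_{11}\, a$. Now define the following matrices and variables:

$$M = \Sigma_{11}^{-1/2}\, \Sigma_{12}\, \Sigma_{22}^{-1}\, \Sigma_{21}\, \Sigma_{11}^{-1/2}, \qquad \alpha = \Sigma_{11}^{1/2}\, a, \qquad \rho^2 = \lambda\mu$$

Now the problem reduces to

$$M \alpha = \rho^2 \alpha$$

That is an eigenvalue and eigenvector problem for the symmetric matrix M.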
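Population canonical correlation (cont.)
Once the eigenvalue and eigenvector problem has been solved we can find the vectors a and b. These vectors are called canonical correlation vectors. The linear combinations $a^T x$ and $b^T y$ are called canonical correlation variables. The square roots of the eigenvalues are called canonical correlations. Because $\Sigma_{12}$ has rank at most min(p,q), there are at most min(p,q) non-zero canonical correlations and corresponding pairs of canonical variables. The relation between the Lagrange multipliers $\lambda$ and $\mu$ of the two unit-variance constraints can be found from the constraints themselves. Multiplying the stationarity conditions $\Sigma_{12} b = \lambda \Sigma_{11} a$ and $\Sigma_{21} a = \mu \Sigma_{22} b$ on the left by $a^T$ and $b^T$ respectively gives:

$$a^T \Sigma_{12}\, b = \lambda\, a^T \Sigma_{11}\, a = \lambda, \qquad b^T \Sigma_{21}\, a = \mu\, b^T \Sigma_{22}\, b = \mu, \qquad \text{so} \quad \lambda = \mu = \rho$$

Thus we have found the canonical correlations and the corresponding vectors and variables. There are other ways of deriving these results, based on the singular value decomposition (SVD). Usual implementations use the SVD approach.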
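The eigenvalue route can be sketched in R as follows. This is an illustration under stated assumptions rather than the slides' own code: the iris columns stand in for x and y, sample covariances replace the population ones, and the eigenproblem is solved in the equivalent unsymmetrised form $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\, a = \rho^2 a$.

```r
# Illustrative data (an assumption): iris sepal columns as x, petal columns as y
x <- as.matrix(iris[, 1:2])
y <- as.matrix(iris[, 3:4])

S11 <- cov(x)        # p x p
S22 <- cov(y)        # q x q
S12 <- cov(x, y)     # p x q cross-covariance

# Eigenproblem in a: S11^{-1} S12 S22^{-1} S21 a = rho^2 a
M   <- solve(S11, S12) %*% solve(S22, t(S12))
ev  <- eigen(M)
rho <- sqrt(Re(ev$values))       # canonical correlations, largest first

# First canonical correlation vector for x, scaled so that a' S11 a = 1
a1 <- Re(ev$vectors[, 1])
a1 <- a1 / sqrt(drop(t(a1) %*% S11 %*% a1))

# Matching vector for y from the stationarity condition: b = S22^{-1} S21 a / rho
b1 <- drop(solve(S22, t(S12) %*% a1)) / rho[1]

# The correlation of the first pair of canonical variables equals rho[1]
r_first_pair <- cor(drop(x %*% a1), drop(y %*% b1))
```

The values in rho should match `cancor(x, y)$cor`, which reaches the same result through a QR and SVD computation instead of an explicit eigen-decomposition.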
Sample canonical correlations
Usually the population covariance matrix is not known. In that case, as usual, the sample covariance matrix is used in its place. If the data are multivariate normal, the resulting canonical correlations, vectors and variables are maximum likelihood estimators of the corresponding population values. As always, care should be taken when using the sample covariance matrix. If better estimates of the covariance or correlation matrix are available (robust estimators, for example), it is better to use them.
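Since the calculation needs only a covariance or correlation matrix, any estimate of it can be plugged in. The sketch below assumes a hypothetical helper, `cancor_from_cov`, written for this illustration only, and uses Spearman rank correlation from base R as a simple stand-in for a more outlier-resistant estimate. The slides do not say which robust estimator they have in mind.

```r
# Hypothetical helper: canonical correlations from a joint (p+q) x (p+q)
# covariance or correlation matrix S whose first p variables form x.
cancor_from_cov <- function(S, p) {
  ix  <- seq_len(p)
  iy  <- seq_len(ncol(S))[-ix]
  S11 <- S[ix, ix, drop = FALSE]
  S22 <- S[iy, iy, drop = FALSE]
  S12 <- S[ix, iy, drop = FALSE]
  M   <- solve(S11, S12) %*% solve(S22, t(S12))
  ev  <- Re(eigen(M, only.values = TRUE)$values)
  k   <- min(p, length(iy))            # at most min(p, q) correlations
  sqrt(pmax(ev[seq_len(k)], 0))        # guard against tiny negative values
}

z <- as.matrix(iris[, 1:4])            # illustrative data (an assumption)

r_sample <- cancor_from_cov(cov(z), p = 2)                       # sample covariance
r_rank   <- cancor_from_cov(cor(z, method = "spearman"), p = 2)  # rank-based
```

With the ordinary sample covariance the helper reproduces `cancor(z[, 1:2], z[, 3:4])$cor`. The rank-based version will generally give somewhat different values.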
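Prediction
One of the purposes of canonical correlation analysis is to predict one set of variables from another: in our case, to predict values of y when we know values of x. Let us assume that we have a data matrix (X,Y) with n observations (individuals) and (p+q) variables. Let $a_i$ and $b_i$ be the ith canonical correlation vectors. Then the vectors $X a_i$ and $Y b_i$ are the scores of the observations on the ith canonical correlation variables. Since correlations are unchanged by linear transformations, the scores for each observation can be written as:

$$\eta_i^* = c_1\, a_i^T x + d_1, \qquad \varphi_i^* = c_2\, b_i^T y + d_2$$

where $c_1$, $c_2$ (both positive), $d_1$ and $d_2$ are coefficients. If x is the predictor, y is the predicted variable, the ith canonical correlation is $r_i$, and $a_i$ and $b_i$ are scaled so that $X a_i$ and $Y b_i$ have unit sample variance, then the least squares predicted value can be written as:

$$\hat{\varphi}_i^* = \frac{c_2\, r_i}{c_1}\left(\eta_i^* - d_1 - c_1\, a_i^T \bar{x}\right) + d_2 + c_2\, b_i^T \bar{y}$$

With $c_1 = c_2 = 1$ and $d_1 = d_2 = 0$ this is $\hat{\varphi}_i = r_i\,(\eta_i - a_i^T \bar{x}) + b_i^T \bar{y}$. The prediction is most important for the first canonical correlation variable.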
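The prediction rule can be tried out in R. The sketch below is an illustration only, with the iris columns as assumed data. It relies on the fact that `cancor` centres the data and returns coefficients that give scores with unit sum of squares, so the scores are multiplied by sqrt(n - 1) to give them unit sample variance. With centred scores the mean terms drop out and the prediction is simply $r_1 \eta_1$.

```r
# Illustrative data (an assumption): sepal columns predict petal columns
x  <- as.matrix(iris[, 1:2])
y  <- as.matrix(iris[, 3:4])
cc <- cancor(x, y)
n  <- nrow(x)

# Scores on the first pair of canonical variables, centred, unit sample variance
eta <- drop(sweep(x, 2, cc$xcenter) %*% cc$xcoef[, 1]) * sqrt(n - 1)
phi <- drop(sweep(y, 2, cc$ycenter) %*% cc$ycoef[, 1]) * sqrt(n - 1)

r1 <- cc$cor[1]            # first canonical correlation

# Predicted score of y on its first canonical variable, given the x score
phi_hat <- r1 * eta

# The same prediction from an ordinary least squares fit, for comparison
fit <- lm(phi ~ eta)
```

The fitted slope of `fit` equals the first canonical correlation and the intercept is zero up to rounding, which is what the prediction formula states for centred, unit-variance scores.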
Canonical correlation and correspondence analysis
There is a relation between canonical correlation analysis and correspondence analysis. In correspondence analysis we had two categorical variables and we wanted to order their categories (along one dimension) and find the relation between them. Now let us assume we have a contingency table:

          excellent  very good  good  fair  poor
Drug A        6          8       10     1     5
Drug B       12          8        3     3     5
Drug C        0          3       12     6    10
Drug D        1          1        8    12     7

The number of individuals is the sum of all the values in the table. Let us call this value n (here n = 121). Now we want to define dummy variables. These variables indicate the membership of each individual in the rows (one set of variables) and in the columns (the other set).
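Canonical correlation and correspondence analysis (cont.)
Let us define the x and y variables as follows:

$$x_i = \begin{cases} 1 & \text{if the individual belongs to row } i \\ 0 & \text{otherwise} \end{cases} \qquad y_j = \begin{cases} 1 & \text{if the individual belongs to column } j \\ 0 & \text{otherwise} \end{cases}$$

Thus we have (p+q) variables, where p is the number of rows and q is the number of columns. The number of "observations" or individuals is the sum of all the values in the table. Finding the canonical correlations for these dummy variables is equivalent to correspondence analysis of the table. That is the reason why in correspondence analysis the results are called canonical correlations.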
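The equivalence can be checked on the drug table. The R sketch below is an illustration, not the slides' own code. It expands the table into one row per individual, builds the dummy variables, and compares the canonical correlations with the singular values of the standardised residual matrix used in correspondence analysis. One dummy column is dropped from each set, because the dummies in a set sum to 1 and the full set is therefore collinear after centring. This is an implementation choice made here.

```r
tab <- matrix(c( 6, 8, 10,  1,  5,
                12, 8,  3,  3,  5,
                 0, 3, 12,  6, 10,
                 1, 1,  8, 12,  7),
              nrow = 4, byrow = TRUE,
              dimnames = list(drug   = paste("Drug", LETTERS[1:4]),
                              rating = c("excellent", "very good", "good",
                                         "fair", "poor")))

# One entry per individual: its row index and its column index
row_id <- rep(as.vector(row(tab)), times = as.vector(tab))
col_id <- rep(as.vector(col(tab)), times = as.vector(tab))
n <- length(row_id)

# Dummy (indicator) variables: n x p for rows, n x q for columns
X <- outer(row_id, seq_len(nrow(tab)), "==") * 1
Y <- outer(col_id, seq_len(ncol(tab)), "==") * 1

# Canonical correlations, dropping one redundant dummy from each set
cc <- cancor(X[, -1], Y[, -1])

# Correspondence analysis: SVD of the standardised residuals
P  <- tab / n
rw <- rowSums(P)                      # row masses
cl <- colSums(P)                      # column masses
Z  <- diag(1 / sqrt(rw)) %*% (P - rw %o% cl) %*% diag(1 / sqrt(cl))
sv <- svd(Z)$d
```

The three values in `cc$cor` coincide with the first three singular values in `sv`. The remaining singular value is zero up to rounding.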
R commands for canonical correlation
cancor(x, y)
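A short usage sketch follows. The data set is an assumption for illustration: LifeCycleSavings ships with R, and the split into two blocks of variables mirrors the example in the `cancor` help page.

```r
# Two sets of variables measured on the same 50 countries
pop <- LifeCycleSavings[, 2:3]       # population variables: pop15, pop75
oec <- LifeCycleSavings[, -(2:3)]    # economic variables: sr, dpi, ddpi

cc <- cancor(pop, oec)

cc$cor       # canonical correlations, min(p, q) = 2 of them
cc$xcoef     # coefficients for the variables in pop (columns = vectors a)
cc$ycoef     # coefficients for the variables in oec (columns = vectors b)
cc$xcenter   # column means subtracted from pop before the analysis
cc$ycenter   # column means subtracted from oec before the analysis
```

By default `cancor` centres both data sets, and the returned coefficients give canonical variables with unit sum of squares rather than unit variance.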
References
• Mardia, K.V., Kent, J.T. and Bibby, J.M. (2003) Multivariate Analysis.