Multivariate analysis is necessary whenever more than one characteristic is observed on each individual under study. Applications arise in many areas of study. This book provides a comprehensive introduction to available techniques for analysing data of this form, written in a style that should appeal to non-specialists as well as to statisticians. In particular, geometric intuition is emphasized in preference to algebraic manipulation wherever possible. The new edition includes a survey of the most recent developments in the subject.
I am using CCA for my work and want to understand something.
This is my MATLAB code. I have only taken 100 samples to better understand the concepts of CCA.
clc; clear all; close all;
load carbig;
data = [Displacement Horsepower Weight Acceleration MPG];
data(isnan(data)) = 0;
X = data(1:100,1:3);
Y = data(1:100,4:5);
[wx,wy,~,U,V] = CCA(X,Y);
clear Acceleration Cylinders Displacement Horsepower MPG Mfg Model Model_Year Origin Weight when org
subplot(1,2,1), plot(U(:,1),V(:,1),'.');
subplot(1,2,2), plot(U(:,2),V(:,2),'.');
My plots look like this:
They show that in the first figure (left) the transformed variables are highly correlated, with little scatter around the central axis, while in the second figure (right) the scatter around the central axis is much larger.
As I understand from here, CCA maximizes the correlation between the data in the transformed space. So I tried to design a matching score that should return a minimum value when the vectors are maximally correlated. I tried to match each vector U(i,:) with each V(j,:), with i,j going from 1 to 100.
%% Finding the difference between the projected vectors
for i=1:size(U,1)
    cost = repmat(U(i,:),size(U,1),1) - V;
    for j=1:size(U,1)
        c(i,j) = norm(cost(j,:),size(U,2));
    end
    [~,idx(i)] = min(c(i,:));
end
Ideally idx should be like this :
idx = 1 2 3 4 5 6 7 8 9 10 ....
as they are maximally correlated. However my output comes something like this :
idx = 80 5 3 1 4 7 17 17 17 10 68 78 78 75 9 10 5 1 6 17 .....
I don't understand why this happens.
Thanks in advance.
First, let me transpose your code to R2014b:
load carbig;
data = [Displacement Horsepower Weight Acceleration MPG];
% Truncate the data, to follow up with your sample code
data = data(1:100,:);
% Flag the rows that contain at least one NaN, and leave them out
nans = sum(isnan(data),2) > 0;
[wx, wy, r, U, V] = canoncorr(data(~nans,1:3), data(~nans,4:5));
OK, now the trick is that the vectors which are maximally correlated in the CCA subspace are the column vectors U(:,1) and V(:,1) (and then U(:,2) and V(:,2)), not the row vectors U(i,:) and V(j,:) that you are trying to compare. In the CCA subspace, the vectors are N-dimensional (here N=100), not simple 2D vectors. That's why visualizing CCA results is often quite complicated!
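To make the row-versus-column point concrete, here is a minimal sketch on synthetic data (the data, the seed and the names z, idx and fracSelf are my own assumptions, not taken from the question). It repeats the row-wise nearest-neighbour matching from the question and reports how many samples end up matched to themselves. Since CCA only constrains the correlation between whole columns, nothing forces that share to be 100%.

```matlab
rng(0);                                   % fixed seed, illustrative data only
n = 100;
z = randn(n,1);                           % shared latent signal (assumption)
X = [z + 0.3*randn(n,1), randn(n,2)];     % 3 variables, like X in the question
Y = [z + 0.3*randn(n,1), randn(n,1)];     % 2 variables, like Y in the question
[~, ~, r, U, V] = canoncorr(X, Y);        % needs the Statistics Toolbox

% Row-wise nearest-neighbour matching, as attempted in the question
idx = zeros(n,1);
for i = 1:n
    % Euclidean distance from U(i,:) to every row of V
    d = sqrt(sum((V - repmat(U(i,:), n, 1)).^2, 2));
    [~, idx(i)] = min(d);
end

% Share of samples whose closest row of V is their own row
fracSelf = mean(idx == (1:n)');
fprintf('r = [%s], self-matches: %.0f%%\n', num2str(r, '%.3f '), 100*fracSelf);
```

A high first canonical correlation says that the points (U(i,1), V(i,1)) lie close to a line, not that each U(i,:) is closer to V(i,:) than to every other row of V.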
By the way, the correlations are given by the third output of canoncorr, which you (intentionally?) chose to skip in your code. If you check its content, you'll see that the correlations (i.e. the vectors) are well-ordered:
r = 0.9484 0.5991
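If you want to check this yourself, the following sketch (again on synthetic data, with a seed and variable names that are only assumptions for illustration) compares the third output of canoncorr with the correlation computed column by column between U and V. The two should agree up to rounding, with the largest correlation first.

```matlab
rng(1);                                   % fixed seed, illustrative data only
n = 100;
z = randn(n,1);                           % shared latent signal (assumption)
X = [z + 0.5*randn(n,1), randn(n,2)];
Y = [z + 0.5*randn(n,1), randn(n,1)];
[~, ~, r, U, V] = canoncorr(X, Y);        % needs the Statistics Toolbox

% Correlation between matching columns of U and V
c = zeros(numel(r), 1);
for k = 1:numel(r)
    c(k) = corr(U(:,k), V(:,k));
end

% Left column: canoncorr output, right column: column-wise correlation
disp([r(:), c(:)]);
```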