Principles of Multivariate Analysis

Wojtek Krzanowski


Multivariate analysis is necessary whenever more than one characteristic is observed on each individual under study. Applications arise in very many areas of study. This book provides a comprehensive introduction to available techniques for analysing data of this form, written in a style that should appeal to non-specialists as well as to statisticians. In particular, geometric intuition is emphasized in preference to algebraic manipulation wherever possible. The new edition includes a survey of the most recent developments in the subject.


Mentioned in questions and answers.

I am using CCA for my work and want to understand something.

This is my MATLAB code. I have only taken 100 samples to better understand the concepts of CCA.

clc;clear all;close all;

load carbig;
data = [Displacement Horsepower Weight Acceleration MPG];

data(isnan(data))=0;

X = data(1:100,1:3);
Y = data(1:100,4:5);

[wx,wy,~,U,V] = CCA(X,Y);

clear Acceleration Cylinders Displacement Horsepower MPG Mfg Model Model_Year Origin Weight when org

subplot(1,2,1),plot(U(:,1),V(:,1),'.');
subplot(1,2,2),plot(U(:,2),V(:,2),'.');

My plots look like this: [image]

This shows that in the first figure (left) the transformed variables are highly correlated, with little scatter around the central axis, while in the second figure (right) the scatter around the central axis is much larger.

As I understand from here, CCA maximizes the correlation between the data in the transformed space. So I tried to design a matching score that should return a minimum value if the vectors are maximally correlated. I tried to match each vector U(i,:) with each vector V(j,:), with i and j going from 1 to 100.

%% Finding the difference between the projected vectors
for i=1:size(U,1)
    cost = repmat(U(i,:),size(U,1),1)- V;
    for j=1:size(U,1)
        c(i,j) = norm(cost(j,:),size(U,2));
    end
    [~,idx(i)] = min(c(i,:));
end

Ideally, idx should look like this:

idx = 1 2 3 4 5 6 7 8 9 10 ....

as they are maximally correlated. However, my output looks something like this:

idx = 80 5 3 1 4 7 17 17 17 10 68 78 78 75 9 10 5 1 6 17 .....

I don't understand why this happens.

  1. Am I wrong somewhere? Aren't the vectors supposed to be maximally correlated in the transformed CCA subspace?
  2. If my assumption above is wrong, please point me in the right direction.

Thanks in advance.

First, let me rewrite your code for R2014b:

load carbig;
data = [Displacement Horsepower Weight Acceleration MPG];

% Truncate the data, to follow-up with your sample code
data = data(1:100,:);

nans = sum(isnan(data),2) > 0;
% Drop the rows that contain NaN, then split the columns into the two variable sets
[wx, wy, r, U, V] = canoncorr(data(~nans,1:3), data(~nans,4:5));

OK, now the trick is that the vectors that are maximally correlated in the CCA subspace are the column vectors, U(:,1) with V(:,1) and U(:,2) with V(:,2), and not the row vectors U(i,:) and V(j,:) that you are trying to match. In the CCA subspace, each vector has one entry per sample, so it is N-dimensional (N=100 in your code) and not a simple 2D vector. That's the reason why visualizing CCA results is often quite complicated!

By the way, the correlations are given by the third output of canoncorr, which you (intentionally?) chose to skip in your code. If you check its contents, you'll see that the correlations (i.e. the pairs of vectors) are sorted in decreasing order:

r =
    0.9484    0.5991
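
If you want to check this yourself, here is a minimal sketch. It assumes the Statistics Toolbox is available (for carbig and canoncorr); the names keep and crossUV are mine, introduced only for illustration. It compares the column-wise correlations between U and V with r:

```matlab
% Sketch: canonical variates are correlated column by column, not row by row.
load carbig;                                   % sample data set (Statistics Toolbox)
data = [Displacement Horsepower Weight Acceleration MPG];
data = data(1:100,:);                          % same 100 samples as in the question
keep = ~any(isnan(data),2);                    % rows without any NaN

[~, ~, r, U, V] = canoncorr(data(keep,1:3), data(keep,4:5));

C = corrcoef([U V]);                           % correlation matrix of [U1 U2 V1 V2]
crossUV = C(1:2,3:4);                          % correlations between U columns and V columns

disp(r);                                       % canonical correlations
disp(diag(crossUV)');                          % should agree with r up to rounding
disp(crossUV - diag(diag(crossUV)));           % mixed pairs (U1,V2), (U2,V1): close to 0
```

The diagonal of crossUV should reproduce r, while the off-diagonal entries should be near zero. Nothing in this says that row i of U has to be the nearest neighbour of row i of V, which is why the idx test in the question does not return 1 2 3 ...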

It is hard to explain CCA better than the link you already provided. If you want to go further, you should probably invest in a book, like this one or this one.