How to vectorize nested for loops that compute pair wise correlations using indexed vectors?
Show older comments
Hi,
I would like to vectorize two for loops in the following function, which returns the indices of those rows that have pair wise correlations less than a threshold.
function idx_pairs = get_indices_pair_corr(data, idx, idx2, tau)
idx_pairs = [];
for i=1:length(idx)
for j=1:length(idx2)
if pearson_corr_coeff(data(idx(i),:), data(idx2(j),:)) < tau
idx_pairs = [idx_pairs; [idx(i) idx2(j)]];
end
end
end
- data is a (MxN) matrix
- idx, idx2 are vector of unequal/equal lengths
- tau is a scalar.
- pearson_corr_coeff(...) returns a scalar
If it is not possible to vectorize, please explain why so that I can learn more about the cases when vectorization is not possible.
thnaks
Answers (1)
There might be a better way, but here's how I might go about it.
function idx_pairs = get_indices_pair_corr(data, idx, idx2, tau)
% make sure vectors of indices are column vectors
idx = idx(:);
idx2 = idx2(:);
% lengths of indices
n = length(idx);
n2 = length(idx2);
% extract the rows of data corresponding to idx and then transpose so that
% columns correspond to different variables, and rows to observations
data_red1 = data(idx,:).';
% normalize data by subtracting the column mean, and then dividing by the
% column standard deviation; now each column has mean 0 and std 1
data_red1 = (data_red1 - mean(data_red1,1))./std(data_red1,1,1);
% do the same manipulations for data corresponding to idx2
data_red2 = data(idx2,:).';
data_red2 = (data_red2 - mean(data_red2,1))./std(data_red2,1,1);
% here we reshape data_red1 a 3D array, so that if originally data_red1
% was an NxM matrix, it's now an Nx1xM array
sz = size(data_red1);
reshdata_red1 = reshape(data_red1,[sz(1),1,sz(2)]);
% the (i,j,k) element of the 3D array X will be the product of
% data_red1(i,k) and data_red2(i,j)
X = data_red2.*reshdata_red1;
% now we compute correlation coefficients by taking the mean along the
% first dimension of those products, and then reshape the resulting 1xn2xn
% array to a n2xn matrix, the (j,k)-th element of which is the correlation
% coefficient between data_red1(:,k) and data_red2(:,j)
crs = reshape(mean(X,1),[n2,n]);
% locate correlations that are less than tau (NOTE: are you sure you don't
% want to check whether the ABSOLUTE VALUE of the correlation coefficient
% is less than tau? If so, replace crs with abs(crs) here.)
tst = (crs < tau); % logical matrix with (j,k) element 'true' if correlation
% between data_red1(:,k) and data_red2(:,j) is less than tau
[i,j] = find(tst); % get row/column indices associated with the true values
idx_pairs = [idx(j(:)),idx2(i(:))]; % convert to indices from the original data matrix
Depending on the sizes of the arrays you're using, this approach may be much faster than your loop version. For example, if data is 1000x1000, and idx and idx2 each have 100 elements in them, then your code (slight modified because I don't have this function pearson_corr_coeff) took about 1 second to run on my machine, whereas my code took about 0.02 seconds.
8 Comments
Wasim Aftab
on 26 Jun 2020
Edited: Wasim Aftab
on 26 Jun 2020
Dana
on 26 Jun 2020
You're right that the indices found in [i,j] = find(tst) are relative to the tst matrix, not the data matrix. However, this is corrected at the next step:
idx_pairs = [idx(j),idx2(i)];
So if tst(i,j)=true, then it means the correlation between data_red1(:,k) and data_red2(:,j) is less than tau, which in turn means row idx(j) and row idx2(i) of data have correlation less than tau, and therefore we want to include the pair [idx(j),idx2(i)] in idx_pairs.
Now, playing around with it, it turns out there are potentially some minor discrepancies between the results of my code and yours. However, this is due to a minor error in my computation of the correlation coefficient. The code I posted before incorrectly divides by the unbiased estimate of the standard deviation,
rather than the biased one,
If you replace std(data_red1,0,1) and std(data_red2,0,1) where they appear by std(data_red1,1,1) and std(data_red2,1,1), respectively, you should now find that my code and your code produce exactly the same answers..
Wasim Aftab
on 26 Jun 2020
Edited: Wasim Aftab
on 26 Jun 2020
Dana
on 26 Jun 2020
No, that's not possible. i and j are the result of [i,j] = find(tst). These are vectors such that i(k) and j(k) are, respectively, the row and column indices corresponding to the location of the k-th true in the matrix tst. The function find here will never return vectors i and j with different numbers of elements. So if the code failed, that can't be the reason.
Can you provide more information on when you observed this code failure, and what exactly you mean by a failure (i.e., did it report an error, or just output that you didn't expect, and if so what?)? I've run it probably 20-30 times myself with randomly generated matrices and never had a failure.
Wasim Aftab
on 26 Jun 2020
Edited: Wasim Aftab
on 26 Jun 2020
Dana
on 26 Jun 2020
Okay, so first of all, when you hovered your mouse over idx(j) and idx2(i), it was showing you the contents and dimensions of idx and idx2, not the dimensions of idx(j) and idx2(i). My guess is if you do the same thing, but actually check the i and j vectors, you'll find that they're the same.
Playing with it, I think the issue is that the code I wrote assumed that idx and idx2 both had more than 1 element in them. Your error popped up precisely in the case where idx2 only had a single element. The error happens because of a combination of weird conventions that MATLAB uses (not worth getting into), but in any case, if you replace the last line by:
idx_pairs = [idx(j(:)),idx2(i(:))];
that should hopefully fix the problem.
Wasim Aftab
on 7 Jul 2020
As far as I can tell, the most obvious direction to try to improve surrounds your array neg. It looks like this array is contained within a loop, and each time through the loop you're adding more rows to it. This is very slow. Basically, each time through the loop, MATLAB has to create a whole new array with the new dimensions of neg will be, copy the information in the old neg into the first bunch of rows, and then put the new information in the new rows. That's a lot of copying, and MATLAB is actually probably warning you in the code editor that this not a great idea (I'm guessing neg has a squiggly orange line underneath it, and if you hover your mouse over it tells you that it's changing size on each iteration, and that you should consider pre-allocating).
Depending on whether or not you know ahead of time roughly how many rows neg will end up having, I see two options that could potentially speed this up.
Option 1: # of rows known
If you know in advance how many rows neg will end up having, you can simply pre-allocate neg with the correct number. So for example, you can do something like the following:
neg = zeros(nrws,81,2); % pre-allocation, where nrws is the known # of rows
ctr = 0; % counter to track how many rows of neg have already been assigned
for j = 1:n % n here is the total number of loops
% < put code here to get idx_pairs_bad_corr and data for iteration j >
pos = creat_pairs(idx_pairs_bad_corr, data); % get pairs for iteration j
nnew = size(idx_pairs_bad_corr,1); % number of new rows to be added to neg
neg(ctr+1:ctr+nnew,:,:) = pos; % assign rows of neg for iteration j
ctr = ctr+nnew; % advance counter
end
Option 2: # of rows unknwon
If you don't know how many rows neg could have in general, you can do something like the following:
negcll = cell(n,1); % j-th entry of this cell array will hold the matrix pos for j-th iteration
nnew = zeros(n,1); % vector to hold number of new rows at j-th iteration
for j = 1:n % n here is the total number of loops
% < put code here to get idx_pairs_bad_corr and data for iteration j >
% put pairs for iteration j in j-th element of cell array negcll
negcll{j} = creat_pairs(idx_pairs_bad_corr, data);
nnew(j) = size(idx_pairs_bad_corr,1); % number of new rows at j-th iteration
end
neg = zeros(sum(nnew),81,2); % pre-allocate neg with correct total number of rows
ctr = 0; % counter to track how many rows of neg have already been assigned
for j = 1:n
neg(ctr+1:ctr+nnew(j),:,:) = negcll{j}; % assign rows of neg for iteration j
ctr = ctr+nnew(j); % advance counter
end
EIther option above should make a noticeable improvement in speed, though just how much depends on a number of factors. I'd be curious to hear how much of a difference it makes in your case.
Categories
Find more on Creating and Concatenating Matrices in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!