How to speed up reading from N-dimensional matrix in a nested loop?

7 views (last 30 days)
Hi,
I've got a piece of code that has to run in many nested loops. In the next-to-most inner loop I need to grab some elements from an N-dimensional matrix (N>=5) and perform some calculations on these values. When running the profiler I found that the bottle neck is reading elements from the N-dimensional matrix. The matrix size will typically be [9,9,9,9,9,100] so even though it is big, it's not extremely large.
I wonder if you guys have any good idea on how to speed up this code? Any ideas would be extremely welcome - I've spent days reading on google and trying different things in the profiler, and the quickest way I've found is to stack all the data into a 3D matrix, and looping through the 3rd dimension of this new matrix to perform the calculations.
See a code example below (in this code there might be ways to get rid of some of the loops, but that won't be the case in my real code since that will have a lot of logic on each level). It takes ~1.6sec to run on my laptop, and of this ~1.35 sec is spent reading from the N-dimensional matrix...
Kindly, Anders
% File: test.m
clear
%%This section can not be changed in my real code).
%Set up some grids:
small_grid=3;
large_grid=4;
H=small_grid;
a=linspace(1,4,large_grid)'; A=length(a);
b=linspace(1,3,small_grid)'; B=length(b);
c=linspace(1,2,small_grid)';C=length(c);
d=linspace(1,4,small_grid)';D=length(d);
e=linspace(1,4,10)';E=length(e);
f=1:10;
[ga gb gc gd ge]=ndgrid(a,b,c,d,e);
I=32;%
fn1=ga./3+gb/.5 +gc./3 +gd./4 +ge/4;%Some function
fn2=ga./1+gb/.1 +gc./10 +gd./4 +ge/.1;%Some other function
soughtValue=zeros(length(f),8);
tic
for h=1:H
for a_=1:A
aa=rand(I,1).*(a(end)-a(1))+a(1);%In the real code each level will have a lot of logic...
for b_=1:B
bb=rand(I,1).*(b(end)-b(1))+b(1);
for c_=1:C
cc=rand(I,1).*(c(end)-c(1))+c(1);
for d_=1:D
dd=rand(I,1).*(d(end)-d(1))+d(1);
for e_=1:E
ee=rand(I,1).*(e(end)-e(1))+e(1);
% Calculate which elements to pick out
before1=floor(rand(length(f),1)*(length(a)-1)+1);
after1=before1+1;
before2=floor(rand(length(f),1)*(length(b)-1)+1);
after2=before2+1;
c_index=floor(rand(length(f),1)*(C-1)+1);
d_index=floor(rand(length(f),1)*(D-1)+1);
% Some parameters
a1=floor(rand(length(f))*(length(f)-1)+1);
a2=floor(rand(size(f))*(length(f)-1)+1);
a3=floor(rand(size(f))*(length(f)-1)+1);
a4=floor(rand(size(f))*(length(f)-1)+1);
%%This is the section that can be changed in order to speed up calculations
for ff=1:length(f)
%I've found that it's quickest to store the data in a 3D
%matrix and then loop over the last dimension of this matrix
%to perform the calculations. But even this is slow!
new_grid(:,:,1)= (fn1(:,:,c_index(ff) ,d_index(ff) ,ff));
new_grid(:,:,2)= (fn1(:,:,c_index(ff)+1,d_index(ff) ,ff));
new_grid(:,:,3)= (fn1(:,:,c_index(ff) ,d_index(ff)+1,ff));
new_grid(:,:,4)= (fn1(:,:,c_index(ff)+1,d_index(ff)+1,ff));
new_grid(:,:,5)= (fn2(:,:,c_index(ff) ,d_index(ff) ,ff));
new_grid(:,:,6)= (fn2(:,:,c_index(ff)+1,d_index(ff) ,ff));
new_grid(:,:,7)= (fn2(:,:,c_index(ff) ,d_index(ff)+1,ff));
new_grid(:,:,8)= (fn2(:,:,c_index(ff)+1,d_index(ff)+1,ff));
for gg=1:8
soughtValue(ff,gg) =...
( a1(ff).*new_grid(before1(ff),before2(ff),gg) ) ...
+( a2(ff).*new_grid(before1(ff),after2(ff),gg) ) ...
+( a3(ff).*new_grid(after1(ff) ,before2(ff),gg) ) ...
+( a4(ff).*new_grid(after1(ff) ,after2(ff),gg) );
end
% The first iteration of the loop will look like this. Notice
% that this is many times slower than using the loop!
% soughtValue(ff,1)=...
% ( a1(ff).*(fn1(before1(ff),before2(ff),c_index(ff) ,d_index(ff) ,ff)) ) ...
% +( a2(ff).*(fn1(before1(ff),after2(ff),c_index(ff) ,d_index(ff) ,ff)) ) ...
% +( a3(ff).*(fn1(after1(ff),before2(ff),c_index(ff) ,d_index(ff) ,ff)) ) ...
% +( a4(ff).*(fn1(after1(ff),after2(ff),c_index(ff) ,d_index(ff) ,ff)) );
end
%%End of code that can be changed
end
end
end
end
end
end
toc

Accepted Answer

Cedric
Cedric on 21 Jan 2013
Hi Anders,
Just by replacing the 8 lines defining new_grid with
d3index = [c_index(ff),c_index(ff)+1] ;
new_grid = cat(3, ...
fn1(:,:,d3index,d_index(ff),ff), ...
fn1(:,:,d3index,d_index(ff)+1,ff), ...
fn2(:,:,d3index,d_index(ff),ff), ...
fn2(:,:,d3index,d_index(ff)+1,ff)) ;
you would make it slightly more efficient. About the loop that follows, are you sure that a1(ff),..,a4(ff) are correct indexing-wise? These are 2D arrays that you are accessing using linear indexing.. which seems weird in this context in the sense that you will get the first column of each matrix with ff in [1,..,numel(f)], which we usually index with e.g. a1(ff,1).
Cheers,
Cedric
  7 Comments
Matt J
Matt J on 22 Jan 2013
Edited: Matt J on 22 Jan 2013
The a1 line should read: a1=floor(rand(size(f)) x (length(f)-1)+1);
If that's the case then all lines a1...a4 are computed with the exact same statement. It then makes more sense to compute a single A matrix rather than 4 separate "a" variables
A=floor(rand(length(f),4)*(length(f)-1)+1);
where now a1=A(:,1), a2=A(:,2), etc...
Anders Österling
Anders Österling on 23 Jan 2013
Hi, I just gave it a shot and it did indeed work out to be 20-25% faster than my orignal code. Many thanks Cedric!!!
(Regarding the a's: you're correct Matt! It's better to create them using one line and one A-matrix. But in my 'real' code this is not possible and thus I left them as they are!

Sign in to comment.

More Answers (2)

Matt J
Matt J on 21 Jan 2013
Edited: Matt J on 21 Jan 2013
You can get rid of the loop over gg
soughtValue(ff,:)= ...
( a1(ff).*new_grid(before1(ff),before2(ff),:) ) ...
+( a2(ff).*new_grid(before1(ff),after2(ff),:) ) ...
+( a3(ff).*new_grid(after1(ff) ,before2(ff),:) ) ...
+( a4(ff).*new_grid(after1(ff) ,after2(ff),:) );
Also, you should avoid repeated matrix indexing operations. If you're going to be reusing portions of an array, save them to a temporary variable, e.g.,
cff=c_index(ff);
dff=d_index(ff);
tmp=[cff,cff+1];
new_grid(:,:,1:2)= (fn1(:,:,tmp ,dff ,ff));
new_grid(:,:,3:4)= (fn1(:,:,tmp ,dff+1,ff));
                  new_grid(:,:,5:6)=  (fn2(:,:,tmp ,dff ,ff));
                  new_grid(:,:,7:8)=  (fn2(:,:,tmp ,dff+1,ff));
  4 Comments
Matt J
Matt J on 21 Jan 2013
Not so surprisingly. PERMUTE is expensive. It wouldn't make sense to do it inside any of the loops. I'm also not sure what this
if all([h,a_,b_,c_,d_,e_]==1)
is for.
Cedric
Cedric on 21 Jan 2013
It is for permuting fn1 and fn2 only at the first overall iteration, just to fulfill the requirement not the bring modifications outside of the block that he define as being open to modifications.
The if all() doesn't slow down significantly the execution as far as I can see.

Sign in to comment.


Matt J
Matt J on 21 Jan 2013
Edited: Matt J on 21 Jan 2013
Alright, well here's my fastest version. With the following, I get a 25% speed-up over the original code.
fn=cat(6,fn1,fn2); %PRIOR TO LOOPS
for ff=1:length(f)
new_grid= fn(before1(ff)+[0,1],before2(ff)+[0,1],...
c_index(ff)+[0,1] , d_index(ff)+[0,1],...
ff,:);
AA(4)=a4(ff);
AA(3)=a2(ff);
AA(2)=a3(ff);
AA(1)=a1(ff);
soughtValue(ff,:)=AA*reshape(new_grid,4,8);
end
  6 Comments
Anders Österling
Anders Österling on 23 Jan 2013
Thanks a lot Matt! The first part of the code is really nice, but it seems like it's a tiny bit slower than Cedrics suggestion. But I will keep it as a concice way of writing it! :)
Regarding the last part: I seems to me that this takes ~twice as much computational time as the 8-iteration-loop. Or am I missing something here?
Cheers, Anders
Matt J
Matt J on 23 Jan 2013
Edited: Matt J on 23 Jan 2013
Hi Anders. Not sure what we're calling "the last part". I definitely find that
soughtValue(ff,:)=AA*reshape(new_grid,4,8);
slows things down, as compared to a for-loop, but I haven't measured whether one is twice as fast as the other. It certainly doesn't make more than a 5% impact on the total code's execution time. My fastest version right now is below. By my timings, it's a few percent slower than Cedric's.
My best guess as to why the for-loop is faster is that, maybe, it doesn't allocate any temporary arrays. Possibly, a matrix-vector multiplication always puts its output in a temporary vector, whereas your for-loop might copy results into soughtValue(ff,gg) in a purely in-place fashion.
A few final remarks on all of this:
  1. Because you're jumping around to random locations in an array pulling out values, there are going to be serious limits on how fast you can get. Memory accesses are fastest when done sequentially.
  2. You're doing lots of redundant calculations throughout the code as whole. It doesn't seem to be a good idea to be calling length(f) and size(f) so repeatedly within a for-loop. Each execution by itself is obviously pretty trivial, but the overhead of all those function calls probably adds up.
%Pre-loop computations
fn=cat(6,fn1,fn2);
new_grid=zeros(4,8);
nf=length(f);
for ff=1:nf
new_grid(:) = fn(before1(ff)+[0,1],before2(ff)+[0,1],...
c_index(ff)+[0,1] , d_index(ff)+[0,1],...
ff,:);
%AA1=a1(ff); AA2=a2(ff); AA3=a3(ff); AA4=a4(ff);
for gg=1:8
soughtValue(ff,gg) =...
a1(ff).*new_grid(1,gg)+ ...
a2(ff).*new_grid(2,gg)+ ...
a3(ff).*new_grid(3,gg)+ ...
a4(ff).*new_grid(4,gg);
end
end

Sign in to comment.

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!