Strange performance of MATLAB CUDA on matrices. Any ideas?

Question

1 vote

I have recently started using MATLAB's CUDA (GPU) support for some very simple matrix calculations on the GPU, but the performance results are very strange. Could anybody help me understand what exactly is going on and how I can solve the issue? Thanks in advance. Please note that the following code was run on a GeForce GTX TITAN Black GPU.

Assume a0, a1, ..., a6 are 1000*1000 gpuArrays, U=0.5 and V=0.0.

titan = gpuDevice();
tic();
for i=1:10000
    a6(1,1)=(0.5.*(a5(1,1)-a0(1,1)))-(a1(1,1)+a2(1,1)+a3(1,1))-(a5(1,1).*U./3.0)-(a5(1,1).*V./2.0)+(0.25.*a5(1,1).*a4(1,1));
end
wait(titan);
time = toc()

Result: time = 17.98 seconds.

Now redefining a0, a1, ..., a6, U and V on the CPU and measuring the time needed:

tic();
for i=1:10000
    a6(1,1)=(0.5.*(a5(1,1)-a0(1,1)))-(a1(1,1)+a2(1,1)+a3(1,1))-(a5(1,1).*U./3.0)-(a5(1,1).*V./2.0)+(0.25.*a5(1,1).*a4(1,1));
end
time = toc()

Result: time = 0.0098 seconds.

So this is more than 1800 times faster on the CPU!

Then I decided to do the previous calculation on the whole matrix rather than on a specific element, and here are the results:

Results for the run on the GPU:

titan = gpuDevice();
tic();
for i=1:10000
    a6=(0.5.*(a5-a0))-(a1+a2+a3)-(a5.*U./3.0)-(a5.*V./2.0)+(0.25.*a5.*a4);
end
wait(titan);
time = toc()

Result: time = 6.32 seconds, which means that the operation on the whole matrix is much faster than on a specific element!

Results for the run on the CPU:

tic();
for i=1:10000
    a6=(0.5.*(a5-a0))-(a1+a2+a3)-(a5.*U./3.0)-(a5.*V./2.0)+(0.25.*a5.*a4);
end
time = toc()

Result: time = 35.2 seconds.

And here is the most surprising result: assuming a0, a1, ..., a6, U and V are just 1*1 gpuArrays and running the following:

titan = gpuDevice();
tic();
for i=1:10000
    a6=(0.5.*(a5-a0))-(a1+a2+a3)-(a5.*U./3.0)-(a5.*V./2.0)+(0.25.*a5.*a4);
end
wait(titan);
time = toc()

Result: time = 7.8 seconds.

It is even slower than the corresponding 1000*1000 case!

Unfortunately, the line

a6(1,1)=(0.5.*(a5(1,1)-a0(1,1)))-(a1(1,1)+a2(1,1)+a3(1,1))-(a5(1,1).*U./3.0)-(a5(1,1).*V./2.0)+(0.25.*a5(1,1).*a4(1,1));

is one of about 100 lines, all in a single for-loop, and it has proved to be a real bottleneck, taking about 50% of the total calculation time! Could anybody help me? Note that moving this part of the calculation to the CPU is not an option, because the bottleneck line is inside a for-loop, and sending a1, ..., a6 to the CPU and bringing the results back to the GPU in each iteration is much more time consuming. Any advice is really appreciated.

4 Comments

Joss Knight on 11 Dec 2014

It would be useful to know a bit more about what the real problem is for you, because your test problems are too abstract to be realistic. The CPU is faster than the GPU at repeated serial scalar operations - that is just as expected. And for input sizes around a particular range, there is no particular reason why the GPU would process one size faster than another - it's at the whim of the launch configuration and GPU scheduler. So this doesn't particularly surprise me:

>> a = gpuArray.rand(1);
>> gputimeit(@() a+a+a+a+a)
ans =
    0.0016
>> a = gpuArray.rand(10);
>> gputimeit(@() a+a+a+a+a)
ans =
   8.8711e-04

Although it is curious and I'd like to find out more, the most important thing is this:

>> a = rand(1000);
>> timeit(@() a+a+a+a+a)
ans =
    0.0093
>> a = gpuArray.rand(1000);
>> gputimeit(@() a+a+a+a+a)
ans =
    0.0011

i.e. the GPU is much faster than the CPU at doing lots of computations at once.

You can eliminate the discrepancies using arrayfun to ensure that the same compiled kernel is being run on all your inputs:

>> a = gpuArray.rand(1);
>> gputimeit(@()arrayfun(@(a) a+a+a+a+a, a))
ans =
   5.6745e-04
>> a = gpuArray.rand(10);
>> gputimeit(@()arrayfun(@(a) a+a+a+a+a, a))
ans =
   5.9045e-04

This implies that the kernel or kernels that are being run by MATLAB to compute a+a+a+a+a in the original case are not ideal and so more affected by input size. I'm not sure why. But it would be good to know more about your problem so I can understand what's really important to you and can point you towards the right solution. As currently stated, my instinct is to say that the code you are running is not suitable for the GPU because it cannot run in parallel - try to pull the computation out of the loop and do everything at once, or if those 'a' values are scalars, don't put them on the GPU in the first place.
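As a rough sketch of how the question's two measurements could be repeated with gputimeit instead of tic/toc: the random matrices below are stand-ins for a0..a5 (the real data is not shown in the thread), U and V are assumed to be plain host scalars, and a GPU with Parallel Computing Toolbox is assumed. The timings will differ from machine to machine.

```matlab
% Sketch: time the question's expression with gputimeit rather than tic/toc.
% Assumes Parallel Computing Toolbox and a supported GPU. The random
% matrices are placeholders for the a0..a5 arrays in the question.
n = 1000;
a0 = rand(n, 'gpuArray'); a1 = rand(n, 'gpuArray'); a2 = rand(n, 'gpuArray');
a3 = rand(n, 'gpuArray'); a4 = rand(n, 'gpuArray'); a5 = rand(n, 'gpuArray');
U = 0.5; V = 0.0;   % assumed to be host scalars

% Whole-matrix form: element-wise operations over all n*n elements.
wholeMatrix = @() (0.5.*(a5-a0)) - (a1+a2+a3) - (a5.*U./3.0) ...
                  - (a5.*V./2.0) + (0.25.*a5.*a4);

% Single-element form: each a?(1,1) is its own indexing operation
% on a gpuArray before any arithmetic happens.
oneElement = @() (0.5.*(a5(1,1)-a0(1,1))) - (a1(1,1)+a2(1,1)+a3(1,1)) ...
                 - (a5(1,1).*U./3.0) - (a5(1,1).*V./2.0) ...
                 + (0.25.*a5(1,1).*a4(1,1));

tWhole = gputimeit(wholeMatrix);
tOne   = gputimeit(oneElement);
fprintf('whole matrix: %.3g s, single element: %.3g s\n', tWhole, tOne);
```

gputimeit runs the function several times and synchronises with the device itself, so no explicit wait(gpuDevice) call is needed around it.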

ehsan monfared on 12 Dec 2014

Edited: ehsan monfared on 12 Dec 2014

Dear Knight, thank you for your comment. It has been the best answer I have found on the web. Actually, my real problem is a for-loop consisting of multiple lines that operate on 4000*4000 matrices. The good news is that most of the operations are done on the whole matrix rather than on specific elements, and therefore a huge speedup is gained on these lines compared with the CPU implementation. However, since the code solves a partial differential equation iteratively, at the end of the whole-matrix operations I have to correct (redefine) some special elements of the matrices at the corners and boundaries. In conclusion, while a huge speedup is achieved on most of the lines in the for-loop, the last lines of the loop, which correct those special elements, have proved to be very time consuming. Moreover, it is important to note that I found it inefficient to send those special lines to the CPU and then bring the results back to the GPU at each iteration. For more insight, here is a very simplified version of the code:

for i=1:10000
    RHO=FbarIN0+FbarIN1+FbarIN2;
    U=FbarIN1-FbarIN2;
    V=FbarIN2-FbarIN0;
    UU=U.*U;
    VV=V.*V;
    UV=U.*V;
    U2V2=UU+VV;
    Feq0=A0 .*( 1.0-(1.5.*U2V2) );
    Feq1=A1 .*( 1.0-(1.5.*U2V2)+(3.0.*U)+(4.5.*UU) );
    Feq2=A1 .*( 1.0-(1.5.*U2V2)+(3.0.*V)+(4.5.*VV) );
    FbarOUT0=FbarIN0-(OmegaF).*(FbarIN0-Feq0);
    FbarOUT1=FbarIN1-(OmegaF).*(FbarIN1-Feq1);
    FbarOUT2=FbarIN2-(OmegaF).*(FbarIN2-Feq2);
    FbarIN0= FbarOUT0;
    FbarIN1(2:NI,:)=FbarOUT1(1:NIM,:);
    FbarIN2(:,2:NJ)=FbarOUT2(:,1:NJM) ;
    FbarIN1(1,NJ)=FbarIN2(1,NJ) + ( (2.0./3.0).*RHO(1,NJ).*Uwnorth );
    FbarIN2(1,NJ)=(0.5.*(RHO(1,NJ)-FbarIN0(1,NJ)))- (FbarIN2(1,NJ)+FbarIN1(1,NJ)+0.5*FbarIN0(1,NJ))-(RHO(1,NJ).*Uwnorth./3.0)+(RHO(1,NJ).*Vwnorth./2.0)- (0.25.*RHO(1,NJ).*BUOYANCE(1,NJ));
end

Joss Knight on 12 Dec 2014

1. Can you do the calculation on the whole matrix (without any indexing) and then just index the result, i.e.

temp = (0.5.*(RHO-FbarIN0))- (FbarIN2+FbarIN1+0.5*FbarIN0)-(RHO.*Uwnorth./3.0)+(RHO.*Vwnorth./2.0)- (0.25.*RHO.*BUOYANCE);
FbarIN2(1,NJ) = temp(1,NJ);

...and then let me know whether you got a worthwhile speedup? If that's looking better, we can try using masks to see if that's faster, i.e.

mask = false(size(FbarIN2));
mask(1,NJ) = true;
...
FbarIN2(mask) = temp(mask);

It might not be any faster though.

2. Try storing the values of RHO(1,NJ), FbarIN0(1,NJ), etc. up front so the matrices don't have to be indexed multiple times.

3. Can you compile the equation into an arrayfun function and see what speedup that gives, both when you pass scalars and when you pass the whole matrix?

4. Show me what you're doing to gather the data back to the CPU and run the scalar computation there. There may be a way to do the gather that is more efficient.
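Suggestions 2 and 3 above might look something like the following sketch. The matrix sizes and random values are placeholders (the poster's real data is not shown), Uwnorth and Vwnorth are assumed to be host scalars, and cornerUpdate is a hypothetical name for the corner formula. The scalars are passed to arrayfun as arguments rather than captured from the workspace, which is the safer pattern for GPU arrayfun.

```matlab
% Sketch of suggestions 2 and 3. Assumes Parallel Computing Toolbox and a GPU.
% Sizes and values are placeholders, not the poster's real data.
NI = 400; NJ = 400;
RHO      = rand(NI, NJ, 'gpuArray');
FbarIN0  = rand(NI, NJ, 'gpuArray');
FbarIN1  = rand(NI, NJ, 'gpuArray');
FbarIN2  = rand(NI, NJ, 'gpuArray');
BUOYANCE = rand(NI, NJ, 'gpuArray');
Uwnorth = 0.1; Vwnorth = 0.0;   % assumed host scalars

% The corner formula from the loop, written once as an element-wise
% function (cornerUpdate is a hypothetical name).
cornerUpdate = @(rho, f0, f1, f2, b, uw, vw) ...
    (0.5.*(rho - f0)) - (f2 + f1 + 0.5.*f0) - (rho.*uw./3.0) ...
    + (rho.*vw./2.0) - (0.25.*rho.*b);

% Suggestion 2: index each matrix once, then work on the stored scalars.
rho = RHO(1,NJ);     f0 = FbarIN0(1,NJ);  f1 = FbarIN1(1,NJ);
f2  = FbarIN2(1,NJ); b  = BUOYANCE(1,NJ);
valScalar = cornerUpdate(rho, f0, f1, f2, b, Uwnorth, Vwnorth);

% Suggestion 3: let arrayfun compile the expression for the GPU,
% once with scalar inputs and once with the whole matrices.
valArrayfun = arrayfun(cornerUpdate, rho, f0, f1, f2, b, Uwnorth, Vwnorth);
temp = arrayfun(cornerUpdate, RHO, FbarIN0, FbarIN1, FbarIN2, BUOYANCE, ...
                Uwnorth, Vwnorth);

% Write the corrected corner value back into the matrix.
FbarIN2(1,NJ) = valArrayfun;
```

Each variant can be wrapped in a function handle and passed to gputimeit to see which one is fastest for the real matrix sizes.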

ehsan monfared on 12 Dec 2014

Joss, I am off-campus right now. I will test all the suggestions and let you know the results in a week. Once again I have to thank you for the great help.


Answer 1

matt dash on 10 Dec 2014

Edited: matt dash on 10 Dec 2014

0 votes

Calculating timings of GPU functions is very tricky business. You should read all about gpu occupancy and block sizes and all that good stuff. The short story is that more data does not always equal longer computation times.
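One way to see this on a given machine is a small size sweep, sketched below. It assumes Parallel Computing Toolbox and a supported GPU; the sizes and the a+a+a+a+a test expression are arbitrary choices borrowed from the comments, and the measured times will vary with hardware and MATLAB release.

```matlab
% Sketch: time the same element-wise expression for several array sizes.
% Assumes Parallel Computing Toolbox and a supported GPU.
sizes = [1 10 100 1000];
t = zeros(size(sizes));
for k = 1:numel(sizes)
    a = rand(sizes(k), 'gpuArray');      % sizes(k)-by-sizes(k) test matrix
    t(k) = gputimeit(@() a+a+a+a+a);     % gputimeit handles synchronisation
    fprintf('%4d x %-4d : %.3g s\n', sizes(k), sizes(k), t(k));
end
```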

Also, if you are really concerned with performance, you should write your calculations in a .cu file, compile it to a .ptx file, and call that from MATLAB instead of relying on MATLAB expressions. Read/implement the demo described here to see how much of a difference this makes: Mandelbrot Set Demo
