fminunc : A VERY STRANGE PROBLEM!

Question


Hi

I use fminunc to solve a minimization problem. fminunc makes hundreds of calls to a simple function, which I compiled for the gpu to improve performance.

This is what happens when I compare cpu and gpu for a single call (the test is run outside fminunc):

TVD=tvd_sim2_mex(x,y, lam, Nit,t);      % 0.018 s    
TVD=tvd_sim2(x,y, lam, Nit,t);          % 0.003 s    6x     

As you can see, the cpu version is 6x faster for a single call.

The gpu profiling tells me the problem is the gpu malloc.

And this is what happens when I call the function 1000 times:

for i=1:1000
    tic
    TVD=tvd_sim2_mex(x,y, lam, Nit,t);
    mytime(i)=toc;
end                                     % 0.0005 s    6x
TVD=tvd_sim2(x,y, lam, Nit,t);          % 0.003  s    

As you can see, here the gpu version is 6x faster.

Now, I can't know exactly what happens inside fminunc, but I can say without doubt that the only difference between the two situations is the function tvd_sim2. No modification has been made to fminunc.

The gpu code always frees its memory at the end of every single call.

The function tvd_sim2 is compiled only once, before fminunc runs.

This is what happens when I compare fminunc using tvd_sim2 with fminunc using tvd_sim2_mex (the function tvd_sim launches fminunc):

tic
[y, cost] = tvd_sim(x, lam, Nit,t);   % Run with tvd_sim2
toc
Solver stopped prematurely.
fminunc stopped because it exceeded the iteration limit,
options.MaxIterations = 5.000000e+01.
Elapsed time is 48.020835 seconds.
and:
tic
[y, cost] = tvd_sim(x, lam, Nit,t);   % Run with tvd_sim2_mex
toc
Solver stopped prematurely.
fminunc stopped because it exceeded the iteration limit,
options.MaxIterations = 5.000000e+01.
Elapsed time is 179.953791 seconds.

In short: why does it run slower inside fminunc even though it runs faster on its own?

I thought it might be because in my 1000-iteration for loop the variable y is always the same, whereas fminunc changes it on every call as part of the optimization.

But this turns out not to be the issue:

for i=1:1000
    y=rand(4096,1);
    tic
    TVD=tvd_sim2_MEX_mex(x,y, lam, Nit,t);
    mytime(i)=toc;
end
disp('mean time:');
disp(mean(mytime));
mean time:
   5.5624e-04
   

The fact is that the gpu reallocates the memory on every function call, so it makes no difference whether the input is the same or different!

I have attached screenshots of the function's run with and without the gpu. As you can see, almost all the time is spent in this function.

Which environmental variable (in the broad sense) can change the performance of a gpu running the same code so substantially?

Or, contrary to appearances, is it not the same code?

Thanks!

23 Comments

Emiliano Rosso on 9 Nov 2022

Edited: Emiliano Rosso on 9 Nov 2022


To get comparable metrics, I profiled both versions with the same loop:

for i=1:307275
    y=rand(4096,1);
    TVD=tvd_sim2_MEX(x,y, lam, Nit,t);
end
and
for i=1:307275
    y=rand(4096,1);
    TVD=tvd_sim2_MEX_mex(x,y, lam, Nit,t);
end

Profiler, inside fminunc:

avggputime = 168/307275 = 5.4674e-04

avgcputime = 31/307275 = 1.0089e-04

Profiler, for loop run 307275 times:

avggputime = 333/307275 = 1.0837e-03

avgcputime = 37/307275 = 1.2041e-04

fminuncratio = gpu/cpu = 5.4674e-04 / 1.0089e-04 = 5.42, so the gpu is 5.42x slower than the cpu

forloopratio = gpu/cpu = 1.0837e-03 / 1.2041e-04 = 9.00, so the gpu is 9.00x slower than the cpu
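The averages and ratios can be recomputed directly from the profiler totals quoted in this comment. This is only a quick arithmetic check; the variable names are made up for the illustration, and the totals are assumed to be in seconds.

```matlab
% Profiler totals and call count as quoted in the comment
% (units assumed to be seconds; variable names are illustrative).
nCalls = 307275;
gpuFminunc = 168;  cpuFminunc = 31;   % totals measured inside fminunc
gpuLoop    = 333;  cpuLoop    = 37;   % totals measured in the plain for loop

% Average time per call
avgGpuFminunc = gpuFminunc / nCalls;
avgCpuFminunc = cpuFminunc / nCalls;
avgGpuLoop    = gpuLoop / nCalls;
avgCpuLoop    = cpuLoop / nCalls;

% GPU/CPU slowdown factors (the call count cancels out)
fminuncRatio = avgGpuFminunc / avgCpuFminunc;
loopRatio    = avgGpuLoop / avgCpuLoop;

fprintf('fminunc:  GPU/CPU = %.2f\n', fminuncRatio);
fprintf('for loop: GPU/CPU = %.2f\n', loopRatio);
```

With the unrounded totals the two factors come out at about 5.42 and 9.00, so in both settings the gpu version is the slower one.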

This is a big mistake!

So the only problem was tic/toc, which gave me a false impression, and the gpu really is slower than the cpu for this function?

"Just the cpu suddenly is faster x 30 on FMINUNC (from 3 to 0.1ms), But NOT that the GPU get slower by any stretch (about 0.5 ms)."

That is probably not true either: it comes from comparing different metrics, one measured by the profiler and the other by tic/toc.

So have I solved the mystery simply by dispelling an illusion?

Emiliano Rosso on 9 Nov 2022

Edited: Emiliano Rosso on 9 Nov 2022

Hi Bruno Luong,

OK, but the metrics gave different results, and that was what I had to understand. Using tic/toc and comparing different metrics gave unclear results and created illusions.

That was the reason for my objection.

I too consider the problem closed, and I thank you for your cooperation and patience.

Matt J,

do you mean that I need a version of fminunc (or another solver with the same performance) that performs the operations outside tvd_sim2_mex, such as the gradient calculation, directly on gpuArray data, without ever converting the intermediate results?

So the time is spent not in the physical data transfer from the gpu to the workspace but in the conversion from gpuArray to a double array, which costs cpu time?

I tried fminsearch first, but its execution time is so long that it is unusable.

I use 5 unknown variables, but two of them are 4096x1. Does that mean the number of unknowns is 4096*2 + 3 = 8195?

Is this the reason why I can't use fminsearch for my problem?

Thanks for everything!

Bruno Luong on 9 Nov 2022

Edited: Bruno Luong on 9 Nov 2022

Please read the doc for fminunc, especially the part where the option 'SpecifyObjectiveGradient' is true.

If you don't provide the gradient, MATLAB calls the objective 4000-8000 times to estimate the gradient by finite differences. If you do provide it, MATLAB does not need those 4000 evaluations and gets the gradient directly from your function. Imagine the time you could save.
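As a rough sketch of what supplying the gradient looks like, the script below uses a hypothetical smooth stand-in objective (smoothObjective, a quadratic data term plus a quadratic smoothness penalty). It is not the poster's tvd_sim2, and the size, weight, and names are assumptions chosen for illustration. The analytic gradient is first checked against central differences, then passed to fminunc if the Optimization Toolbox is available.

```matlab
% Sketch only: smoothObjective is a hypothetical stand-in, NOT tvd_sim2.
% Local functions in a script need R2016b or later.
rng default
n   = 256;                 % illustrative size (the thread uses 4096)
y0  = randn(n,1);          % noisy signal
lam = 2;                   % illustrative smoothing weight
fun = @(x) smoothObjective(x, y0, lam);

% Check the analytic gradient against central differences.
x0 = randn(n,1);
[~, g] = fun(x0);
h = 1e-6;
gfd = zeros(n,1);
for k = 1:n
    e = zeros(n,1);
    e(k) = h;
    gfd(k) = (fun(x0 + e) - fun(x0 - e)) / (2*h);   % 2 evaluations per variable
end
maxGradError = max(abs(g - gfd));
fprintf('max gradient error vs finite differences: %.2e\n', maxGradError);

% Run fminunc with the gradient supplied (needs Optimization Toolbox).
if exist('fminunc', 'file') == 2
    options = optimoptions('fminunc', 'SpecifyObjectiveGradient', true, ...
        'Algorithm', 'trust-region', 'Display', 'off');
    [xden, fval, exitflag, output] = fminunc(fun, y0, options);
    fprintf('objective evaluations used: %d\n', output.funcCount);
end

function [f, g] = smoothObjective(x, y0, lam)
    % f = 1/2*sum((x-y0).^2) + lam/2*sum(diff(x).^2)
    d = diff(x);
    f = 0.5*sum((x - y0).^2) + 0.5*lam*sum(d.^2);
    if nargout > 1                      % gradient only when requested
        g = x - y0;
        g(1:end-1) = g(1:end-1) - lam*d;
        g(2:end)   = g(2:end)   + lam*d;
    end
end
```

The finite-difference loop makes the cost visible: estimating the gradient numerically takes a number of objective evaluations proportional to the number of variables, while the analytic version returns it in the same call as the objective value.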

Emiliano Rosso on 9 Nov 2022

REMARKABLE !!!

I must take time...

Thanks!


Answer 1

Matt J on 7 Nov 2022

Edited: Matt J on 7 Nov 2022


If I had to guess, the GPU cannot achieve faster speeds because fminunc requires that you pull the results of GPU computation back to the CPU after every call to the objective function. This is because fminunc has to do intermediate computations of its own which must take place on the CPU. It is plausible that the overhead of the CPU-GPU transfers, given the simplicity of your objective, is dominating the computation time.

Why some of your timing experiments do not bear this out is unclear, but as Walter says, it is not clear that your timing methods are valid. tic and toc by themselves are not reliable unless you do something to synchronize the GPU with MATLAB. You should probably be using gputimeit instead, or something in your CUDA code, I guess (__syncthreads()?).
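A minimal sketch of the timing point, assuming a supported GPU and Parallel Computing Toolbox are available. The matrix size and the workload are arbitrary placeholders, not the poster's function. It compares tic/toc without synchronization, tic/toc with wait on the device, and gputimeit.

```matlab
% Sketch only: the workload below is a placeholder, not tvd_sim2_mex.
hasGpu = canUseGPU;                    % false if no usable GPU is present
if hasGpu
    D = gpuDevice;
    A = rand(2000, 'gpuArray');        % arbitrary test data on the GPU
    work = @() sum(sum(A * A));        % arbitrary GPU workload

    wait(D);                           % make sure the GPU is idle first
    tic
    r1 = work();
    tUnsynced = toc;                   % may stop before the GPU has finished

    wait(D);
    tic
    r2 = work();
    wait(D);                           % block until the GPU work is done
    tSynced = toc;

    tGputimeit = gputimeit(work);      % handles synchronization and repeats

    fprintf('tic/toc, no wait:   %.6f s\n', tUnsynced);
    fprintf('tic/toc, with wait: %.6f s\n', tSynced);
    fprintf('gputimeit:          %.6f s\n', tGputimeit);
else
    disp('No supported GPU available; skipping the timing comparison.');
end
```

On the CUDA side, host-level synchronization is normally done with cudaDeviceSynchronize(); __syncthreads() is a barrier between the threads of one block inside a kernel and does not make the host wait.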

10 Comments

Emiliano Rosso on 8 Nov 2022

Edited: Emiliano Rosso on 8 Nov 2022


Thanks for the answer!

" They don't seem very big to me, only 0.0025 sec. I'd be curious to know what gputimeit() gives, or even just timeit()."

Ok, I tried your suggestion:

% Timings are stored in their own variables so that the threshold t
% passed to the function is not overwritten.
tsingle=gputimeit(@()tvd_sim2_mex(x,y, lam, Nit,t));      % 0.0012 s
tmean=zeros(1,1000);
for i=1:1000
   tmean(i)=gputimeit(@()tvd_sim2_mex(x,y, lam, Nit,t));  % 0.0014 s
end
disp(mean(tmean));
tmean=zeros(1,1000);
for i=1:1000
   y=rand(4096,1);
   tmean(i)=gputimeit(@()tvd_sim2_mex(x,y, lam, Nit,t));  % 0.0012 s
end
disp(mean(tmean));

The timings are all similar.

According to this new test the gpu is about 2.5x faster than the cpu:

TVD=tvd_sim2(x,y, lam, Nit,t); % 0.003 s

and this again contradicts the performance inside fminunc.

Hi Hariprasad Ravishanka, thanks for the answer. This is my function:

function [TVD] = tvd_sim2_MEX(x,y, lam, Nit,t)  %#codegen
coder.gpu.kernelfun
[n,m]=size(y);    
diffxx=x(2:n,1)-x(1:n-1,1);
TVD=1/2.*sum(abs(((y-x)./((double(abs(y)-t>0).*y./t)+...
    double(~(double(abs(y)-t>0).*y./t))).^2).^2)) + ...
    lam.*sum(abs(diffxx(2:n-1,1)-diffxx(1:n-2,1)));
end

It's a modified version of total variation denoising. It's the original formula without numerical approximation, and I added some constraints to keep fminunc from picking up bad habits. This is the function to be minimized.

"I also noticed that cfg.GpuConfig.EnableMemoryManager was not turned on. Is there a reason for this?"

I tried setting it to true earlier, but it makes no difference.

Matt J on 9 Nov 2022

Edited: Matt J on 9 Nov 2022

Yes. It would also be good to see the code that invokes fminunc, in particular what optimoptions are used.

Emiliano Rosso on 9 Nov 2022


Here is the code:

function [xden,fval] = tvd_sim(y, lam, Nit,t)
rng default % For reproducibility
[n,m]=size(y);
y0=y;
ObjectiveFunction = @(y) tvd_sim2(y,y0,lam,Nit,t);
options = optimoptions('fminunc','MaxIter',50,'ObjectiveLimit',0,'MaxFunEvals',...
                        Inf,'TolFun',1e-06,'UseParallel',false);
[xden,fval] = fminunc(ObjectiveFunction,y,options);
end
function [TVD] = tvd_sim2(x,y, lam, Nit,t)  %#codegen
coder.gpu.kernelfun
[n,m]=size(y);     % x Nx1 columnwise is denoised
  
diffxx=x(2:n,1)-x(1:n-1,1);
TVD=1/2.*sum(abs(((y-x)./((double(abs(y)-t>0).*y./t)+double(~(double(abs(y)-t>0)...
             .*y./t))).^2).^2)) + lam.*sum(abs(diffxx(2:n-1,1)-diffxx(1:n-2,1)));
end

and this is the cpu timing with timeit:

% The timing is stored in tmean so that the threshold t is not overwritten.
tmean=zeros(1,1000);
for i=1:1000
    tmean(i)=timeit(@()tvd_sim2(x,y, lam, Nit,t));
end
disp(mean(tmean));

1.1071e-04

compared with 0.0012 for the gpu.

ratio gpu/cpu = 0.0012 / 1.1071e-04 = 10.84

The gpu is about 10x slower than the cpu.

That's what I'm finding here as well:

https://it.mathworks.com/matlabcentral/answers/1845178-fminunc-a-very-strange-problem#comment_2455708

This is a big mistake!

So the only problem was tic/toc, which gave me a false impression, and the gpu really is slower than the cpu for this function?

So have I solved the mystery simply by dispelling an illusion?


Answer 2

Ram Kokku on 8 Nov 2022

Edited: Walter Roberson on 8 Nov 2022


Hi @Emiliano Rosso,

As my colleague Hariprasad mentioned, GPU Coder is capable of the following:

Allocating memory once and reusing it for subsequent calls. Use cfg.GpuConfig.EnableMemoryManager = true; to enable this.
Taking a MATLAB gpuArray as input. You are doing this already, but it may not always help. For example, if GPU Coder chooses to keep the first use of a particular input on the CPU (for some reason), it incurs an additional copy.

Further,

You can use gpucoder.profile ( https://www.mathworks.com/help/gpucoder/ref/gpucoder.profile.html ) to find the bottlenecks.
Cell arrays and structures may not play well with GPU Coder with regard to copies. Consider breaking the cell array elements out into separate variables.
Take a look at the generated code and check whether GPU Coder is able to parallelize the key piece of your code.
If you are open to sharing your code, I can take a quick look.

5 Comments

Emiliano Rosso on 8 Nov 2022

Edited: Emiliano Rosso on 8 Nov 2022


Hi, thanks for the answer.

I tried setting:

cfg.GpuConfig.EnableMemoryManager = true;

earlier, but it makes no difference.

This is my function:

function [TVD] = tvd_sim2(x,y, lam, Nit,t)  %#codegen
coder.gpu.kernelfun
[n,m]=size(y);    
diffxx=x(2:n,1)-x(1:n-1,1);
TVD=1/2.*sum(abs(((y-x)./((double(abs(y)-t>0).*y./t)+...
    double(~(double(abs(y)-t>0).*y./t))).^2).^2)) + ...
    lam.*sum(abs(diffxx(2:n-1,1)-diffxx(1:n-2,1)));
end

It's a modified version of total variation denoising. It's the original formula without numerical approximation, and I added some constraints to keep fminunc from picking up bad habits. This is the function to be minimized.

As for your other suggestions, they will certainly be very useful and I will use them in due course, but there is an underlying problem that I have to solve before moving on to normal optimization. It seems that under the same conditions, apart from being called inside fminunc, the same function behaves differently, and I can't understand why.

Bruno Luong on 8 Nov 2022

Edited: Bruno Luong on 9 Nov 2022


sum(abs((diffyx./ycut.^2).^2))

abs has no effect here: for real data the squared terms are already non-negative.

The explicit casts to double are not necessary either, but they probably do no harm.

To summarize, I would rather code it like this (Warning: not tested):

ycut=((abs(y)-t)>0).*y./t; % same grouping as abs(y)-t>0 in the original
ycut=ycut+~ycut;           % replace zeros by ones to avoid dividing by zero
diffxx=x(2:n,1)-x(1:n-1,1);
diffxx=diffxx(2:n-1,1)-diffxx(1:n-2,1);
TVD=1/2.*sum(((y-x)./ycut.^2).^2) + lam.*sum(abs(diffxx));
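One way to check a refactoring like this is to evaluate both forms on random data and compare. The sketch below does that for the original expression from the thread and a simplified form. It groups the mask as (abs(y)-t)>0, written here as abs(y) > t, which is how MATLAB parses abs(y)-t>0 because subtraction binds tighter than the comparison. The values of n, t, and lam are arbitrary test values, not the poster's.

```matlab
% Sketch only: n, t and lam are arbitrary test values.
rng default
n   = 4096;
t   = 0.5;
lam = 0.1;
y   = randn(n,1);
x   = y + 0.1*randn(n,1);

% Original expression as posted in the thread
diffxx  = x(2:n,1) - x(1:n-1,1);
tvdOrig = 1/2.*sum(abs(((y-x)./((double(abs(y)-t>0).*y./t) + ...
    double(~(double(abs(y)-t>0).*y./t))).^2).^2)) + ...
    lam.*sum(abs(diffxx(2:n-1,1) - diffxx(1:n-2,1)));

% Simplified form: no casts, no redundant abs around the squared term
ycut   = (abs(y) > t).*y./t;   % zero where abs(y) <= t
ycut   = ycut + ~ycut;         % replace those zeros by ones
d2     = diff(x, 2);           % second difference of x
tvdNew = 1/2.*sum(((y-x)./ycut.^2).^2) + lam.*sum(abs(d2));

relErr = abs(tvdOrig - tvdNew) / abs(tvdOrig);
fprintf('relative difference: %.2e\n', relErr);
```

A relative difference at the level of floating-point round-off indicates that the two forms compute the same value for this input.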

Emiliano Rosso on 8 Nov 2022

Edited: Emiliano Rosso on 8 Nov 2022

"abs has no effect."

Yes, it's true, I'll modify it!

I've just seen your code. I'll try it and verify it!

Thanks!


Answer 3

Bruno Luong on 8 Nov 2022

Edited: Bruno Luong on 8 Nov 2022


Just shooting in the dark here: did you set the UseParallel option of fminunc to true or false? It could be that the gradient computation is efficient on the CPU but not on the GPU, depending on this option.

Also, your objective function is not very differentiable, with all the logical operations and abs. It could be that fminunc has a hard time optimizing it, and that the run time is more sensitive to numerical truncation, which differs between the GPU MEX and the CPU MATLAB versions.

BTW, the objective function is simple enough to compute an analytic gradient.

11 Comments

Emiliano Rosso on 9 Nov 2022

Edited: Emiliano Rosso on 9 Nov 2022


The documentation section "Measure and Improve GPU Performance"

https://it.mathworks.com/help/parallel-computing/measure-and-improve-gpu-performance.html#mw_6a8912e8-deec-4b95-96a7-4e7ae69958a8

suggests using tic/toc this way:

D = gpuDevice;
wait(D)
tic
[L,U] = lu(A);
wait(D)
toc

...if this can help...

Bruno Luong on 9 Nov 2022

"if this can help"

Certainly. I'll remember to do the wait the next time I use tic-toc with GPU code.
