fminunc : A VERY STRANGE PROBLEM!

Hi
I use fminunc to solve a minimization problem. fminunc makes hundreds of calls to a simple function that I optimized for the GPU to improve performance.
This is what happens when I compare the CPU and GPU for a single call (this test is external to fminunc):
TVD=tvd_sim2_mex(x,y, lam, Nit,t); % 0.018 s
TVD=tvd_sim2(x,y, lam, Nit,t); % 0.003 s 6x
As you can see, the CPU is 6x faster.
The GPU profiling tells me the problem is GPU memory allocation (malloc).
And this is what happens when I call the function 1000 times:
for i=1:1000
tic
TVD=tvd_sim2_mex(x,y, lam, Nit,t);
mytime(i)=toc;
end % 0.0005 s 6x
TVD=tvd_sim2(x,y, lam, Nit,t); % 0.003 s
As you can see, now the GPU is 6x faster.
Now... I can't know exactly what happens inside the fminunc function, but I can surely say without doubt that the only difference between the two situations is the function tvd_sim2. No modification to fminunc has been made.
The GPU always frees its memory at the end of every single call.
The function tvd_sim2 is compiled only once, before fminunc.
This is what happens when I compare fminunc using tvd_sim2 versus tvd_sim2_mex (the function tvd_sim launches fminunc):
tic
[y, cost] = tvd_sim(x, lam, Nit,t); % Run with tvd_sim2
toc
Solver stopped prematurely.
fminunc stopped because it exceeded the iteration limit,
options.MaxIterations = 5.000000e+01.
Elapsed time is 48.020835 seconds.
and:
tic
[y, cost] = tvd_sim(x, lam, Nit,t); % Run with tvd_sim2_mex
toc
Solver stopped prematurely.
fminunc stopped because it exceeded the iteration limit,
options.MaxIterations = 5.000000e+01.
Elapsed time is 179.953791 seconds.
In a few words... why is it slower inside fminunc even though it is faster outside?
I thought: in my "1000 times" for loop the variable y is always the same, while fminunc changes it on every call; this is due to the optimization.
But this is a false lead:
for i=1:1000
y=rand(4096,1);
tic
TVD=tvd_sim2_MEX_mex(x,y, lam, Nit,t);
mytime(i)=toc;
end
disp('mean time:');
disp(mean(mytime));
mean time:
5.5624e-04
The fact is the GPU reallocates the memory on every function call; there's no difference between an identical input and a different one!
I attach screenshots of the function's run with and without the GPU. As you can see, almost all the time is spent in this function.
Which environmental variable (in the broad sense) can so substantially modify the performance of a GPU running the same code?
Or, contrary to the evidence, is it not really the same code?
Thanks!

23 Comments

I do not see a gather() in your GPU timing. Timing is not valid without a gather().
I don't use gpuArray.
These are my ARGS:
ARGS = cell(1,1);
ARGS{1} = cell(5,1);
ARGS{1}{1} = coder.typeof(0,[mex1 mex2]);
ARGS{1}{2} = coder.typeof(0,[mex1 mex2]);
ARGS{1}{3} = coder.typeof(0);
ARGS{1}{4} = coder.typeof(0);
ARGS{1}{5} = coder.typeof(0);
and this is my codegen:
cfg = coder.gpuConfig('mex');
cfg.GpuConfig.CompilerFlags = '--fmad=false';
cfg.GenerateReport = true;
%cfg.GpuConfig.MallocMode='unified';
cfg.GpuConfig.ComputeCapability='5.2';
%cfg.GpuConfig.EnableMemoryManager = true;
%cfg.GpuConfig.Benchmarking = true;
cfg.MATLABSourceComments=true;
codegen -config cfg tvd_sim2_MEX -args ARGS{1}
I tried many ways :
ARGS{1}{1} = coder.typeof(0,[mex1 mex1]);
ARGS{1}{1} = coder.typeof(1,[mex1 mex1]);
ARGS{1}{1} = coder.typeof(0,[mex1 mex1], 'Gpu', true);
ARGS{1}{1} = coder.typeof(1,[mex1 mex1], 'Gpu', true);
ARGS{1}{1} = zeros(mex1, mex1);
ARGS{1}{1} = zeros(mex1, mex1, 'gpuArray');
but gpuArray is slower.
How large are the inputs you are passing to the MEX? (I.e. what are the values of mex1 and mex2?)
In the case the input sizes are small, I believe Matt J's answer is most relevant. The computation of the MEX would be too small to justify performing the computation on GPU. The generated code (assuming R2020b) contains several kernel launches and cudaMemcpy synchronization calls, and executing these calls will only bottleneck the computation if the amount of computation performed on the GPU is not sufficiently large.
Emiliano Rosso
Emiliano Rosso on 8 Nov 2022
Edited: Emiliano Rosso on 9 Nov 2022
mex1,mex2 are 1,4096
but all the considerations you made carry the same weight inside fminunc and outside it... or not?
What's the reason for that difference?
That's what I want to solve.
Thanks!
I discovered that all tvd_sim2_mex function calls are made by fminunc (hooo!!!), which calls fminusub, which calls a few functions, of which the most important is lineSearch. All these functions are p-code and I can't profile inside them.
Does this add anything new?
Thanks!
Bruno Luong
Bruno Luong on 8 Nov 2022
Edited: Bruno Luong on 8 Nov 2022
"I can't profile in them."
Why do you want to profile code outside the place where the discrepancy is reported by the profiler, namely the time of tvd_sim2_mex vs tvd_sim2?
Furthermore, the profile times are reported globally as self-time, and you can see that the biggest part is taken by tvd_sim2_mex alone.
@Justin Hontz "The generated code (assuming R2020b) contains several kernel launches and cudaMemcpy synchronization calls, and executing these calls will only bottleneck the computation if the amount of computation performed on the GPU is not sufficiently large."
But that should also happen in the for-loop test, no? But the for-loop clearly shows the gpu-mex is faster. Only when it is called from fminunc is the GPU slower.
Yes, what you say is true; I'd just like to see what in the environment differs.
"executing these calls will only bottleneck the computation if the amount of computation performed on the GPU is not sufficiently large."
but the amount of computation is the same in both cases.
You understood my problem.
Thanks!
Something I don't understand: in the screenshot the execution times are 31 s and 168 s respectively for cpu and mex-gpu, with 307275 function calls. That makes an average time per function call of
avggputime = 168.3/307275
avggputime = 5.4772e-04
avgcputime = 31.022/307275
avgcputime = 1.0096e-04
The GPU time is then compatible with what you measured with the for-loop of 1000 calls;
for i=1:1000
tic
TVD=tvd_sim2_mex(x,y, lam, Nit,t);
mytime(i)=toc;
end % 0.0005 s 6x
TVD=tvd_sim2(x,y, lam, Nit,t); % 0.003 s
So the CPU suddenly gets ~30x faster under fminunc (from 3 ms down to 0.1 ms), but the GPU does NOT get slower by any stretch (about 0.5 ms).
Do I understand correctly, or am I missing something?
You should post the exact code of the various timings, not snippets, so we can verify what you are doing.
I tried to profile this code to have the same metrics:
for i=1:307275
y=rand(4096,1);
TVD=tvd_sim2_MEX(x,y, lam, Nit,t);
end
and
for i=1:307275
y=rand(4096,1);
TVD=tvd_sim2_MEX_mex(x,y, lam, Nit,t);
end
profiler fminunc
avggputime = 168/307275
avggputime = 5.4674e-04
avgcputime = 31/307275
avgcputime = 1.0089e-04
profiler for loop 307275 times
avggputime = 333/307275
avggputime = 1.0837e-03
avgcputime = 37/307275
avgcputime = 1.2041e-04
fminuncratio = gpu/cpu = 5.4674e-04 / 1.0089e-04 = 5.42, i.e. the gpu is 5.42x slower than the cpu
forloopratio = gpu/cpu = 1.0837e-03 / 1.2041e-04 = 9.0, i.e. the gpu is 9x slower than the cpu
This is a big mistake!
So the only problem was tic/toc, which gave me a false impression, and GPU performance really is simply slower than the CPU here?
"Just the cpu suddenly is faster x 30 on FMINUNC (from 3 to 0.1ms), But NOT that the GPU get slower by any stretch (about 0.5 ms)."
Probably this is not true; it's due to the comparison between different metrics, one from the profiler and one from tic/toc.
So I would have solved the mystery simply by dissolving an illusion?
"but really gpu performance is naturally slower than cpu?"
We and Matt (who was the first of us) have told you that the GPU needs to transfer data back and forth, and that this is not negligible compared to the arithmetic operations. So there is no surprise.
And I think this thread can be closed; no more mystery as far as I'm concerned.
Matt J
Matt J on 9 Nov 2022
Edited: Matt J on 9 Nov 2022
We and Matt (who was the first of us) have told you that the GPU needs to transfer data back and forth and that is not negligible
However, you might be able to achieve better efficiency if you rewrite your GPU mex so that instead of exchanging input/output with the CPU, it leaves the data on the GPU, returning output and accepting input in the form of gpuArrays. This could make the whole iterative loop of the optimization execute on the GPU, avoiding CPU/GPU transfers.
You would need to find an optimization solver that supports gpuArray input. Optimization Toolbox solvers like fminunc do not, but fminsearch does and there might be some 3rd party unconstrained solvers on the File Exchange. Be mindful, of course, that fminsearch only works well for problems with a small number of unknown variables (<=6).
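A minimal sketch of this keep-it-on-the-GPU idea (the function name myObjGrad, the fixed step size, and the iteration count are purely illustrative assumptions, not part of the original code): a hand-rolled descent loop whose state never leaves the GPU, so only a single gather() happens at the end.

```matlab
% Hedged sketch: a descent loop that keeps everything on the GPU.
% 'myObjGrad' is hypothetical and must return f and g as gpuArrays.
x = gpuArray.rand(4096, 1);    % unknowns live on the device
step = 1e-3;                   % illustrative fixed step size
for k = 1:500
    [f, g] = myObjGrad(x);     % no CPU/GPU transfer inside the loop
    x = x - step * g;          % update stays on the device
end
xFinal = gather(x);            % single transfer back to the CPU
```

With this structure the per-call gpuArray/double conversion that fminunc forces simply never happens.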
Matt J
Matt J on 9 Nov 2022
Edited: Matt J on 9 Nov 2022
Incidentally, the fact that Optimization Toolbox solvers do not support GPU data types is something I've lamented and expressed to the MathWorks staff:
Emiliano Rosso
Emiliano Rosso on 9 Nov 2022
Edited: Emiliano Rosso on 9 Nov 2022
Hi
Bruno Luong
Ok, but the metrics gave different results and that is what I had to understand; the use of tic/toc and comparisons between different metrics gave unclear results and generated illusions.
That was the reason for my dispute.
I too consider the problem closed and I thank you for your cooperation and patience.
Matt J
do you mean that I have to find a version of fminunc (or another solver with the same performance) that performs the operations external to tvd_sim2_mex (such as the gradient calculation) without ever converting the intermediate results, but leaving them on the GPU and operating on them as gpuArrays?
So the time spent doesn't depend on the physical data transfer from GPU to workspace, but on the conversion from gpuArray to double array, which is CPU-consuming?
I tried fminsearch first, but it gives very long execution times; impossible to use it.
I use 5 unknown variables, but two of them are 4096x1, so
must the unknown variables count be 4096*2 + 3 = 8195?
Is this the reason why I can't use fminsearch for my problem?
Thanks for all!
Matt J
Matt J on 9 Nov 2022
Edited: Matt J on 9 Nov 2022
do you mean that I have to find a version of fminunc (or another solver with the same performance) that performs the operations external to tvd_sim2_mex (such as the gradient calculation) without ever converting the intermediate results but leaving them
@Emiliano Rosso To be clear, fminunc will not translate gpuArray results into CPU results. It will throw an error unless your tvd_sim2_mex delivers a CPU double array output. You have to find an alternative optimization routine that will not throw such an error if your mex were to leave its results on the GPU as a gpuArray. There is no real reason why fminunc should require this, that I can see. All the "external operations" that fminunc does are simple matrix algebra operations that gpuArray should already support.
That would be a serious candidate for an enhancement request. Of course it should extend to fmincon, linprog, quadprog, lsqnonlin, etc...
In the thread I linked to above @Joss Knight said he had "captured" my comments, so I assume that to be as good as an enhancement request. That was 2018, however, and I have not heard of any follow up to it at MAB meetings and such.
Matt J
Matt J on 9 Nov 2022
Edited: Matt J on 9 Nov 2022
must the unknown variables count be 4096*2 + 3 = 8195? Is this the reason why I can't use fminsearch for my problem?
@Emiliano Rosso Yes, that's too big. Maybe you can find a nonlinear conjugate gradient solver on the File Exchange or GitHub which uses only M-Coded matrix operations.
Bruno Luong
Bruno Luong on 9 Nov 2022
Edited: Bruno Luong on 9 Nov 2022
@Emiliano Rosso you should think about providing the gradient to fminunc; your model looks simple enough to do it without much trouble. It will accelerate things significantly regardless of the CPU or GPU implementation of the objective function.
Thanks Matt. I've refreshed their collective memories on the discussion in Optim. However, you should definitely bring this sort of thing up at appropriate MAB sessions.
Emiliano Rosso
Emiliano Rosso on 9 Nov 2022
Edited: Emiliano Rosso on 9 Nov 2022
What do you mean by "providing the gradient to fminunc"?
"your model looks simple enough to do it without much trouble"?
Do you mean, for example, that I should calculate a function of the TVD velocity and give it to fminunc as the gradient?
Thanks!
Bruno Luong
Bruno Luong on 9 Nov 2022
Edited: Bruno Luong on 9 Nov 2022
Please read the doc for fminunc, especially the part where the option 'SpecifyObjectiveGradient' is true.
If you don't provide the gradient, MATLAB calls the objective 4000-8000 times just to compute the gradients by finite differences; if you do, MATLAB does not need those ~4000 evaluations to estimate the gradient but gets it from your function. Imagine the time you could save.
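As a generic illustration of this suggestion (the quadratic objective below is only a stand-in; the actual TVD gradient would have to be derived by hand):

```matlab
% Sketch: supplying an analytic gradient so fminunc does not estimate it
% with thousands of finite-difference objective calls.
x0 = rand(4096, 1);
opts = optimoptions('fminunc', ...
    'SpecifyObjectiveGradient', true, ...  % we return [f, g] ourselves
    'Algorithm', 'trust-region');          % algorithm that uses the supplied gradient
[xmin, fval] = fminunc(@objgrad, x0, opts);

function [f, g] = objgrad(x)
% Stand-in objective; replace with the TVD cost and its gradient.
f = sum((x - 1).^2);   % cost
g = 2*(x - 1);         % its analytic gradient
end
```

With 4096 unknowns this turns thousands of objective evaluations per iteration into one.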
REMARKABLE !!!
I must take time...
Thanks!


 Accepted Answer

Matt J
Matt J on 7 Nov 2022
Edited: Matt J on 7 Nov 2022
If I had to guess, the GPU cannot achieve faster speeds because fminunc requires that you pull the results of GPU computation back to the CPU after every call to the objective function. This is because fminunc has to do intermediate computations of its own which must take place on the CPU. It is plausible that the overhead of the CPU-GPU transfers, given the simplicity of your objective, is dominating the computation time.
Why some of your timing experiments do not bear this out is unclear, but as Walter says, it is not clear that your timing methods are valid. tic and toc by themselves are not reliable unless you do something to synchronize the GPU with MATLAB. You should probably be using gputimeit instead, or something in your CUDA code, I guess (__syncthreads()?).
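For reference, a sketch of both timing approaches, assuming the same workspace variables (x, y, lam, Nit, t) as in the question:

```matlab
% 1) gputimeit synchronizes the GPU and averages over runs internally:
tmex = gputimeit(@() tvd_sim2_mex(x, y, lam, Nit, t));

% 2) tic/toc is only meaningful if the GPU is explicitly synchronized:
D = gpuDevice;
wait(D);            % drain any pending GPU work before starting the clock
tic
TVD = tvd_sim2_mex(x, y, lam, Nit, t);
wait(D);            % make sure the kernels have actually finished
tElapsed = toc;
```

Without the wait() calls, toc can fire while kernels are still running, which is one way the loop timings above can mislead.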

10 Comments

I've understood what you mean, but when I run:
for i=1:1000
tic
TVD=tvd_sim2_mex(x,y, lam, Nit,t);
mytime(i)=toc;
end
TVD is returned to the workspace on every function call, so when you say "fminunc requires that you pull the results of GPU computation back to the CPU after every call to the objective function", this is true for my simple external for loop too.
As for the use of tic toc, the time differences are so big that I think it's not plausible to discard them only because of how they were measured.
Matt J
Matt J on 7 Nov 2022
Edited: Matt J on 7 Nov 2022
For what about the use of tic toc, the time differences are so big that I think it's not plausible to discard them.
They don't seem very big to me, only 0.0025 sec. I'd be curious to know what gputimeit() gives, or even just timeit().
Would it be possible for you to share the tvd_sim2 function so that we can profile it at our end?
I also noticed that cfg.GpuConfig.EnableMemoryManager was not turned on. Is there a reason for this?
Hari
Thanks for the answer!
" They don't seem very big to me, only 0.0025 sec. I'd be curious to know what gputimeit() gives, or even just timeit()."
Ok, I tried your suggestion:
t=zeros(1,1);
t=gputimeit(@()tvd_sim2_mex(x,y, lam, Nit,t)); % 0.0012 s
t=zeros(1,1);
for i=1:1000
t=gputimeit(@()tvd_sim2_mex(x,y, lam, Nit,t)); % 0.0014 s
tmean(i)=t;
end
disp(mean(tmean)); % mean(t) in the original averaged a scalar
t=zeros(1,1);
for i=1:1000
y=rand(4096,1);
t=gputimeit(@()tvd_sim2_mex(x,y, lam, Nit,t)); % 0.0012 s
tmean(i)=t;
end
disp(mean(tmean)); % mean(t) in the original averaged a scalar
It seems the performances are all similar.
According to this new test the GPU is 3x faster than the CPU:
TVD=tvd_sim2(x,y, lam, Nit,t); % 0.003 s
and this again contradicts the performance inside fminunc.
Hi Hariprasad Ravishanka, thanks for the answer; this is my function:
function [TVD] = tvd_sim2_MEX(x,y, lam, Nit,t) %#codegen
coder.gpu.kernelfun
[n,m]=size(y);
diffxx=x(2:n,1)-x(1:n-1,1);
TVD=1/2.*sum(abs(((y-x)./((double(abs(y)-t>0).*y./t)+...
double(~(double(abs(y)-t>0).*y./t))).^2).^2)) + ...
lam.*sum(abs(diffxx(2:n-1,1)-diffxx(1:n-2,1)));
end
It's a modified version of total variation denoising: the original formula without numerical approximation, plus some constraints I added to keep fminunc from converging to degenerate solutions. So it must be minimized.
"I also noticed that cfg.GpuConfig.EnableMemoryManager was not turned on. Is there a reason for this?"
I tried setting it to true earlier, but there's no difference between the two ways.
How was this timing obtained:
TVD=tvd_sim2(x,y, lam, Nit,t); % 0.003 s
Ideally, you would use timeit().
Emiliano Rosso
Emiliano Rosso on 8 Nov 2022
Edited: Emiliano Rosso on 8 Nov 2022
I already answered you.
Please see the comment above:
I didn't find a comment box to reply to you directly.
Thanks!
Matt J
Matt J on 8 Nov 2022
Edited: Matt J on 8 Nov 2022
Please see the upper comment:
I only see test code there for the GPU version. Was the CPU version timed with timeit()? Incidentally, there is no need to enclose gputimeit() or timeit() within loops; these functions already run their own loops internally.
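In other words, a single call per variant suffices; a sketch using the question's workspace variables:

```matlab
% timeit/gputimeit already average over many internal runs,
% so no outer for-loop is needed:
tcpu = timeit(@() tvd_sim2(x, y, lam, Nit, t));
tgpu = gputimeit(@() tvd_sim2_mex(x, y, lam, Nit, t));
fprintf('cpu: %.3g s, gpu-mex: %.3g s\n', tcpu, tgpu);
```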
Hi
The CPU version is timed using tic/toc; do you want me to test it using timeit() without the for loop?
Thanks!
Matt J
Matt J on 9 Nov 2022
Edited: Matt J on 9 Nov 2022
Yes. It would also be good to see the code that invokes fminunc, in particular what optimoptions are used.
Here the code :
function [xden,fval] = tvd_sim(y, lam, Nit,t)
rng default % For reproducibility
[n,m]=size(y);
y0=y;
ObjectiveFunction = @(y) tvd_sim2(y,y0,lam,Nit,t);
options = optimoptions('fminunc','MaxIter',50,'ObjectiveLimit',0,'MaxFunEvals',...
Inf,'TolFun',1e-06,'UseParallel',false);
[xden,fval] = fminunc(ObjectiveFunction,y,options);
end
function [TVD] = tvd_sim2(x,y, lam, Nit,t) %#codegen
coder.gpu.kernelfun
[n,m]=size(y); % x Nx1 columnwise is denoised
diffxx=x(2:n,1)-x(1:n-1,1);
TVD=1/2.*sum(abs(((y-x)./((double(abs(y)-t>0).*y./t)+double(~(double(abs(y)-t>0)...
.*y./t))).^2).^2)) + lam.*sum(abs(diffxx(2:n-1,1)-diffxx(1:n-2,1)));
end
and this is the CPU timeit():
t=zeros(1,1);
for i=1:1000
t=timeit(@()tvd_sim2(x,y, lam, Nit,t)); % 0.0014 s
tmean(i)=t;
end
disp(mean(tmean)); % mean(t) in the original averaged a scalar
1.1071e-04
compared to 0.0012:
ratio gpu/cpu = 0.0012 / 1.1071e-04 = 10.83
The GPU is ~10x slower than the CPU.
That's what I'm discovering here:
This is a big mistake!
So the only problem was tic toc which gives me a wrong illusion but really gpu performance is naturally slower than cpu?
So I would have solved the mystery simply by dissolving an illusion?


More Answers (2)

Ram Kokku
Ram Kokku on 8 Nov 2022
Edited: Walter Roberson on 8 Nov 2022
As my colleague Hariprasad mentioned, GPU Coder is capable of:
  1. Allocating memory once and reusing it for subsequent calls. Use cfg.GpuConfig.EnableMemoryManager = true; to enable this.
  2. Taking a MATLAB gpuArray as input. You are doing this already, but it may not always help: for example, if GPU Coder chooses to keep the first use of a particular input on the CPU (for some reason), it would incur an additional copy.
Further,
  1. you may use gpucoder.profile ( https://www.mathworks.com/help/gpucoder/ref/gpucoder.profile.html ) to find the bottlenecks.
  2. Use of cell arrays and structures may not play well with GPU Coder with regard to copies; consider breaking the cell-array elements into separate variables.
  3. Take a look at the generated code and see whether GPU Coder was able to parallelize the key pieces of your code.
  4. If you are open to sharing your code, I can take a quick look.

5 Comments

Hi, thanks for the answer.
I tried setting
cfg.GpuConfig.EnableMemoryManager = true;
earlier, but there's no difference between the two ways.
this is my function:
function [TVD] = tvd_sim2(x,y, lam, Nit,t) %#codegen
coder.gpu.kernelfun
[n,m]=size(y);
diffxx=x(2:n,1)-x(1:n-1,1);
TVD=1/2.*sum(abs(((y-x)./((double(abs(y)-t>0).*y./t)+...
double(~(double(abs(y)-t>0).*y./t))).^2).^2)) + ...
lam.*sum(abs(diffxx(2:n-1,1)-diffxx(1:n-2,1)));
end
It's a modified version of total variation denoising: the original formula without numerical approximation, plus some constraints I added to keep fminunc from converging to degenerate solutions. So it must be minimized.
As for all your other suggestions, they will certainly be very useful and I will use them in due course, but there is an underlying problem I have to solve before moving on to normal optimization. Apparently, under the same conditions, apart from being inside fminunc, the same function behaves differently and I can't understand why.
I notice you compute the same expression twice
double(abs(y)-t>0).*y./t
Hi, thanks for answer.
Yes, it's true. It happened when I tried to reduce the function to a single statement.
Earlier it was like this:
diffyx=y-x;
ycut=double(abs(y)-t>0).*y./t;
yzerolog=~ycut;
yzero=double(yzerolog);
ycut=ycut+yzero;
diffxx=x(2:n,1)-x(1:n-1,1);
diffxx=diffxx(2:n-1,1)-diffxx(1:n-2,1);
TVD=1/2.*sum(abs((diffyx./ycut.^2).^2)) + lam.*sum(abs(diffxx));
but there's no difference in time execution.
Thanks!
sum(abs((diffyx./ycut.^2).^2))
The abs has no effect, since the argument is already a square.
The explicit casts to double are not necessary either, but they probably don't do any harm.
To summarize, I would rather code it like this (Warning: not tested)
ycut=((abs(y)-t)>0).*y./t; % parenthesized to match double(abs(y)-t>0) in the original
ycut=ycut+~ycut;
diffxx=x(2:n,1)-x(1:n-1,1);
diffxx=diffxx(2:n-1,1)-diffxx(1:n-2,1);
TVD=1/2.*sum(((y-x)./ycut.^2).^2) + lam.*sum(abs(diffxx));
Emiliano Rosso
Emiliano Rosso on 8 Nov 2022
Edited: Emiliano Rosso on 8 Nov 2022
"abs has no effect."
Yes, it's true, I'll modify it!
I've seen your code now; I'll try it and verify!
Thanks!


Bruno Luong
Bruno Luong on 8 Nov 2022
Edited: Bruno Luong on 8 Nov 2022
Just shooting in the dark here: did you set the UseParallel option of fminunc to true or false? It could be that the gradient computation is efficient on the CPU but not on the GPU depending on this option.
Also, your objective function is not very differentiable, with all the logicals and abs; it could be that fminunc has a hard time optimizing it, and the time is more sensitive to numerical truncation, which differs between gpu-mex and cpu-matlab.
BTW the objective function is simple enough to compute analytic gradient.

11 Comments

Also your objective function is not very differentiable with all the logicals and abs; it could be that fminunc has a hard time optimizing it, and the time is more sensitive to numerical truncation, which differs between gpu-mex and cpu-matlab.
I don't think so. The profiling results say the objective function was called the exact same number of times in both cases, and there doesn't appear to be anything that would make the execution time per call depend on the input arguments.
Yes, I agree. I wrote it before looking at the screenshot more carefully.
Emiliano Rosso
Emiliano Rosso on 8 Nov 2022
Edited: Emiliano Rosso on 8 Nov 2022
I think you haven't grasped the essential problem yet. Many optimizations are possible on this code, starting from the use of the single type (x10) and the reduction of memcpy. The main problem for which I created this post is that the same function behaves differently inside fminunc, reversing the performance gain. Any of the suggestions you are giving me, and for which I thank you, would give the same results if applied in both cases... or not?
That's the reason for the title: A VERY STRANGE PROBLEM
Thanks!
Bruno Luong
Bruno Luong on 8 Nov 2022
Edited: Bruno Luong on 8 Nov 2022
Why do you say I "haven't grasped the essential problem yet"? I provided a theory of why the mex GPU code runs slower under fminunc despite being faster in the loop. Isn't that the explanation you wanted, or am I missing something?
BTW, have you profiled mex-cpu against mex-gpu and matlab-cpu?
And while you are at it, could you answer my previous question: "did you set the UseParallel option of fminunc to true or false? It could be that the gradient computation is efficient on CPU but not on GPU depending on this option."
@Emiliano Rosso Have you checked, though, that the results of the fminunc are the same for both implementations? Or at least, the same within reasonable floating point noise differences?
Emiliano Rosso
Emiliano Rosso on 8 Nov 2022
Edited: Emiliano Rosso on 8 Nov 2022
Bruno Luong,
ok, I tried:
gpumex fminunc option 'UseParallel',true : 265s
gpumex fminunc option 'UseParallel',false : 400s
cpu fminunc option 'UseParallel',true : 31s
cpu fminunc option 'UseParallel',false : 54s
mex-cpu is the same as cpu, probably because MATLAB's inner functions already use compiled code through its own optimization.
As you can see, the use of parallel can improve the gradient calculation, which is external to tvd_sim2_mex, as you told me.
But as you can see in the screenshots I attached to the main post, most of the time is spent in tvd_sim2_mex, not in the gradient calculation inside the fminunc algorithm. And this is why, even though the use of parallel improves the performance, you still see the big difference between GPU and CPU.
Or do I just not understand what you mean?
Matt J.
Yes, results are identical up to a very small approximation.
Thanks!
@Emiliano Rosso thanks for the test; UseParallel is not the culprit then.
Bruno Luong
Bruno Luong on 8 Nov 2022
Edited: Bruno Luong on 8 Nov 2022
"the most time spent is in tvd_sim2_mex, not in gradient calculation"
I think you still don't understand my theory (which of course may not be right).
The gradient calculation calls tvd_sim2_mex sequentially or in parallel, on top of the GPU parallelization, depending on the option. My theory is that there could be a traffic jam if the GPU memory is limited, the data is transferred over the bus, or the GPU is saturated in terms of the maximum number of threads it can handle, etc. This is different from your for-loop test, where the function is executed 1000 times but sequentially.
I don't believe you can look at the time reported by the profiler and be sure that the function execution itself slowed down, without taking into account the other activity going on at the same time (in this case the fminunc UseParallel option).
Ok, I think I finally understood. You mean that, apart from the profiler being not very reliable in this case, the function running on the GPU is slowed down by some external saturation process due to the simultaneous (parallel) call of many functions at the same time.
In the case of the CPU this would not happen because I have more RAM (16 GB) than GPU memory (4 GB).
Your theory is plausible, but it must be verified.
Is it possible to profile this with the GPU profiler? How can this situation be improved? How can I make the function run sequentially?
Thank you!
The documentation section "Measure and Improve GPU Performance" suggests using tic/toc in this way:
D = gpuDevice;
wait(D)
tic
[L,U] = lu(A);
wait(D)
toc
...if this can help...
"if this can help"
Certainly I'll remember to do the wait with my next tic/toc of GPU code.


Release

R2020b
