Improving performance of parallel code sections

Consider the following test code which benchmarks different methods of parallelization within MATLAB:
function partest()
    delete(gcp("nocreate"));
    pool = parpool("threads", 3);
    njob = pool.NumWorkers;
    ndim = 10000000;
    state = rand(ndim, njob);
    dummy = 0;
    % parfor
    tic
    func = zeros(njob, 1);
    parfor ijob = 1 : njob
        func(ijob) = getFunc(state(:, ijob));
    end
    disp(" parfor = " + toc + " seconds.")
    dummy = dummy + sum(func);
    % spmd
    tic
    func = zeros(njob, 1);
    spmd
        indx = spmdIndex;
        func(indx) = getFunc(state(:, indx));
    end
    disp(" spmd = " + toc + " seconds.")
    %dummy = dummy + sum([func{:}]);
    % parfeval
    tic
    clear func
    fout(1 : njob) = parallel.FevalFuture;
    for ijob = 1 : njob
        fout(ijob) = parfeval(pool, @getFunc, 1, state(:, ijob));
    end
    func = fetchOutputs(fout);
    disp("parfeval = " + toc + " seconds.")
    dummy = dummy + sum(func);
    % serial
    tic
    func = zeros(njob, 1);
    for ijob = 1 : njob
        func(ijob) = getFunc(state(:, ijob));
    end
    disp(" serial = " + toc + " seconds.")
    dummy = dummy + sum(func);
    disp(dummy)
    delete(gcp("nocreate"));
end
function func = getFunc(state)
    %func = -sum(state.^2);
    func = 0;
    for idim = 1 : length(state)
        func = func - state(idim)^2;
    end
end
Here is the benchmark output:
parfor = 0.092802 seconds.
spmd = 0.044583 seconds.
parfeval = 0.032457 seconds.
serial = 0.050159 seconds.
Firstly, why is parfeval outperforming all others?
Secondly, is there anything that could be done to any of these parallel constructs to improve their performance against the last (serial) case?

2 Comments

Can I ask
  • What OS are you using?
  • Which release of MATLAB?
  • How many physical cores do you have? What is the result of maxNumCompThreads?
WSL, 2023b, 12 per cpu, 12


Answers (1)

Matt J on 31 Jan 2024
Edited: Matt J on 31 Jan 2024
The task is too small for parallelization to have any meaningful effect. Similarly, the relative performance numbers for the different methods are not meaningful. You need a task that takes at least 30 seconds serially for any of the comparisons to make sense.
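As a rough illustration of how short timings like these can be made less sensitive to one-off overheads, the sketch below runs one untimed warm-up call and then keeps the best of several timed repetitions. `timeBest` and its arguments are hypothetical names that do not appear in the original benchmark, and whether first-call overhead contributes to the numbers above is only an assumption.

```matlab
function tbest = timeBest(fun, nrep)
% timeBest  Smallest wall-clock time over nrep calls of fun (hypothetical helper).
%   fun  : function handle taking no inputs
%   nrep : number of timed repetitions
    fun();                          % warm-up call, deliberately not timed
    tbest = inf;
    for irep = 1 : nrep
        timer = tic;                % private timer, independent of a global tic
        fun();
        tbest = min(tbest, toc(timer));
    end
end
```

It could be called as `timeBest(@() sum(rand(1e6, 1).^2), 5)`, with each parallel construct wrapped in its own function handle.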

9 Comments

A.B. on 1 Feb 2024
Edited: A.B. on 1 Feb 2024
Thank you. But the results are consistent across different runs. For very large ndim values, I observe that parfeval and spmd become comparable. Still, I do not understand why parfor would perform worse than the other two. I had read elsewhere that parfor would be the fastest of all.
Here are the results with `ndim = 10^8`:
Starting parallel pool (parpool) using the 'threads' profile ...
Connected to parallel pool with 12 workers.
parfor = 2.7297 seconds.
spmd = 0.77624 seconds.
parfeval = 0.72941 seconds.
serial = 1.6395 seconds.
and with 2 cores:
Starting parallel pool (parpool) using the 'threads' profile ...
Connected to parallel pool with 2 workers.
parfor = 0.53685 seconds.
spmd = 0.18644 seconds.
parfeval = 0.20128 seconds.
serial = 0.27341 seconds.
An even stranger phenomenon is that the serial execution time depends on the size of the pool. So something unholy is likely happening here.
"An even stranger phenomenon is that the serial execution time depends on the size of the pool."
That part at least is not strange. As the size of the pool grows, your state array grows, so there is more work to do.
I think part of what you see is that there is only 1 loop iteration per worker. If you create a situation where spmd must use a for-drange loop, with multiple loop iterations per worker, you might see better relative performance of parfor.
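A minimal sketch of such a test, with several tasks per worker, might look like the following. It replaces the for-drange loop with a manual strided partition based on `spmdIndex` and `spmdSize`, which serves the same purpose here. `manyTasksPerWorker` is a hypothetical name, `getFunc` is the vectorized variant commented out in the question, and the code assumes that `pool` is the currently open pool and that spmd runs on all of its workers.

```matlab
function [funcParfor, funcSpmd] = manyTasksPerWorker(pool, state)
% Evaluate getFunc on every column of state with parfor and with spmd,
% where the number of columns may exceed the number of workers.
% Assumption: pool is the current pool and spmd uses all of its workers.
    ntask = size(state, 2);
    nworker = pool.NumWorkers;

    % parfor: MATLAB distributes the ntask iterations among the workers.
    funcParfor = zeros(ntask, 1);
    parfor itask = 1 : ntask
        funcParfor(itask) = getFunc(state(:, itask));
    end

    % spmd: each worker handles a strided subset of the columns.
    spmd
        mine = spmdIndex : spmdSize : ntask;    % task indices of this worker
        part = zeros(numel(mine), 1);
        for k = 1 : numel(mine)
            part(k) = getFunc(state(:, mine(k)));
        end
    end

    % mine and part are Composite objects on the client; reassemble them.
    funcSpmd = zeros(ntask, 1);
    for w = 1 : nworker
        funcSpmd(mine{w}) = part{w};
    end
end

function func = getFunc(state)
    func = -sum(state.^2);
end
```

Wrapping each half in its own tic/toc pair and calling the function with, for example, ten columns per worker would show whether the relative ranking of parfor and spmd changes.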
You are right. I missed the growing array size in the serial code. I ended up using parfor in the production code because its performance penalty compared to parfeval appears negligible for significant workloads (> 0.1 s). Also, unlike with parfor, I do not know how to measure the cost of individual function calls when using parfeval, which is another important reason to prefer parfor in my application. Is there a way to measure it for parfeval as well?
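One possible way to get per-call timings with parfeval, assuming that timing inside the worker is acceptable, is to wrap the target function so that it returns its own elapsed time as a second output. `timedParfeval` and `timedGetFunc` are hypothetical helpers introduced only for this sketch, and `getFunc` is the vectorized variant commented out in the question.

```matlab
function [func, times] = timedParfeval(pool, state)
% Run getFunc on each column of state via parfeval and return, for each
% call, both the function value and the time spent inside the call.
    njob = size(state, 2);
    fout(1 : njob) = parallel.FevalFuture;
    for ijob = 1 : njob
        % Request two outputs: the value and the elapsed time.
        fout(ijob) = parfeval(pool, @timedGetFunc, 2, state(:, ijob));
    end
    % fetchOutputs concatenates the scalar outputs into njob-by-1 vectors.
    [func, times] = fetchOutputs(fout);
end

function [func, elapsed] = timedGetFunc(state)
    timer = tic;                    % started on the worker
    func = getFunc(state);
    elapsed = toc(timer);           % excludes scheduling and transfer time
end

function func = getFunc(state)
    func = -sum(state.^2);
end
```

The times measured this way cover only the body of the call on the worker, so the difference between their sum and the client-side tic/toc gives a rough idea of the scheduling overhead.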
Matt J on 1 Feb 2024
Edited: Matt J on 1 Feb 2024
You are right.
Really? Then your problem is solved?
Not quite. That was my response to the strangeness issue I mentioned and that you dismissed as normal. The real issue that prompted this post is the performance difference between the constructs, which of course diminishes with workload size. I just moved on with what is pragmatic. I would appreciate your explanation of the phenomenon if you have one.
Matt J on 1 Feb 2024
Edited: Matt J on 1 Feb 2024
What about my earlier comment about having more than 1 loop iteration per worker? It makes intuitive sense to me that parfeval would be optimal when you have only 1 function call per worker. Otherwise, if parfor were optimal for everything, why would they provide parfeval?


Release

R2023b

Asked:

on 31 Jan 2024

Edited:

on 1 Feb 2024
