Improving performance of parallel code sections

Question

1 vote

Consider the following test code which benchmarks different methods of parallelization within MATLAB:

function partest()
    
    delete(gcp("nocreate"));
    pool = parpool("threads", 3);
    njob = pool.NumWorkers;
    ndim = 10000000;
    state = rand(ndim, njob);
    dummy = 0;
    
    % parfor
    tic
    func = zeros(njob, 1);
    parfor ijob = 1 : njob
        func(ijob) = getFunc(state(:, ijob));
    end
    disp("  parfor = " + toc + " seconds.")
    dummy = dummy + sum(func);
    
    % spmd
    tic
    func = zeros(njob, 1);
    spmd
        indx = spmdIndex;
        func(indx) = getFunc(state(:, indx));
    end
    disp("    spmd = " + toc + " seconds.")
    %dummy = dummy + sum([func{:}]);
    
    % parfeval
    tic
    clear func
    fout(1 : njob) = parallel.FevalFuture;
    for ijob = 1 : njob
        fout(ijob) = parfeval(pool, @getFunc, 1, state(:, ijob));
    end
    func = fetchOutputs(fout);
    disp("parfeval = " + toc + " seconds.")
    dummy = dummy + sum(func);
    
    % serial
    tic
    func = zeros(njob, 1);
    for ijob = 1 : njob
        func(ijob) = getFunc(state(:, ijob));
    end
    disp("  serial = " + toc + " seconds.")
    dummy = dummy + sum(func);
    disp(dummy)
    delete(gcp("nocreate"));
end
function func = getFunc(state)
    %func = -sum(state.^2);
    func = 0;
    for idim = 1 : length(state)
        func = func - state(idim)^2;
    end
end

Here is the benchmark output:

  parfor = 0.092802 seconds.
    spmd = 0.044583 seconds.
parfeval = 0.032457 seconds.
  serial = 0.050159 seconds.

Firstly, why is parfeval outperforming all others?

Secondly, is there anything that could be done to any of these parallel constructs to improve their performance against the last (serial) case?

2 Comments

Edric Ellis on 1 Feb 2024

Can I ask:

What OS are you using?
Which release of MATLAB?
How many physical cores do you have, and what is the result of maxNumCompThreads?

A.B. on 1 Feb 2024

WSL, R2023b, 12 cores per CPU, and maxNumCompThreads returns 12.

Answer 1

Matt J on 31 Jan 2024

Edited: Matt J on 31 Jan 2024

0 votes

The task is too small for parallelization to have any meaningful effect. Similarly, the relative performance numbers for the different methods are not meaningful. You need a task that is at least 30 seconds long serially for any of the comparisons to make any sense.
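To illustrate the point about workload size, here is a minimal sketch (not part of the original answer) that lengthens each task and runs a warm-up pass before timing. The names scaledBenchmark and heavyFunc, and the values of ntask, ndim and nrep, are assumptions chosen for illustration; tune them until the serial loop runs long enough on your machine.

```matlab
function [tSerial, tPar, maxDiff] = scaledBenchmark()
    % Hypothetical benchmark: compares a serial loop with parfor on a
    % workload large enough to dominate the fixed parallel overhead.
    pool = gcp("nocreate");
    if isempty(pool)
        pool = parpool("threads");
    end
    ntask = 4 * pool.NumWorkers;   % assumption: several tasks per worker
    ndim  = 1e6;                   % assumption: vector length per task
    nrep  = 50;                    % assumption: repeats that lengthen each task

    % Warm-up pass so first-call costs are not counted in the timings.
    parfor itask = 1 : ntask
        heavyFunc(rand(10, 1), 1);
    end

    state = rand(ndim, ntask);

    % Serial reference.
    tic
    funcSerial = zeros(ntask, 1);
    for itask = 1 : ntask
        funcSerial(itask) = heavyFunc(state(:, itask), nrep);
    end
    tSerial = toc;

    % Same work distributed with parfor.
    tic
    funcPar = zeros(ntask, 1);
    parfor itask = 1 : ntask
        funcPar(itask) = heavyFunc(state(:, itask), nrep);
    end
    tPar = toc;

    maxDiff = max(abs(funcSerial - funcPar));
    disp("  serial = " + tSerial + " seconds.")
    disp("  parfor = " + tPar + " seconds.")
    disp(" speedup = " + tSerial / tPar)
end

function func = heavyFunc(state, nrep)
    % Same kernel as getFunc in the question, repeated nrep times.
    func = 0;
    for irep = 1 : nrep
        for idim = 1 : length(state)
            func = func - state(idim)^2;
        end
    end
end
```

With timings in the range of tens of milliseconds, as in the question, run-to-run noise and one-off startup costs can easily exceed the differences being measured, which is why the sketch scales the work rather than the number of constructs compared.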

9 Comments

A.B. on 1 Feb 2024

No, virtually. That was the response to the strangeness issue I mentioned and you disregarded as normal. The real issue which prompted this post is the performance difference between the constructs, which of course diminishes with workload size. I just moved on with what's pragmatic. I'd appreciate your explanation of the phenomenon if you have any.

Matt J on 1 Feb 2024

Edited: Matt J on 1 Feb 2024

What about my earlier comment about having more than 1 loop iteration per worker? It makes intuitive sense to me that parfeval would be optimal when you have only 1 function call per worker. Otherwise, if parfor were optimal for everything, why would they provide parfeval?
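One way to test this suggestion is to give each worker many iterations and time parfor against parfeval on the same tasks. The following is a rough sketch under stated assumptions, not a definitive benchmark: the function name manyTasksPerWorker and the values of ntask and ndim are hypothetical, and getFunc is copied from the question.

```matlab
function [tParfor, tParfeval, maxDiff] = manyTasksPerWorker()
    % Hypothetical comparison with many tasks per worker.
    pool = gcp("nocreate");
    if isempty(pool)
        pool = parpool("threads");
    end
    ntask = 20 * pool.NumWorkers;  % assumption: 20 tasks per worker
    ndim  = 1e5;                   % assumption: vector length per task
    state = rand(ndim, ntask);

    % parfor: MATLAB groups the iterations and hands the groups to workers.
    tic
    funcParfor = zeros(ntask, 1);
    parfor itask = 1 : ntask
        funcParfor(itask) = getFunc(state(:, itask));
    end
    tParfor = toc;

    % parfeval: one future is created and scheduled per task.
    tic
    fout(1 : ntask) = parallel.FevalFuture;
    for itask = 1 : ntask
        fout(itask) = parfeval(pool, @getFunc, 1, state(:, itask));
    end
    funcParfeval = fetchOutputs(fout);
    tParfeval = toc;

    % Both constructs should return the same values.
    maxDiff = max(abs(funcParfor - funcParfeval));
    disp("  parfor = " + tParfor + " seconds.")
    disp("parfeval = " + tParfeval + " seconds.")
end

function func = getFunc(state)
    % Kernel from the question: negative sum of squares, written as a loop.
    func = 0;
    for idim = 1 : length(state)
        func = func - state(idim)^2;
    end
end
```

Varying the factor in ntask from 1 upward should show whether the ranking seen in the question holds once each worker has more than one call to process.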
