PARFOR is 10X Slower than FOR

Paul Safier on 16 Jun 2022
Commented: Paul Safier on 21 Jun 2022
I'm trying to understand why my use of the parfor is so much slower than using a for loop.
Both parfor and for implementations require some IO, namely the reading of a png file that is 20 Kb (buildingDesign.png attached).
The function called (getSpace_HELP.m) takes this png file, converts it into a BW matrix and does several image processing operations on the matrix. The purpose is to calculate the average spacing between objects in an image.
The code shown here is a pared-down, simplified version that only includes the meat of the code. I'm using 36 workers on a Linux machine. Images below show the histogram of times for 1000 calls to getSpace_HELP.m.
Why would the parfor implementation be so much slower? Any suggestions on ways to speed it up? In actual use, the spacing routine could be called ~10^8 times, so this time difference is important.
ntot = 1000; % How many runs to test
% Create a directory to place copies of the file in. This is just to
% simulate the actual use of this code for troubleshooting.
mkdir tmp1
for jj = 1:ntot
    theName = ['./tmp1/',num2str(jj),'_clipped.png'];
    copyfile('./Files/buildingDesign.png',theName);
end
%
resultsMat = zeros([ntot 1]);
timeMat = zeros([ntot 1]);
tic
%parfor k = 1:ntot
for k = 1:ntot
    fileName = ['./tmp1/',num2str(k),'_clipped.png'];
    [singleResult,theTime] = getSpace_HELP(fileName);
    resultsMat(k) = singleResult;
    timeMat(k) = theTime;
end
histogram(timeMat), xlabel('Time (s)')
timeAll = toc;
disp(['Time to do all runs: ',num2str(timeAll)])
  6 Comments
Edric Ellis on 20 Jun 2022
Aha. I suspect that bwlookup can take advantage of MATLAB's intrinsic multi-threading. If that is the case, then your multithreaded desktop MATLAB process is already taking full advantage of all the cores on your system. The workers in a parallel pool run in single-threaded mode (by default). You can confirm this by either monitoring the processor utilisation of desktop MATLAB using top (or similar); or, you can force your desktop MATLAB into single-threaded mode for comparison purposes by using maxNumCompThreads(1).
Basically, any time your original for-loop code is dominated by stuff that is already multithreaded by MATLAB itself, there is no advantage to using a local parallel pool. You're already fully utilising your machine. In cases like this, you may see benefit from using parfor with multiple remote workers.
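A minimal sketch of the single-threaded comparison suggested above. The matrix product is only a stand-in workload (an assumption, not the getSpace_HELP code from the question); any built-in that MATLAB multithreads could be substituted. Setting maxNumCompThreads returns the previous limit, so the original setting can be restored afterwards.

```matlab
% Stand-in workload (assumption): a large matrix product, which MATLAB
% can multithread internally.
A = rand(1500);

nDefault = maxNumCompThreads;       % current computational thread limit

tMultiStart = tic;
B = A*A;                            % run with the default thread limit
tMulti = toc(tMultiStart);

prevLimit = maxNumCompThreads(1);   % force single-threaded; returns old limit

tSingleStart = tic;
B = A*A;                            % same work, one computational thread
tSingle = toc(tSingleStart);

maxNumCompThreads(prevLimit);       % restore the original limit

fprintf('Default (%d threads): %.3f s, single thread: %.3f s\n', ...
    nDefault, tMulti, tSingle);
```

If the two times are close, the workload is probably not benefiting much from intrinsic multithreading, and the thread count alone is unlikely to explain a large per-iteration difference.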
Paul Safier on 21 Jun 2022
Hi @Edric Ellis. I ran some tests using top and it seems like that's what's happening. The default maxNumCompThreads on my machine is 36.
Here is top when MATLAB is idle, i.e. I have not started the clipSpacing_HELP.m code. It looks like there are 151 threads available with just 1 running. Not sure why it's 151 specifically...
Here is top when running the code (with the for loop) with the default thread count of 36. It shows 37 threads are running with 113 sleeping.
Here is top when running the code (with the for loop) after calling maxNumCompThreads(1). It doesn't employ as many threads, but the running count is not exactly 1. Not sure why...
Here is top when running the code with the parfor loop. The total thread count goes up, but the running thread count is 1. I believe this is what you stated would happen.
So, I guess this test confirms what you suspected, namely that the for loop is utilizing multithreading, whereas the parfor loop is not. What's confusing me now is that a timing comparison of the for loop with 36 threads vs. 1 thread shows that the single-threaded run is actually a bit faster overall (for 1000 iterations), and the time per iteration is about the same as in the multithreaded run. If this is true, how can multithreading explain the original issue, which is that each iteration with a for loop is 10X faster than each iteration with a parfor loop?
Timing comparison of running the for loop with 36 vs 1 thread is below. Shouldn't the individual iteration time for the single thread test be much longer than for the 36 thread case if it's to explain the 10X delta in iteration times per the original post?


Accepted Answer

Raymond Norris on 16 Jun 2022
@Paul Safier I believe the problem is that you're calling nested tic/toc. In clipSpacing_HELP, you call tic on line 16 and toc on line 29. However, in the for-loop you call getSpace_HELP, which calls tic on line 6. This becomes the new start time for the call to toc on line 29 in clipSpacing_HELP.
Conversely, when you call getSpace_HELP in a parfor, since the call to tic happens in another worker, the call to toc in clipSpacing_HELP isn't aware of it, so it still uses the tic on line 16 (which is what you really want the for-loop to do as well).
The solution is to link the tic/toc together with a variable, as such (I'm using t0).
t0 = tic;
%parfor k = 1:ntot
for k = 1:ntot
    fileName = ['./tmp1/',num2str(k),'_clipped.png'];
    [singleResult,theTime] = getSpace_HELP(fileName);
    resultsMat(k) = singleResult;
    timeMat(k) = theTime;
end
histogram(timeMat), xlabel('Time (s)')
timeAll = toc(t0);
This way, the call to tic/toc in getSpace_HELP doesn't reset the toc being assigned to timeAll. Make this change and rerun it to see if that gives a more accurate run.
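The effect can be reproduced in isolation. The following is a small sketch with made-up pause durations (the variable names are illustrative only): a bare toc measures from the most recent bare tic, while toc with a handle measures from the tic that produced that handle.

```matlab
tOuter = tic;               % outer timer, tied to the handle tOuter
pause(0.2);                 % stand-in for work done before the inner call

tic;                        % inner bare tic, as if inside a called function
pause(0.1);                 % stand-in for the inner function's work

elapsedBare  = toc;         % measured from the inner tic only
elapsedOuter = toc(tOuter); % measured from the outer tic, unaffected by the inner one

fprintf('bare toc: %.2f s, toc(tOuter): %.2f s\n', elapsedBare, elapsedOuter);
```

Here elapsedBare covers only the second pause, whereas elapsedOuter covers both, which is why the unlinked toc in the for-loop version under-reported the total time.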
  2 Comments
Paul Safier on 16 Jun 2022
Hi @Raymond Norris . Good catch. Thanks for seeing that. It seems that error messed up the final time estimate. The time of each call of the getSpace_HELP function was correct (see image). However, fixing the error, as you suggested, revealed that overall, the parfor solution for 1000 iterations takes about 3.8X less time than the serial way (using 36 workers), despite the fact that each individual call to the getSpace_HELP function takes 10X longer with the parfor loop because of the overhead.
I guess the speed up might be ~ (# of workers)/10?
Paul Safier on 16 Jun 2022
@Raymond Norris and @Steven Lord Is the 10X individual iteration slowdown because of the parfor overhead an insurmountable hit? It's just hard to swallow that conclusion since the workers don't have to communicate with each other and it would seem they only have to do what a serial worker has to do, and can do in a tenth of the time...


More Answers (1)

Steven Lord on 16 Jun 2022
Have you tried using the parallel profiler to determine what percentage of the time taken by the parfor code is spent on the actual computations and how much on overhead? You could try comparing those parallel profiling results with the results of running the for loop version of the code in the MATLAB Profiler.
In order for your code to gain time when run in parallel, the amount of time you spend in the parallel setup and other overhead must be less than the amount of time that you save by running the iterations in parallel. If your overhead is high and/or the amount of time you save is small, your parfor loop could very well take more time to run than your for loop.
Think of grocery shopping with kids. If sending them over to the cereal aisle saves you two minutes but you have to spend five minutes searching the store to find them afterwards (eventually finding them in the candy or snacks aisle) you would be better off just going to the cereal aisle yourself.
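A rough way to see the overhead trade-off described above is to time the same cheap iteration body in a for loop and in a parfor loop, each with its own tic handle. This is only a sketch: the svd call is an arbitrary stand-in workload, and the loop size is an assumption. If no pool is open, the parfor timing may also include pool start-up, so it is fairer to have the pool running beforehand. Without Parallel Computing Toolbox, parfor simply runs serially.

```matlab
n = 200;                          % number of iterations (arbitrary choice)

% Serial version
outSerial = zeros(n, 1);
tSerialStart = tic;
for k = 1:n
    outSerial(k) = sum(svd(rand(60)));   % stand-in for the real per-iteration work
end
tSerial = toc(tSerialStart);

% Parallel version of the same loop
outPar = zeros(n, 1);
tParStart = tic;
parfor k = 1:n
    outPar(k) = sum(svd(rand(60)));      % same stand-in work on the workers
end
tPar = toc(tParStart);

fprintf('for: %.3f s, parfor: %.3f s, ratio for/parfor: %.2f\n', ...
    tSerial, tPar, tSerial/tPar);
```

A ratio below 1 means the overhead outweighed the time saved for this workload; making each iteration more expensive or increasing n shifts the balance.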
  3 Comments
Torsten on 16 Jun 2022
Just out of curiosity:
What if you remove the settings
resultsMat(k) = singleResult;
timeMat(k) = theTime;
in the parfor loop ?
Paul Safier on 16 Jun 2022
@Torsten I just did and profiled it. No noticeable change in overall time or in the proportion that's spent in overhead (i.e. remoteParallelFunction)...


Release

R2022a
