Workers during parfor-loop shutting down/going to sleep

10 views (last 30 days)

Dear Matlab-community,

I have an issue with parfor-loops which slowly but steadily drives me crazy. I am currently using the parallel-processing-toolbox for some downscaling experiments of geospatial data. Within the loop, I am loading time-series from two 3D-NetCDF-Files (each time-series represents one single pixel), do some calculations with the data and write the output to another NetCDF-File. The 3D-arrays have the dimensions 697x381x1462 (longitude x latitude x time). In total, I want to process around 150 000 pixels.

When I start to run the code, the parfor-loop runs perfectly. Even the first 50 000 pixels are done within several hours. However, after that, the workers (one after another) go to some sleep-mode until the whole parlor-loop does not iterate any more. When I do a restart (i.e. I keep the already processed pixels and continue with the first "unprocessed" pixel), the code again runs through the first pixels (around 1000), after which the workers are again shutting down or going to sleep.

The duration of each iteration does not change (please look at the attached figure). This holds also true for the I/O-processes (i.e. the reading and writing of the data). Please note that I do not put any data in the memory (i.e. there is no variable which gets bigger and bigger!). All processes within the loop are completely independent from other iterations (i.e. I simply pass a pixel-ID which is then processed).

Even after countless tries, I can not manage to run through the whole dataset, which is absolutely annoying...

The code within the parfor-loop is added below. I would appreciate any help and I am really looking forward to any comments.

Best regards and many thanks in advance, Christof

The image shows 30 of the workers (the different colors and symbols), the duration of each iteration (y-axis) and the actual time (x-axis). Obviously, the workers 1 - 11 seem to stop iterating after less than 8 minutes.

    parfor i = 1:length(ids)
       % Get the ID of the current worker
       t = getCurrentTask(); 
       % Transform the ID to row- and column-indices
       [rw, clm]  = ind2sub([nlat nlon], ids(i));
       % Synchronize all workers
       labBarrier
       % Add a small delay to each worker --> avoid read-clashes
       pause(t.ID/2);
       % Load data for cal/val
       x_cal = squeeze(ncread(fnme_x_cal, varnme, [clm, rw, 1], ...
                                                               [1 1 Inf]));
       y_cal = squeeze(ncread(fnme_y_cal, varnme, [clm, rw, 1], ...
                                                               [1 1 Inf]));
       x_val = squeeze(ncread(fnme_x_val, varnme, [clm, rw, 1], ...
                                                               [1 1 Inf]));
       % Check if the loaded vector contains real data; if this is not the
       % case, the function replaces the current pixels with missing values
       if ~all(isnan(x_val))
           % Try to execute copula merge; if copula_merge can not be executed, 
           % replace the current pixels with missing values
           try 
               [x_val_out, x_val_std, theta, cpla] = ...
                         copula_merge(x_cal, y_cal, x_val, merge_settings);
           catch
               x_val_out  = 1e+20*ones(length(tme_out), 1);
               x_val_std  = 1e+20*ones(length(tme_out), 1);
               val_ids    = 1e+20*ones(length(tme_out), 1);
               theta      = 1e+20;
               cpla       = 0;
           end
       else
           x_val_out  = 1e+20*ones(length(tme_out), 1);
           x_val_std  = 1e+20*ones(length(tme_out), 1);
           val_ids    = 1e+20*ones(length(tme_out), 1);
           theta      = 1e+20;
           cpla       = 1e+20;  
       end
       % Write the results to the output files
       ncwrite(fnme_out, varnme, x_val_out, [1, rw, clm]);
       ncwrite(fnme_std, [varnme, '_std'], x_val_out, [1, rw, clm]);
       ncwrite(fnme_cpla, 'Copula', cpla, [rw, clm]);
       ncwrite(fnme_theta, 'Theta', theta, [rw, clm]);
    end
  1 Comment
OCDER
OCDER on 29 Sep 2017
Edited: OCDER on 29 Sep 2017
When you open up the parpool via something like p = parpool, what does p.IdleTimeout return?
It could be a deadlock created when trying to access the same ncfiles for reading and writing, (eg. fnme_out, fnme_std, fnme_cpla, fnme_theta). Even though the calculations are independent, the resources are the same. https://en.wikipedia.org/wiki/Deadlock
Does the issue persist if you use a regular for loop instead?

Sign in to comment.

Answers (1)

Edric Ellis
Edric Ellis on 2 Oct 2017
Firstly, labBarrier has no effect inside a parfor loop. The lab* family of functions only operates within an spmd context.
Without a minimal reproduction, it's pretty difficult to tell what's going wrong. I would add disp statements to your code to try and work out what's happening on the workers when things go wrong.
As @Donald points out, you're trying to write to the same files from each worker. This could easily be a problem. You might consider restructuring to use spmd instead - this will let you ensure the workers don't attempt to access the files at the same time. Something like this:
spmd
loopUB = numlabs * ceil(length(ids)/numlabs);
for i = 1:numlabs:loopUB
if i <= length(ids)
% actually process the data
end
% Here we write the data out, one worker at a time
for writingLab = 1:numlabs
if labindex == writingLab && i <= length(ids)
% actually call 'ncwrite'
end
% synchronise to prevent concurrent access
labBarrier
end
end
end
  2 Comments
christof
christof on 4 Oct 2017
Dear Donald, dear Edric,
thank you very much for your suggestions. I finally have the feeling that I'm in the right place to solve this stupid issue ;)
First of all, I already set p.IdleTimeout to Inf, but that did not solve the problem.
I also have the impression that simultaneous I/O-processes from/to the same files are the real issue. Re-writing the parfor-loop in a sped-statement is indeed a very good idea! Thank you very much for the code. However, what I do not understand is how to tell the workers the ids that they have to process.
Let's assume that I'm running the code on a 10-core machine and I have a total of 100 000 pixels. Then, loopUB = numlabs * ceil(length(ids)/numlabs) = 10 * 100 000/10 = 100 000. The for-loop runs from i = 1 to loopUB (i.e. 100 000) in steps of 10 (--> 1, 11, 21, ...). By now, every worker get's the same indices. I would therefore have to add some getCurrentTask-statement:
spmd
loopUB = numlabs * ceil(length(ids)/numlabs);
for i = 1:numlabs:loopUB
if i <= length(ids)
t = getCurrentTask();
% Add the worker-number to i
i_act = i + t.ID
% process ids(i_act) ...
end
end
end
...or did I misunderstood something?
I am really happy that my issue might be not as unique that I initially thought. Many many thanks for your help and suggestions!
Best regards, Christof

Sign in to comment.

Categories

Find more on Parallel Computing Fundamentals in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!