Workers during parfor-loop shutting down/going to sleep

Question

christof on 29 Sep 2017

0
Link

Direct link to this question

https://uk.mathworks.com/matlabcentral/answers/358995-workers-during-parfor-loop-shutting-down-going-to-sleep

Commented: christof on 4 Oct 2017

Dear Matlab-community,

I have an issue with parfor-loops which slowly but steadily drives me crazy. I am currently using the parallel-processing-toolbox for some downscaling experiments of geospatial data. Within the loop, I am loading time-series from two 3D-NetCDF-Files (each time-series represents one single pixel), do some calculations with the data and write the output to another NetCDF-File. The 3D-arrays have the dimensions 697x381x1462 (longitude x latitude x time). In total, I want to process around 150 000 pixels.

When I start to run the code, the parfor-loop runs perfectly. Even the first 50 000 pixels are done within several hours. However, after that, the workers (one after another) go to some sleep-mode until the whole parlor-loop does not iterate any more. When I do a restart (i.e. I keep the already processed pixels and continue with the first "unprocessed" pixel), the code again runs through the first pixels (around 1000), after which the workers are again shutting down or going to sleep.

The duration of each iteration does not change (please look at the attached figure). This holds also true for the I/O-processes (i.e. the reading and writing of the data). Please note that I do not put any data in the memory (i.e. there is no variable which gets bigger and bigger!). All processes within the loop are completely independent from other iterations (i.e. I simply pass a pixel-ID which is then processed).

Even after countless tries, I can not manage to run through the whole dataset, which is absolutely annoying...

The code within the parfor-loop is added below. I would appreciate any help and I am really looking forward to any comments.

Best regards and many thanks in advance, Christof

The image shows 30 of the workers (the different colors and symbols), the duration of each iteration (y-axis) and the actual time (x-axis). Obviously, the workers 1 - 11 seem to stop iterating after less than 8 minutes.

    parfor i = 1:length(ids)

       % Get the ID of the current worker
       t = getCurrentTask();

       % Transform the ID to row- and column-indices
       [rw, clm]  = ind2sub([nlat nlon], ids(i));

       % Synchronize all workers
       labBarrier

       % Add a small delay to each worker --> avoid read-clashes
       pause(t.ID/2);

       % Load data for cal/val
       x_cal = squeeze(ncread(fnme_x_cal, varnme, [clm, rw, 1], ...
                                                               [1 1 Inf]));
       y_cal = squeeze(ncread(fnme_y_cal, varnme, [clm, rw, 1], ...
                                                               [1 1 Inf]));
       x_val = squeeze(ncread(fnme_x_val, varnme, [clm, rw, 1], ...
                                                               [1 1 Inf]));

       % Check if the loaded vector contains real data; if this is not the
       % case, the function replaces the current pixels with missing values
       if ~all(isnan(x_val))
           % Try to execute copula merge; if copula_merge can not be executed, 
           % replace the current pixels with missing values
           try 
               [x_val_out, x_val_std, theta, cpla] = ...
                         copula_merge(x_cal, y_cal, x_val, merge_settings);
           catch
               x_val_out  = 1e+20*ones(length(tme_out), 1);
               x_val_std  = 1e+20*ones(length(tme_out), 1);
               val_ids    = 1e+20*ones(length(tme_out), 1);
               theta      = 1e+20;
               cpla       = 0;
           end
       else
           x_val_out  = 1e+20*ones(length(tme_out), 1);
           x_val_std  = 1e+20*ones(length(tme_out), 1);
           val_ids    = 1e+20*ones(length(tme_out), 1);
           theta      = 1e+20;
           cpla       = 1e+20;  
       end

       % Write the results to the output files
       ncwrite(fnme_out, varnme, x_val_out, [1, rw, clm]);
       ncwrite(fnme_std, [varnme, '_std'], x_val_out, [1, rw, clm]);

       ncwrite(fnme_cpla, 'Copula', cpla, [rw, clm]);
       ncwrite(fnme_theta, 'Theta', theta, [rw, clm]);

end

1 Comment
Show -1 older commentsHide -1 older comments

OCDER on 29 Sep 2017

Edited: OCDER on 29 Sep 2017

When you open up the parpool via something like p = parpool, what does p.IdleTimeout return?

It could be a deadlock created when trying to access the same ncfiles for reading and writing, (eg. fnme_out, fnme_std, fnme_cpla, fnme_theta). Even though the calculations are independent, the resources are the same. https://en.wikipedia.org/wiki/Deadlock

Does the issue persist if you use a regular for loop instead?

Sign in to comment.

Sign in to answer this question.

Answer 1

Edric Ellis on 2 Oct 2017

0
Link

Direct link to this answer

https://uk.mathworks.com/matlabcentral/answers/358995-workers-during-parfor-loop-shutting-down-going-to-sleep#answer_283925

Open in MATLAB Online

Firstly, labBarrier has no effect inside a parfor loop. The lab* family of functions only operates within an spmd context.

Without a minimal reproduction, it's pretty difficult to tell what's going wrong. I would add disp statements to your code to try and work out what's happening on the workers when things go wrong.

As @Donald points out, you're trying to write to the same files from each worker. This could easily be a problem. You might consider restructuring to use spmd instead - this will let you ensure the workers don't attempt to access the files at the same time. Something like this:

spmd
  loopUB = numlabs * ceil(length(ids)/numlabs);
  for i = 1:numlabs:loopUB
    if i <= length(ids)
      % actually process the data
    end
    % Here we write the data out, one worker at a time
    for writingLab = 1:numlabs
      if labindex == writingLab && i <= length(ids)
        % actually call 'ncwrite'
      end
      % synchronise to prevent concurrent access
      labBarrier
    end
  end
end

2 Comments
Show NoneHide None

christof on 4 Oct 2017

Open in MATLAB Online

Dear Donald, dear Edric,

thank you very much for your suggestions. I finally have the feeling that I'm in the right place to solve this stupid issue ;)

First of all, I already set p.IdleTimeout to Inf, but that did not solve the problem.

I also have the impression that simultaneous I/O-processes from/to the same files are the real issue. Re-writing the parfor-loop in a sped-statement is indeed a very good idea! Thank you very much for the code. However, what I do not understand is how to tell the workers the ids that they have to process.

Let's assume that I'm running the code on a 10-core machine and I have a total of 100 000 pixels. Then, loopUB = numlabs * ceil(length(ids)/numlabs) = 10 * 100 000/10 = 100 000. The for-loop runs from i = 1 to loopUB (i.e. 100 000) in steps of 10 (--> 1, 11, 21, ...). By now, every worker get's the same indices. I would therefore have to add some getCurrentTask-statement:

 spmd
    loopUB = numlabs * ceil(length(ids)/numlabs);
    for i = 1:numlabs:loopUB
        if i <= length(ids)
            t = getCurrentTask();
            % Add the worker-number to i
            i_act = i + t.ID
            % process ids(i_act) ...       
        end
    end
 end

...or did I misunderstood something?

I am really happy that my issue might be not as unique that I initially thought. Many many thanks for your help and suggestions!

Best regards, Christof

christof on 4 Oct 2017

Open in MATLAB Online

Sorry, it should be

i_act = i - 1 + t.ID

Best regards,

Christof

Sign in to comment.

Workers during parfor-loop shutting down/going to sleep

1 Comment
Show -1 older commentsHide -1 older comments

Answers (1)

2 Comments
Show NoneHide None

See Also

Categories

Tags

Community Treasure Hunt

Workers during parfor-loop shutting down/going to sleep

1 Comment Show -1 older commentsHide -1 older comments

Answers (1)

2 Comments Show NoneHide None

See Also

Categories

Tags

Community Treasure Hunt

1 Comment
Show -1 older commentsHide -1 older comments

2 Comments
Show NoneHide None