How can I write a parallel loop in MATLAB without the workers getting aborted during execution?

I am facing the issue of workers getting aborted while my code runs in parallel using parfor. It may be due to a non-uniform distribution of the workload across the workers, so that a few workers sit idle once the work allocated to them is finished. I tried setting the idle timeout to infinity and setting spmd to false, but it didn't help. The error I am getting is shown below:
[Warning: A worker aborted during the execution of the parfor loop. The parfor loop
will now run again on the remaining workers.]
[> In distcomp/remoteparfor/handleIntervalErrorResult (line 240)
In distcomp/remoteparfor/getCompleteIntervals (line 387)
In parallel_function>distributed_execution (line 745)
In parallel_function (line 577)
In generateantenna_dscale (line 37)]
Can anyone suggest how to write my parallel loop so that the code is executed uniformly on all the workers without them getting disconnected during execution? How should I write parallel code in MATLAB when the iterations are independent and the time each one takes depends on a random number generated inside the parfor loop? A code snippet would be a great help. Thanks in advance.

Accepted Answer

Raymond Norris on 24 Sep 2021
The idle timeout applies to the entire pool, not to a single worker. My guess is that a worker is crashing because it runs out of memory. Tell us a bit more:
  • the scheduler you're using (local, MJS, generic (e.g. PBS))
  • number of cores per node
  • RAM per node
  • size of the pool you're running
  • size of data being sent back and forth
Raymond Norris on 25 Sep 2021
With 192 GB/node and 40 cores, each worker will be allocated roughly 4-5GB. The recommendation is to allocate 4 GB for MATLAB, without knowing much about the work you're doing. Since you're running "workers" instead of full-blown "matlab", let's scale the 4 GB recommendation to 2 GB. That leaves you with about 100 GB for all 40 workers. Does that give you enough memory to do the job?
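The arithmetic behind those figures can be sketched as follows. This is only a rough budget using the numbers from this thread; the variable names are illustrative, and reading "about 100 GB" as what remains after each worker takes its 2 GB is an assumption.

```matlab
% Back-of-the-envelope memory budget for one node (figures from this thread).
nodeRamGB   = 192;   % RAM per node
numWorkers  = 40;    % one worker per core
perWorkerGB = 2;     % scaled-down allowance per worker process

evenShareGB = nodeRamGB / numWorkers;     % RAM per worker if split evenly
baselineGB  = numWorkers * perWorkerGB;   % RAM taken by the workers themselves
remainingGB = nodeRamGB - baselineGB;     % RAM left for the data being processed

fprintf('Even share per worker: %.1f GB\n', evenShareGB);
fprintf('Workers alone use:     %d GB\n', baselineGB);
fprintf('Left for data:         %d GB\n', remainingGB);
```

If the data held by all workers at once is likely to exceed the last figure, a worker running out of memory is a plausible cause of the abort.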
To troubleshoot this a bit:
  • Use a system tool (e.g. top on Linux or Task Manager on Windows) to monitor your RAM as you run your local job. This ought to tell you if you're exceeding your memory limits.
  • I'm going to assume you use all 40 workers, and although your work might be time intensive, I'm betting it is more memory intensive. My suggestion would be to scale back the 40 workers to, say, 20. This gives each worker twice the amount of memory to work with on the single node. It may take a bit longer than you'd like, but the parfor-loop might also run to completion.
  • ticBytes/tocBytes won't show memory consumption in the parfor, but it will at least help see how much is getting passed back and forth. Initially, scale back the size of the data in your code. Somewhere, either inside or outside of the parfor, you're generating the data. For example
data = rand(100000);
data = fread(fid);
data = ...
  • Find ways to scale back how much you're reading, generating, etc. until you get to a manageable size. If you then find you can only process, say, 1/3 of what you need, start thinking about adding more nodes (and therefore access to more memory) and scaling with MATLAB Parallel Server across those nodes.
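The troubleshooting steps above could be tried with something like the sketch below. It assumes Parallel Computing Toolbox is available; the iteration count, the data size, and the loop body are placeholders, not the original generateantenna_dscale code.

```matlab
% Sketch: rerun the loop on a smaller pool and measure data transfer.
% All sizes and the loop body are hypothetical stand-ins.
c = parcluster;                        % default cluster profile
numWorkers    = min(20, c.NumWorkers); % about half of the 40 cores
numIterations = 200;                   % hypothetical iteration count
dataSize      = 500;                   % deliberately small to start with

% Reuse an existing pool only if it already has the size we want.
pool = gcp('nocreate');
if isempty(pool) || pool.NumWorkers ~= numWorkers
    delete(gcp('nocreate'));           % does nothing if no pool is open
    pool = parpool(c, numWorkers);
end

ticBytes(pool);                        % start counting bytes transferred
results = zeros(1, numIterations);
parfor i = 1:numIterations
    data = rand(dataSize);             % stand-in for the real per-iteration data
    results(i) = sum(data(:));
end
transferred = tocBytes(pool);          % one row per worker: bytes to / from worker

disp(transferred);
```

If this completes, increase dataSize step by step while watching RAM in top or Task Manager, and note the size at which a worker aborts.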
Kedar Pakhare on 8 Apr 2023
Have you resolved the issue you were facing with parallel computing and workers aborting? I am facing an almost identical issue and do not know how to debug it.
It would be a great help if you could share what worked for you. Thanks in advance.
