How does parfor distribute resources to each iteration when there are more cores than iterations?

Here is the situation: I want to run a few iterations with parfor on a cluster where I have access to more cores than iterations, although the number of cores is still finite. Each iteration consumes a great deal of time and memory, say the total memory of 10 cores, but the average number of cores allocated to each iteration is less than 10. How should I configure parfor to avoid the out-of-memory error? Does parfor distribute resources evenly across iterations?

Accepted Answer

Damian Pietrus on 17 Jan 2024
Since you mentioned working on a cluster, the first thing I'd like to address is making sure that you are requesting enough resources from the scheduler. If you are using the default values or requesting too little memory, that could be the cause of the out-of-memory errors you are seeing, rather than the way parfor distributes iterations.
For example, with the Slurm scheduler you can use the --mem-per-cpu flag to request a certain amount of memory per core for your job. If you request 10 GB per CPU and use 10 workers in your parpool, you will request a total of 100 GB of memory on the compute node (10 GB x 10 CPUs = 100 GB). Increase the memory request as needed until your job has the memory it needs.
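The arithmetic above can be sketched in MATLAB. This only illustrates the calculation and builds the flag string; the variable names are made up for illustration, and how the flag actually reaches sbatch depends on your cluster's integration scripts.

```matlab
% Per-core memory request and pool size from the example above
memPerCpuGB = 10;   % memory requested per core, in GB
numWorkers  = 10;   % workers in the parpool (one core per worker is assumed)

% Total memory the job asks the scheduler for
totalGB = memPerCpuGB * numWorkers;

% Slurm accepts a unit suffix on --mem-per-cpu, such as G for gigabytes
slurmFlag = sprintf('--mem-per-cpu=%dG', memPerCpuGB);

fprintf('%s with %d workers requests %d GB in total\n', ...
    slurmFlag, numWorkers, totalGB);
```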
You may need to take a look at the documentation for your cluster to see what types of nodes are available and how much memory each node has. Often clusters have "high mem" nodes in addition to standard nodes.
Give that a shot first and let me know how it goes.
  2 Comments
XYC
XYC on 18 Jan 2024
Thanks very much for your reply. Before posting this question, I ran some tests using the default mem-per-cpu. Say I have 5 iterations (each needs 21 GB of memory) to run with parfor, and I request 16 cores (each core has 4 GB). It works if I request 6 cores (24 GB total > 21 GB) and run the iterations sequentially, but running the 5 iterations with parfor on all 16 cores triggers an out-of-memory error, because parfor runs all of them simultaneously (64 GB requested in total < 105 GB needed).
I wonder if there is a way to tell parfor to run the first three of the 5 iterations simultaneously and then the remaining two. Or can I somehow specify the number of cores assigned to each iteration, and so prevent parfor from running them all simultaneously?
In any case, I will first try increasing the memory per CPU.
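One possible way to cap how many iterations run at once, sketched under assumptions: parfor accepts an optional second argument giving the maximum number of workers to use. The numbers are the ones from this comment, and the loop body is a placeholder rather than the real computation. Whether the memory of the idle workers is actually available to the busy ones depends on how the scheduler enforces memory limits (per job or per process), which is not known here.

```matlab
% Figures taken from the comment above
memPerCoreGB = 4;    % memory attached to each requested core
numCores     = 16;   % cores requested from the scheduler
memPerIterGB = 21;   % memory needed by one iteration
numIters     = 5;    % total iterations

% How many iterations fit into the total allocation at the same time
totalMemGB    = memPerCoreGB * numCores;
maxConcurrent = floor(totalMemGB / memPerIterGB);

% Second parfor argument: upper bound on the number of workers used
results = zeros(1, numIters);
parfor (i = 1:numIters, maxConcurrent)
    results(i) = i^2;   % placeholder for the real, memory-heavy work
end

fprintf('Ran %d iterations with at most %d at a time\n', numIters, maxConcurrent);
```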
XYC on 18 Jan 2024
I tried changing the --mem-per-cpu flag but got the following error:
sbatch: error: Batch job submission failed: Job submission failed because too much memory was requested relative to the number of CPUs requested. The requested memory:CPU should be kept no more than DefMemPerCPU.
It seems I have no choice but to request more cores.


More Answers (2)

Edric Ellis on 18 Jan 2024
When a parfor loop executes on a parallel pool, PCT will divide up the entire loop range 1:N into a series of batches known as "sub-ranges". This division depends on the total number of iterates in the full range, and the number of workers available. By default, PCT tries to send batches that are big enough that communication costs are minimised, but also small enough that in the case where different iterations take different amounts of time, the workers are kept as busy as possible.
You can override this default division by using parforOptions. For example, to force parfor to send iterates individually, you can do this:
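% Use fixed-size sub-ranges of one iteration, so iterations are sent individually
pfo = parforOptions(gcp(), RangePartitionMethod="fixed", SubrangeSize=1);
parfor (i = 1:10, pfo)
    % Record the process ID of the worker that ran iteration i
    out(i) = feature('getpid');
end
% Display which worker process handled each iteration
out'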
  3 Comments
Edric Ellis on 19 Jan 2024
Yes, with that scheme, if there are more workers than iterations, you will not keep them all busy. (Sorry, somehow I overlooked that part of your question - I was mostly answering to clarify exactly how the iteration batching works).
I think the solution here is going to involve working with your cluster configuration to ensure the workers you get each have enough memory to run an iteration, rather than anything related to parfor specifically.
One way to head towards what I think you need is to use multi-threaded workers. From the MATLAB client side, you specify the NumThreads property in your parallel.Cluster profile. This needs support from your cluster integration scripts (I'm not an expert here I'm afraid). The result though is that you can end up with a parallel pool where each worker process is multi-threaded. This should have the desired side-effect that each worker process has more memory available. Also, if your time-consuming function happens to take advantage of MATLAB's intrinsic multithreading (e.g. large matrix operations), then that will also be a benefit.
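A rough sketch of the multi-threaded worker idea, using the figures from earlier in the thread (21 GB per iteration, 4 GB per core). The profile name 'myCluster' is hypothetical, and the cluster part is switched off by default because it only works if your profile and integration scripts honour NumThreads, which is an assumption here.

```matlab
% Figures quoted earlier in the thread
memPerIterGB = 21;   % memory needed by one iteration
memPerCoreGB = 4;    % memory attached to each core
numIters     = 5;    % one worker per iteration

% Threads (cores) each worker needs so that its share of memory is large enough
threadsPerWorker = ceil(memPerIterGB / memPerCoreGB);
coresNeeded      = threadsPerWorker * numIters;
fprintf('%d threads per worker, %d cores in total\n', threadsPerWorker, coresNeeded);

% Set to true only after replacing the hypothetical profile name below
runOnCluster = false;
if runOnCluster
    c = parcluster('myCluster');        % hypothetical profile name
    c.NumThreads = threadsPerWorker;    % each worker process gets this many threads
    pool = parpool(c, numIters);        % one multi-threaded worker per iteration
end
```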



Matt J on 18 Jan 2024
Edited: Matt J on 18 Jan 2024
You have to consider the number of workers, M, not just the number of cores, C.
If M is the number of parpool workers and N is the number of loop iterations, then each worker will be assigned a consecutive subsequence of N/M iterations, which it will run serially. So the amount of memory each worker will try to consume is the amount of memory used by N/M of your loop iterations, whatever that is.
The amount of memory each worker has available to it is RAMTotal/M.
Obviously, this math becomes slightly more complicated if N/M or C/M are not integers, and possibly also if your cores are tied up by applications other than MATLAB.
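To make this concrete, here is a small sketch that applies the even-split model above to the numbers quoted earlier in the thread (N = 5, 64 GB in total, 21 GB per iteration). It assumes a worker only needs to hold one iteration in memory at a time, and it treats the even split as an approximation, since the default partitioning described in the other answer is dynamic.

```matlab
N            = 5;    % loop iterations
ramTotalGB   = 64;   % total memory of the allocation
memPerIterGB = 21;   % memory needed by one iteration (assumed freed between iterations)

% Try a few pool sizes M and check whether one iteration fits in a worker's share
for M = [16 5 3]
    itersPerWorker = ceil(N / M);        % iterations each busy worker runs serially
    memPerWorkerGB = ramTotalGB / M;     % RAMTotal/M from the formula above
    fits = memPerWorkerGB >= memPerIterGB;
    fprintf('M = %2d: %d iteration(s) per worker, %.1f GB per worker, fits = %d\n', ...
        M, itersPerWorker, memPerWorkerGB, fits);
end
```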

Release

R2022a