parpool use existing slurm job

Question

Frank on 13 Jan 2024

https://uk.mathworks.com/matlabcentral/answers/2069406-parpool-use-existing-slurm-job

Commented: Damian Pietrus on 17 Jan 2024

If I've already started an interactive Slurm job with 5 nodes, how can I start a parpool using the resources allocated to that existing job?

Answer 1

Damian Pietrus on 16 Jan 2024

https://uk.mathworks.com/matlabcentral/answers/2069406-parpool-use-existing-slurm-job#answer_1390956

Hello Frank,

Unfortunately, once you've started a Slurm job across 5 nodes, there's currently no way for MATLAB to open a parpool on all of those resources. Instead, we have a few options depending on whether MATLAB Parallel Server is installed on the cluster. You can run the ver command to list all of the products installed on the cluster.

If you do not see MATLAB Parallel Server listed in the ver output, you will only be able to use the resources of one node. You can call c=parcluster('local') and start as many workers in your pool as there are cores on the machine. If you do see it listed, you can set up a Slurm cluster profile as mentioned in the other answers.
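For the single-node case, a minimal sketch of how the worker count could be chosen from inside a Slurm job is shown below. It assumes a Linux node where nproc is available; pool_size and my_single_node_code are hypothetical names, and SLURM_CPUS_PER_TASK is only set by Slurm when --cpus-per-task was requested.

```sh
#!/bin/sh
# Hypothetical helper: pick a worker count for a single-node pool.
# Prefer the CPU count Slurm granted to this task; otherwise fall back
# to the number of processors visible on the node.
pool_size() {
    if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
        echo "$SLURM_CPUS_PER_TASK"
    else
        nproc
    fi
}

# Launch MATLAB only if it is on the PATH (e.g. after "module load").
# The pool size is capped at the profile's NumWorkers so parpool does not
# ask for more workers than the local profile allows.
# my_single_node_code is a placeholder for your own function.
if command -v matlab >/dev/null 2>&1; then
    matlab -batch "c = parcluster('local'); parpool(c, min($(pool_size), c.NumWorkers)); my_single_node_code"
fi
```

This still uses only one node of the allocation; spanning several nodes needs MATLAB Parallel Server and a Slurm cluster profile.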

Once you have a Slurm cluster profile set up, you can either submit a batch job like Edric mentioned in his answer or use a Slurm batch script that calls some MATLAB code which opens up a parpool across multiple nodes.

Take the following code as an example. The first script starts up a session on one machine and asks for just enough resources to launch the MATLAB client. Notice that we are only asking for one machine and one core. This is because the "my_parallel_code" file will request the bulk of the resources in a separate call to the scheduler. Also note that the total time for this "outer" job needs to be long enough for the "inner" parallel job to queue up, run, and finish.

#!/bin/sh
#SBATCH -n 1                            # 1 instance of MATLAB
#SBATCH --cpus-per-task=1               # 1 core per instance
#SBATCH --mem-per-cpu=4gb               # 4 GB RAM per core
#SBATCH --time=2:30:00                  # 2 hours, 30 minutes
# Add MATLAB to system path
# Modify as needed to load your MATLAB module/version
module load matlab/R2023b
# Run code 
matlab -batch my_parallel_code

Here in "my_parallel_code", you can see that we first load the Slurm cluster profile we set up earlier. Next, we call parpool with the total number of workers that we want. This submits a separate job request to Slurm and is where you should request the bulk of your resources. Finally, we run our parallel code across the multiple workers/nodes that we requested.

function my_parallel_code
% Bring cluster profile into the workspace
c = parcluster('Slurm_profile_name_here');
% Specify total number of workers here.  This can span multiple nodes
if isempty(gcp('nocreate'))
    parpool(c, 50);
end
% Actual parallel code here
parfor idx = 1:100   % placeholder loop range
    % ... loop body goes here ...
end
% Function end
end

When you look at the squeue output, you should see two jobs -- the outer sbatch job that called MATLAB initially and is using one core and then the inner parpool job (with a name of Job#) that spans multiple nodes.

Let me know if you have any questions!

2 Comments

Frank on 16 Jan 2024

TY, @Damian Pietrus. We do have Parallel Server installed and we're able to launch multi-node Matlab jobs using parpool. =)

In your example, since parpool launches a separate job, the batch job would then have to wait for the resources requested by parpool to become available before the original job could proceed. Depending on how long it takes for those resources to be allocated, the batch job could time out before the parpool job starts.

Is there maybe a more manual way to launch parallel services on all of the nodes allocated to a running job that could then be "assembled" as a parpool?

Damian Pietrus on 17 Jan 2024

Hey Frank,

As you noticed, the walltime of the outer job needs to be long enough for the inner job to sit in the queue, get resources, start up, and then finish. I've found that increasing that value to include a decent buffer is usually adequate, though this may vary depending on how busy the cluster is.
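As a rough illustration of sizing the outer job, the sketch below adds an estimated queue wait, an estimated run time, and a buffer, and prints the result in the HH:MM:SS form that Slurm's --time option accepts. The helper name outer_walltime and the example estimates are assumptions; use numbers that reflect your own cluster.

```sh
#!/bin/sh
# Hypothetical helper: build a --time value for the outer job from three
# estimates, all given in whole minutes.
outer_walltime() {
    queue_min=$1    # expected queue wait of the inner parpool job
    run_min=$2      # expected run time of the parallel code
    buffer_min=$3   # safety margin for a busy cluster
    total=$((queue_min + run_min + buffer_min))
    printf '%02d:%02d:00\n' $((total / 60)) $((total % 60))
}

# Example: 30 min queue wait + 90 min run time + 30 min buffer
outer_walltime 30 90 30
```

The printed value can be passed on the command line, for example sbatch --time="$(outer_walltime 30 90 30)" outer_job.sh, where outer_job.sh stands in for your own script; options given on the sbatch command line take precedence over #SBATCH lines in the script.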

Right now there isn't a way of "assembling" the pool in the way that you've mentioned. However, we can work around this issue by using batch jobs like Edric mentioned. We have two options here.

If you'd like to keep everything on the cluster, your Slurm batch script can call an .m file that then calls the MATLAB batch command and exits shortly after. With this option, the "outer" job only stays open long enough to submit the other jobs to the queue. Once they are successfully submitted, the script/job can exit and leave the other jobs in the queue to start on their own. Once they are finished, you can start MATLAB to fetch the results.

% Example batch script
c=parcluster('Slurm_Profile_Here');
job1 = c.batch(@my_parallel_function, num_outputs, {function_inputs}, 'Pool', num_pool_workers);
...
jobN = c.batch(@my_parallel_function, num_outputs, {function_inputs}, 'Pool', num_pool_workers);
disp('Jobs Submitted')
exit

The other option is to set up your own machine for remote job submission. As long as your cluster accepts SSH and SFTP connections, you can submit the same batch jobs from the previous example from your own machine, avoiding the "outer" job entirely -- it submits the Parallel Server job to the queue directly. If you're interested in this, let me know and I can include some more info.

Answer 2

Venkat Siddarth Reddy on 15 Jan 2024

https://uk.mathworks.com/matlabcentral/answers/2069406-parpool-use-existing-slurm-job#answer_1389411

Edited: Venkat Siddarth Reddy on 15 Jan 2024

Hi @Frank,

I understand that you are trying to create a parallel pool using the resources of a running Slurm job.

To achieve this, you will need to set up a Slurm profile in MATLAB, as this will enable the "parpool" function to access the Slurm cluster.

Additionally, you will need to use the "MATLAB Parallel Server" since you're aiming to utilize multiple nodes of the cluster.

For more information on Slurm profiles and using the MATLAB Parallel Server with Slurm jobs, please refer to the following documentation:

I hope this helps!

1 Comment

Frank on 15 Jan 2024

I only see instructions for profiles that would start a new Slurm job. How can I use the resources allocated to my running Job?

Answer 3

Edric Ellis on 16 Jan 2024

https://uk.mathworks.com/matlabcentral/answers/2069406-parpool-use-existing-slurm-job#answer_1390591

Further to @Venkat Siddarth Reddy's suggestion, what you might want to do is something like this:

clus = parcluster('mySlurmProfile') % set up as per Venkat's post
job = batch(clus, 'myScriptThatUsesParfor', Pool=20);

What that batch command does is:

1. Launch a job on your SLURM cluster with 21 workers.
2. When the job starts, the first worker becomes a non-interactive "client", and the remaining 20 are connected up as a parpool for use by that client.
3. That first worker executes your program myScriptThatUsesParfor.
4. Any parfor loops etc. inside that script use the 20 workers.

Is that the sort of thing you want to achieve?
