Can't use as many cores as available

Apologies first because I probably don't know enough about this to adequately describe the problem, but please ask questions (and maybe help me figure out how to answer them).
I'm using my company's computing resources to run Matlab and run scripts with parallel computing. There's some setup, but, as best I understand, once I'm in Matlab, I'm essentially remoted into a computer with a lot of cores, though those are a shared resource. To run a script, I start with parpool('local',<number of cores>). This only works if I request 64 or fewer cores.
Prior to this, I set up the cluster by validating it with 5 cores and then resetting the number of workers to 512, which is the maximum we're allowed. Before setting up the parpool, I have checked the number of cores available with feature('numCores') to ensure I'm not requesting more than available and/or checked the number of idle cores by running cee-lan-status -c in the terminal (I assume this is a standard command, but I don't know bash).
When I request more than 64 cores, I always get this error:
Error using parpool (line 149)
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'local' in the Cluster Profile Manager.
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line 678)
Failed to initialize the interactive session.
Error using parallel.internal.pool.InteractiveClient>iThrowIfBadParallelJobStatus (line 789)
The interactive communicating job failed with no message.
What else can I try?

Answers (1)

I believe what you're saying is that from your desktop machine you connect to some server. From there, you run MATLAB on a machine that has 512 cores.
You validate your local profile by changing the worker count to 5, then set it back to 512. Note: not sure which version of MATLAB, but there's a field in the validation to state the number of workers to use so that you don't have to toggle this.
You then run the following
nc = feature('numcores');
p = parpool('local',nc);
The caveat is that you can't be sure you actually have access to all nc cores. That's just what MATLAB sees. Are there other applications/users running on the same machine?
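One way to avoid over-requesting is to cap the pool size at the smaller of the detected core count and the profile's worker limit. This is only a sketch: nRequest is a name introduced here for illustration, and neither number says how many cores are actually idle on a shared machine.

```matlab
% Sketch: never request more workers than the machine or the profile allows.
c  = parcluster('local');            % same profile name used in this thread
nc = feature('numcores');            % cores MATLAB detects, not cores that are idle
nRequest = min(nc, c.NumWorkers);    % nRequest is an illustrative name
fprintf('Detected %d cores; profile allows %d workers; requesting %d.\n', ...
        nc, c.NumWorkers, nRequest);
p = parpool(c, nRequest);
```

If other users are busy on the node, a smaller value than nRequest may still be the safer choice.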
I've heard of issues crossing 64 local workers, but I think that was more on Windows and not Linux (which I'm guessing is what you're running on?). To capture the error, try the following:
c = parcluster('local');
p = c.parpool(nc);
% Parpool errors out. Look at log file.
c.getDebugLog(c.Jobs(end));
% After you look at the error, delete the job
c.Jobs(end).delete

7 Comments

Most of what you said is correct. There are others on the machine, but I can see how many idle cores there are by running cee-lan-status -c from the terminal. Also, not that it matters, but some servers have more than 512 cores, but 512 is the max our company allows one person to use at once. This is on Linux.
I tried your code and requested 128 cores. It failed to start the parallel pool, but I got an error when requesting the debug log. It says that it must be invoked with either a single communicating job or a single task of an independent job. Can you help me with that error? Thanks.
Sorry, I left off the first line. Try this:
pctconfig('preservejobs',true);
nc = feature('numcores');
c = parcluster('local');
p = c.parpool(nc);
c.getDebugLog(c.Jobs(end));
% Use c.Jobs since presumably 'p' is not a handle to a pool (b/c of error
% above)
c.Jobs(end).delete
cee-lan-status -c is being run on the compute node where the MATLAB session is running, right?
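The sequence above can also be wrapped in try/catch so that a script keeps going after the pool fails and the failed job is picked up in the same run. A minimal sketch, assuming (as the code above does) that the failed pool job is the last entry in c.Jobs; nWorkers, err, and failedJob are illustrative names, and 128 is simply the size that failed earlier in this thread.

```matlab
pctconfig('preservejobs', true);     % from the comment above: keep the failed job around
c = parcluster('local');
nWorkers = 128;                      % size that failed earlier in this thread
try
    p = parpool(c, nWorkers);
    fprintf('Pool started with %d workers.\n', p.NumWorkers);
catch err
    % Show the full error text without aborting the script
    fprintf('parpool failed:\n%s\n', getReport(err, 'extended', 'hyperlinks', 'off'));
    if ~isempty(c.Jobs)
        failedJob = c.Jobs(end);     % assumed to be the failed pool job
        getDebugLog(c, failedJob);   % same call as above, on the preserved job
        % When finished with the log: delete(failedJob)
    end
end
```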
Daniel on 23 Mar 2021
Edited: Daniel on 23 Mar 2021
So that worked, but the output appears to be longer than what fits in the command window. Is there something I can find in there to share? Apparently it creates many matlab_crash_dump files. Would there be something in one of those?
Also, I run cee-lan-status -c in the terminal before ssh'ing into one of the servers, so it shows me all of them.
Thanks for your continued help on this. It's been really frustrating trying to get this to work and my company HPC support keeps sending me back to Matlab help.
EDIT: I attached the most recent matlab crash file.
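When the debug log is too long for the Command Window, one option is to capture it as text and write it to a file that can be attached or searched. A sketch, assuming the failed pool job is still the last entry in c.Jobs (which needs pctconfig('preservejobs',true) to have been set before the failed parpool call); logFile is a hypothetical location.

```matlab
c = parcluster('local');
failedJob = c.Jobs(end);                       % assumed: the preserved failed pool job
logText = getDebugLog(c, failedJob);           % with an output argument, the log is returned as text
logFile = fullfile(tempdir, 'parpool_debug_log.txt');   % hypothetical file name
fid = fopen(logFile, 'w');
assert(fid ~= -1, 'Could not open %s for writing.', logFile);
fprintf(fid, '%s', logText);
fclose(fid);
fprintf('Wrote %d characters to %s\n', numel(logText), logFile);
```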
All of the matlab_crash_dump files should be similar. In the comment section here, you'll see a paperclip. Attach one of the files.
My guess is that it's a ulimit issue. On the machine running MATLAB and the local pool, can you run this from your Linux terminal
% ulimit -a
I believe we've recently fixed MATLAB to "do the right thing" in these cases (lots of workers running on single node with low limits). I can't find it in front of me, but I thought we were supposed to try to increase the ulimit or throw a warning, etc.
If it is a ulimit issue, you might be able to fix this in /etc/security/limits.conf (you'd need to consider the impact on other users). Alternatively, I've updated the ulimit in $matlabroot/bin/matlab, but not when it's being used by the local scheduler -- this was with other schedulers (e.g. PBS).
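Local workers are started from the MATLAB session, so presumably they run under the limits that session has, which may differ from those of a separate terminal. A sketch for checking the two most relevant soft limits from inside MATLAB, assuming bash is available on the node; flags and labels are illustrative names.

```matlab
% Query the soft limits that this MATLAB session runs under.
flags  = {'-n', '-u'};                          % open files, max user processes
labels = {'open files', 'max user processes'};
for k = 1:numel(flags)
    [status, out] = system(sprintf('bash -c "ulimit -S %s"', flags{k}));
    if status == 0
        fprintf('%-20s soft limit: %s\n', labels{k}, strtrim(out));
    else
        fprintf('Could not query %s (status %d)\n', labels{k}, status);
    end
end
```

Replacing -S with -H in the command reports the hard limits instead.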
I attached a matlab crash file to my previous comment. The output of ulimit -a is:
-bash-4.2$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 17801888
max locked memory       (kbytes, -l) 113320732
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65536
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
I didn't really understand what you were saying about ulimit and possibilities there. If this confirms that for you, will you let me know what I should tell those in charge so they can fix it? Thanks again for your help!
My only suggestion is to set "max user processes" and "open files" to "unlimited". It's best to speak to your administrator about how to make the change permanent (and not just in a temporary shell).
If that doesn't fix it, contact Technical Support (support@mathworks.com) to see if they can help.
OK, I've passed that on to my administrator. If they can (or are willing to) change anything, I'll update here as to how it goes. Thanks for your help!

Release: R2020b
Asked: 22 Mar 2021
Commented: 24 Mar 2021