Can't use as many cores as available
Apologies first because I probably don't know enough about this to adequately describe the problem, but please ask questions (and maybe help me figure out how to answer them).
I'm using my company's computing resources to run Matlab and run scripts with parallel computing. There's some setup, but, as best I understand, once I'm in Matlab, I'm essentially remoted into a computer with a lot of cores, though those are a shared resource. To run a script, I start with parpool('local',<number of cores>). This only works if I request 64 or fewer cores.
Prior to this, I set up the cluster by validating it with 5 cores and then resetting the number of workers to 512, which is the maximum we're allowed. Before setting up the parpool, I have checked the number of cores available with feature('numCores') to ensure I'm not requesting more than available and/or checked the number of idle cores by running cee-lan-status -c in the terminal (I assume this is a standard command, but I don't know bash).
When I request more than 64 cores, I always get this error:
Error using parpool (line 149)
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'local' in the Cluster Profile Manager.
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line 678)
Failed to initialize the interactive session.
Error using parallel.internal.pool.InteractiveClient>iThrowIfBadParallelJobStatus (line 789)
The interactive communicating job failed with no message.
What else can I try?
Answers (1)
Raymond Norris
on 22 Mar 2021
I believe what you're saying is that from your desktop machine you connect to some server. From there, you run MATLAB on a machine that has 512 cores.
You validate your local profile by changing the worker count to 5, then set it back to 512. Note: not sure which version of MATLAB, but there's a field in the validation to state the number of workers to use so that you don't have to toggle this.
You then run the following
nc = feature('numcores');
p = parpool('local',nc);
The caveat is that you can't be sure that you actually have access to nc cores. That's just what MATLAB sees. Are there other applications/users running on the same machine?
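One way to guard against over-requesting is to cap the pool size at the smaller of the detected core count and the profile's worker limit. This is only a minimal sketch: nRequested is a name introduced here for illustration, and it does not account for cores that other users on the shared machine are already using.

```matlab
% Sketch: never ask for more workers than MATLAB detects or the profile allows.
c  = parcluster('local');            % local cluster profile
nc = feature('numcores');            % cores MATLAB sees (not necessarily free)
nRequested = min(nc, c.NumWorkers);  % hypothetical name for the capped size
p  = parpool(c, nRequested);
```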
I've heard of issues crossing 64 local workers, but I think that was more on Windows and not Linux (which I'm guessing is what you're running on?). To capture the error, try the following:
c = parcluster('local');
p = c.parpool(nc);
% Parpool errors out. Look at log file.
c.getDebugLog(c.Jobs(end))   % no semicolon, so the log is displayed
% After you look at the error, delete the job
c.Jobs(end).delete
7 Comments
Daniel
on 23 Mar 2021
Raymond Norris
on 23 Mar 2021
Sorry, I left off the first line. Try this:
pctconfig('preservejobs',true);
nc = feature('numcores');
c = parcluster('local');
p = c.parpool(nc);
c.getDebugLog(c.Jobs(end))   % no semicolon, so the log is displayed
% Use c.Jobs since presumably 'p' is not a handle to a pool (b/c of error
% above)
c.Jobs(end).delete
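Because parpool throws when the pool fails to start, a script stops before it reaches the getDebugLog line. The same steps can be wrapped in try/catch so the log is still shown. This is a sketch under the same assumptions as the code above (local profile, failed job preserved); it is untested on the poster's system.

```matlab
% Sketch: start the pool and, if that fails, show the error and the job log.
pctconfig('preservejobs', true);    % keep the failed pool job around
nc = feature('numcores');
c  = parcluster('local');
try
    p = parpool(c, nc);
catch err
    disp(err.message);              % top-level parpool error
    if ~isempty(c.Jobs)
        % Log of the most recent job, which should be the failed pool job
        disp(getDebugLog(c, c.Jobs(end)));
    end
end
```

Once the log has been read, the preserved job can be removed with c.Jobs(end).delete as shown above.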
cee-lan-status -c is being run on the compute node where the MATLAB session is running, right?
Raymond Norris
on 23 Mar 2021
All of the matlab_crash_dump files should be similar. In the comment section here, you'll see a paperclip. Attach one of the files.
My guess is that it's a ulimit issue. On the machine running MATLAB and the local pool, can you run this from your Linux terminal
% ulimit -a
I believe we've recently fixed MATLAB to "do the right thing" in these cases (lots of workers running on single node with low limits). I can't find it in front of me, but I thought we were supposed to try to increase the ulimit or throw a warning, etc.
If it is a ulimit issue, you might be able to fix this in /etc/security/limits.conf (you'd need to consider the impact on other users). Alternatively, I've updated the ulimit in $matlabroot/bin/matlab, but not when it's being used by the local scheduler; that was with other schedulers (e.g. PBS).
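The limits can also be checked from inside the MATLAB session, which shows the values that the MATLAB process and its local workers actually inherit. The snippet below is a rough sketch that assumes a Linux node with bash on the PATH; the variable names are illustrative.

```matlab
% Sketch: query the soft limits seen by this MATLAB process (Linux + bash assumed).
[s1, maxProcs]  = system('bash -c "ulimit -u"');   % max user processes
[s2, openFiles] = system('bash -c "ulimit -n"');   % open files
if s1 == 0 && s2 == 0
    fprintf('max user processes: %s\n', strtrim(maxProcs));
    fprintf('open files:         %s\n', strtrim(openFiles));
else
    warning('Could not query ulimit from within MATLAB.');
end
```

If either value is small relative to the number of workers requested, that would be consistent with the ulimit guess, though it does not prove it.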
Daniel
on 23 Mar 2021
Raymond Norris
on 23 Mar 2021
My only suggestion is to set "max user processes" and "open files" to "unlimited". Best to speak to your administrator about how to make the change permanent (and not just in a temporary shell).
Daniel
on 24 Mar 2021