Matlab Parallel Computing on Cluster - File not found (Task8-32.in.mat)

Hi,
I am trying to use MATLAB on a cluster with multiple nodes.
For now, I'm trying with 2 nodes of 16 cores each.
I have generated a new Generic Cluster profile using the plugin scripts for Sun Grid Engine (SGE).
The independent job validation is working fine, while the spmd, pool and parpool tests fail (only if I use more than 1 node!).
Looking at the job logs, I saw that the problem was related to mw_mpiexec (MPI was crashing).
I switched to a different MPI (mpich-4.1.1) and MPI no longer crashes. However, the MATLAB instances on the different nodes cannot find the files that the validation tests generate automatically.
I have attached the validation log file.
Could you please help me solve this issue?
Thank you,
Antonio

Answers (1)

Raymond Norris on 16 May 2023
Hi @Antonio Cioffi. I'm not sure why mpiexec is crashing, but I can tell you why you're getting validation issues. When you switch MPI libraries, you need to point MATLAB to the correct libmpi.so. When you say you've tried a different MPI, how did you go about it? You'll need to create your own mpiLibConf.m file to point to your libmpi.so (see the documentation for more info).
I can tell that MATLAB is not loading the correct library because of the following log lines:
[28] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task1"
[31] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task1"
[30] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task1"
[29] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task1"
The [number] is the MPI rank. This is telling you that each worker is creating a file in the folder Job25 with the filename Task1. They're all "task 1" because they haven't started properly: they're not aware that there are other MPI ranks. What it should show is:
[28] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task28"
[31] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task31"
[30] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task30"
[29] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task29"
That output would indicate that each worker has started up correctly. Since your log shows Task1 for every rank instead, MATLAB must not be finding the correct libmpi.so.
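As a rough way to run the same check on a longer log, the sketch below counts how many distinct task locations the ranks report. The function name `check_task_locations` and the idea of passing a log file path are assumptions for illustration; only the "About to find job and task using locations" line format comes from the log above.

```shell
# Hypothetical helper: flag a log in which every MPI rank reports the
# same task location, which suggests the ranks never saw each other.
check_task_locations() {
    # $1: path to a job log (the file name is up to you)
    pattern='About to find job and task using locations'
    # Number of ranks that logged the message
    ranks=$(grep -c "$pattern" "$1")
    # Number of distinct "JobN/TaskM" locations among those lines
    distinct=$(grep "$pattern" "$1" \
        | sed 's/.*"\(Job[0-9]*\/Task[0-9]*\)".*/\1/' \
        | sort -u | wc -l | tr -d ' ')
    if [ "$ranks" -gt 1 ] && [ "$distinct" -eq 1 ]; then
        echo "suspect: $ranks ranks share one task location"
    else
        echo "ok: $distinct distinct task locations for $ranks ranks"
    fi
}
```

Call it as `check_task_locations path/to/job.log`. A "suspect" result matches the failing log shown above; it does not by itself say which library was loaded.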
I would suggest you contact support@mathworks.com and they can help you figure out why SGE can't run multi-node (you have passwordless-ssh between the compute nodes, right?).
  2 Comments
Antonio Cioffi on 16 May 2023
Hi @Raymond Norris, thank you very much for the reply.
A couple of days ago I was able to solve the issue using MATLAB's mw_mpiexec.
The problem was that mw_mpiexec was trying to communicate between nodes over Ethernet and not InfiniBand!
So I just had to add the flag -iface ib0 to the mw_mpiexec command line to run everything correctly!
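A minimal sketch of how the flag could be added defensively, so that nodes without an ib0 interface fall back to the default. The helper name `iface_flag` and the `NET_ROOT` override are hypothetical; the sketch assumes Linux, where network interfaces are listed under /sys/class/net. Where exactly the mw_mpiexec command line is built depends on your plugin scripts and is not shown here.

```shell
# Hypothetical helper: print "-iface <name>" only when the interface
# exists on this node; print nothing otherwise.
# NET_ROOT is an assumed override for testing; default is Linux sysfs.
iface_flag() {
    iface=${1:-ib0}
    root=${NET_ROOT:-/sys/class/net}
    if [ -e "$root/$iface" ]; then
        echo "-iface $iface"
    fi
}
```

The output can then be spliced into the launch command, for example `mw_mpiexec $(iface_flag ib0) ...`, with the remaining arguments left as your wrapper script already passes them.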
Raymond Norris on 18 May 2023
Hi @Antonio Cioffi, could you email me the error message you were getting?
I want to clarify what's happening, though. Note that running
./mpiexec -info | grep device
will display
--with-device=ch3:nemesis
Therefore, MPI is using shared memory for intranode communication and TCP for internode communication (the default for nemesis, per https://www.mpich.org/static/downloads/3.2.1/mpich-3.2.1-README.txt). If it had been built with one of
--with-device=ch3:nemesis:mxm (Mellanox InfiniBand)
--with-device=ch3:nemesis:ofi
--with-device=ch4:ucx
then traffic would be going natively over IB. Instead, I believe what you are getting is IPoIB (IP over InfiniBand).
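The check above can be scripted roughly as follows. Both function names are hypothetical, and the classification only covers the device strings listed in this comment; anything else is reported as unknown rather than guessed.

```shell
# Hypothetical helper: read `mpiexec -info` text on stdin and print the
# value of the first --with-device configure option found.
mpich_device() {
    sed -n 's/.*--with-device=\([A-Za-z0-9:]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: classify a device string using only the cases
# named in the comment above.
classify_mpich_device() {
    case "$1" in
        ch3:nemesis:mxm|ch3:nemesis:ofi|ch4:ucx)
            echo "native fabric netmod" ;;
        ch3:nemesis)
            echo "shared memory + TCP" ;;
        *)
            echo "unknown" ;;
    esac
}
```

Usage, assuming mpiexec is in the current directory: `classify_mpich_device "$(./mpiexec -info | mpich_device)"`. A "shared memory + TCP" result combined with -iface ib0 is consistent with traffic going over IPoIB rather than native InfiniBand.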

Release

R2021b
