The parallel pool shut down because the client lost connection to worker

Related to the question asked in https://it.mathworks.com/matlabcentral/answers/2058094-aws-matlab-parallel-server-what-is-the-best-strategy, I followed that strategy as a proof of concept (28 simulations on 28 of 32 workers, across 2 machines with 16 workers each), but I encountered the following error:
"The parallel pool shut down because the client lost connection to worker 21. Check the network connection or restart the parallel pool with 'parpool'"
I don't understand why there's a network connection issue, since MATLAB and Simulink have been running entirely on the AWS Cloud.
Also, how can I continue execution on the remaining parallel workers even if an error occurs?
Thank you very much

8 Comments

Hello,
Could you provide some more information about your setup:
  • Are you on Windows or Linux?
  • Are you starting your pool on just one machine, or are you using Parallel Server across multiple nodes?
  • Are you running Simulink simulations? If so, could you share some code on how you're submitting the jobs?
It might also be worth getting some logs to see if there's anything specific causing the workers to lose connection between themselves.
c=parcluster;
setenv('MDCE_DEBUG','true')
pctconfig('preservejobs',true);
% Run your job here
Once your job fails, navigate to the following directory and find the newest folder, which should contain a "Job#.log"
c.JobStorageLocation
This file will be quite long, so feel free to either include any snippets with obvious errors, or to send along the entire log to support@mathworks.com who should be able to help you out further.
Hi Damian,
thank you for your suggestions. In the next run, I'll try to activate the debug mode and find the job#.log
I could provide more information about my setup.
I use a MATLAB client on an EC2 c4.2xlarge instance with 1 TB of storage, running Windows. From the MATLAB client I run AWS Parallel Server from Cloud Center, with a Linux EC2 m5.xlarge head node and 2 EC2 c6a.8xlarge instances, each with a maximum of 16 workers and 1 TB of storage. Both the head node and the worker machines run Linux.
So yes, I'm using Parallel Server across multiple nodes.
Yes, I'm running Simulink simulations and yes, I can share the following snippet:
nsimulation = 32;
n = nsimulation;
in(1:n) = Simulink.SimulationInput('Name_Simulation');
for k = n:-1:1
    in(k) = in(k).setBlockParameter('Name_Simulation/V1', 'Value', num2str(V1(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V2', 'Value', num2str(V2(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V3', 'Value', num2str(V3(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V4', 'Value', num2str(V4(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V5', 'Value', num2str(V5(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V6', 'Value', num2str(V6(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V7', 'Value', num2str(V7(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V8', 'Value', num2str(V8(k)));
end
in = in.setModelParameter('LoggingToFile', 'On', 'LoggingFileName', 'out.mat');
parpool("aws_parallel_server")
out = parsim(in, 'UseFastRestart', 'on', 'ShowProgress', 'on', 'ShowSimulationManager', 'on');
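On the second part of the question (continuing on the remaining workers when one run errors), a hedged sketch: parsim accepts a 'StopOnError' name-value option; with 'StopOnError' set to 'off', a failed run records its error in the corresponding Simulink.SimulationOutput rather than aborting the whole batch. (Option and property names are from my recollection of the parsim documentation; verify against your release.)

```matlab
% Run the batch, tolerating individual simulation errors (assumes 'in' from above)
out = parsim(in, 'UseFastRestart', 'on', 'ShowProgress', 'on', ...
             'StopOnError', 'off');   % keep going if one run errors

% Afterwards, inspect which runs failed and why
for k = 1:numel(out)
    if ~isempty(out(k).ErrorMessage)
        fprintf('Run %d failed: %s\n', k, out(k).ErrorMessage);
    end
end
```

Note that this only handles errors inside individual simulations; it does not recover a pool that has shut down because a worker connection was lost.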
Thanks for sharing that additional information. Could you also try starting a smaller pool of workers (10 or fewer) to see if this makes a difference? I'm investigating some other similar issues and am curious whether the behavior changes with the number of workers in the pool.
I also ran 16 simulations with a pool of 16 workers on a single machine. In that case everything is okay. So I suppose the issues occur when there is a large number of workers and machines (in my tries: 26, 28, and 32 workers across two machines), even though the workers are equally distributed between the machines. I hope to find an optimal solution for this scenario.
Since the single machine with the smaller number of workers seems to complete successfully, some further investigation will be necessary to figure out why the other jobs fail. If you do end up reaching out to support@mathworks.com, you can CC me directly (dpietrus@mathworks.com). We will most likely need to take a look at both the client and cluster logs.
Hi Damian. I enabled debug mode, and MATLAB says that the job storage location is on the head node (which is an EC2 Linux instance). Unfortunately, I can't establish a connection to a machine that was created by MATLAB. How can I do this?
Before we try accessing those logs, I have one more thing for you to try since the behavior seems to be similar to an issue I ran into with another user. After starting up your MATLAB client before running any jobs, run the following command:
setenv('MW_PCT_TRANSPORT_HEARTBEAT_INTERVAL', '600')
In this case, we are setting a communication timeout in the cluster to 600 seconds (10 minutes). Try running your code again and let me know how things go. If it's successful, I'll pass it along to our development team.
I had a similar problem on my MacBook Air M2 running Sonoma 14.4.1 and MATLAB R2024a (24.1.0.2537033), where a parfor loop kept crashing. After using
setenv('MW_PCT_TRANSPORT_HEARTBEAT_INTERVAL', '600')
some improvement was seen, but it still kept crashing (although after a longer time). I then used the following:
setenv('MW_PCT_TRANSPORT_HEARTBEAT_INTERVAL', '6000')
instead, and I have not experienced any crashes since. The crash used to happen both when storing data on a local SSD drive and on an external SSD connected via USB-C.


Answers (1)

When parallel workers start up, a timeout is in place that assumes the workers will all be ready within 30 seconds of one another. This may not always be enough time depending on your execution environment.
If you happen to look at the debug log, you may see a line similar to:
Timeout expired while calling MPI_Init_thread
We can adjust the timeout by adding an environment variable with a new timeout value. After starting up your MATLAB client before running any jobs, run the following command:
setenv('MW_PCT_TRANSPORT_HEARTBEAT_INTERVAL', '600');
In this case, we are setting a communication timeout in the cluster to 600 seconds (10 minutes). Try opening the pool again and note the results. If this fixes the issue, please reach out to support@mathworks.com to see how this can be permanently added to your cluster profile.
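Putting the steps above together, a minimal sketch of a client session (assuming the cluster profile name 'aws_parallel_server' from the question; your profile name may differ):

```matlab
% Raise the transport heartbeat timeout BEFORE any pool is started,
% so the environment variable is picked up when the workers launch.
setenv('MW_PCT_TRANSPORT_HEARTBEAT_INTERVAL', '600');   % 600 s = 10 minutes

c = parcluster('aws_parallel_server');   % hypothetical profile name
pool = parpool(c);                       % open the pool with the longer timeout

% ... run parsim / parfor workloads as usual ...

delete(pool);                            % shut the pool down cleanly when done
```

Because setenv only affects the current MATLAB session, the variable must be set again after each client restart, which is why a permanent cluster-profile change via support is the better long-term fix.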

Asked: 11 Dec 2023
Answered: 26 Feb 2025
