The parallel pool shut down because the client lost connection to worker

Related to the question asked in https://it.mathworks.com/matlabcentral/answers/2058094-aws-matlab-parallel-server-what-is-the-best-strategy, I followed that strategy as a proof of concept (28 simulations on 28 of 32 workers, across 2 machines with 16 workers each), but I encountered the following error:
"The parallel pool shut down because the client lost connection to worker 21. Check the network connection or restart the parallel pool with 'parpool'"
I don't understand why there's a network connection issue, since MATLAB and Simulink have been running entirely on the AWS Cloud.
Also, how can I continue execution on the remaining parallel workers even if an error occurs?
Thank you very much

8 Comments

Hello,
Could you provide some more information about your setup:
  • Are you on Windows or Linux?
  • Are you starting your pool on just one machine, or are you using Parallel Server across multiple nodes?
  • Are you running Simulink simulations? If so, could you share some code on how you're submitting the jobs?
It might also be worth getting some logs to see if there's anything specific causing the workers to lose connection between themselves.
c=parcluster;
setenv('MDCE_DEBUG','true')
pctconfig('preservejobs',true);
% Run your job here
Once your job fails, navigate to the following directory and find the newest folder, which should contain a "Job#.log"
c.JobStorageLocation
This file will be quite long, so feel free to either include any snippets with obvious errors, or to send along the entire log to support@mathworks.com who should be able to help you out further.
Hi Damian,
thank you for your suggestions. In the next run, I'll try to activate the debug mode and find the job#.log
I could provide more information about my setup.
I use a MATLAB client on an EC2 c4.2xlarge instance with 1 TB of storage, running Windows. From the MATLAB client I run AWS Parallel Server from Cloud Center, with a Linux EC2 m5.xlarge head node and 2 EC2 c6a.8xlarge instances, each with a maximum of 16 workers and 1 TB of storage. Both the head node and the worker machines run Linux.
So yes, I'm using Parallel Server across multiple nodes.
Yes, I'm running Simulink simulations and yes, I can share the following snippet:
nsimulation = 32;
n = nsimulation;
in(1:n) = Simulink.SimulationInput('Name_Simulation');
for k = n:-1:1
    in(k) = in(k).setBlockParameter('Name_Simulation/V1', 'Value', num2str(V1(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V2', 'Value', num2str(V2(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V3', 'Value', num2str(V3(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V4', 'Value', num2str(V4(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V5', 'Value', num2str(V5(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V6', 'Value', num2str(V6(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V7', 'Value', num2str(V7(k)));
    in(k) = in(k).setBlockParameter('Name_Simulation/V8', 'Value', num2str(V8(k)));
end
in = in.setModelParameter('LoggingToFile', 'On', 'LoggingFileName', 'out.mat');
parpool("aws_parallel_server")
out = parsim(in, 'UseFastRestart', 'on', 'ShowProgress', 'on', 'ShowSimulationManager', 'on');
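On the second part of the question (continuing on the remaining workers when one run errors), a hedged sketch: parsim accepts a 'StopOnError' name-value option; with 'StopOnError' set to 'off', a failed run records its error in the corresponding Simulink.SimulationOutput rather than aborting the whole batch. (Option and property names are from my recollection of the parsim documentation; verify against your release.)

```matlab
% Run the batch, tolerating individual simulation errors (assumes 'in' from above)
out = parsim(in, 'UseFastRestart', 'on', 'ShowProgress', 'on', ...
             'StopOnError', 'off');   % keep going if one run errors

% Afterwards, inspect which runs failed and why
for k = 1:numel(out)
    if ~isempty(out(k).ErrorMessage)
        fprintf('Run %d failed: %s\n', k, out(k).ErrorMessage);
    end
end
```

Note that this only handles errors inside individual simulations; it does not recover a pool that has shut down because a worker connection was lost.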
Thanks for sharing that additional information. Could you also try starting a smaller pool of workers (10 or fewer) to see if this makes a difference? I'm investigating some other similar issues and am curious whether the behavior changes with the number of workers in the pool.
I also ran 16 simulations with a pool of 16 workers on a single machine. In that case everything is okay. So I suppose the issues occur when there is a large number of workers and machines (in my tries: 26, 28, and 32 workers across two machines), even though the workers are equally distributed between the machines. I hope to find an optimal solution for this scenario.
Since the single machine with the smaller number of workers seems to complete successfully, some further investigation will be necessary to figure out why the other jobs fail. If you do end up reaching out to support@mathworks.com, you can CC me directly (dpietrus@mathworks.com). We will most likely need to take a look at both the client and cluster logs.
Hi Damian. I enabled debug mode, and MATLAB says that the job storage location is on the head node (which is an EC2 Linux instance). Unfortunately, I can't establish a connection to a machine that was created by MATLAB. How can I do this?
Before we try accessing those logs, I have one more thing for you to try since the behavior seems to be similar to an issue I ran into with another user. After starting up your MATLAB client before running any jobs, run the following command:
setenv('MW_PCT_TRANSPORT_HEARTBEAT_INTERVAL', '600')
In this case, we are setting a communication timeout in the cluster to 600 seconds (10 minutes). Try running your code again and let me know how things go. If it's successful, I'll pass it along to our development team.
I had a similar problem on my MacBook Air M2 running Sonoma 14.4.1 and MATLAB R2024a (24.1.0.2537033), where a parfor loop kept crashing. After using
setenv('MW_PCT_TRANSPORT_HEARTBEAT_INTERVAL', '600')
some improvement was seen, but it still kept crashing (although after a longer time). I then used the following:
setenv('MW_PCT_TRANSPORT_HEARTBEAT_INTERVAL', '6000')
instead, and I have not experienced any crashes since. The crash used to happen both when storing data on a local SSD drive and on an external SSD connected via USB-C.


Answers (1)

When parallel workers start up, a timeout is in place that assumes the workers will all be ready within 30 seconds of one another. This may not always be enough time depending on your execution environment.
If you happen to look at the debug log, you may see a line similar to:
Timeout expired while calling MPI_Init_thread
We can adjust the timeout by adding an environment variable with a new timeout value. After starting up your MATLAB client before running any jobs, run the following command:
setenv('MW_PCT_TRANSPORT_HEARTBEAT_INTERVAL', '600');
In this case, we are setting a communication timeout in the cluster to 600 seconds (10 minutes). Try opening the pool again and note the results. If this fixes the issue, please reach out to support@mathworks.com to see how this can be permanently added to your cluster profile.
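Putting the steps above together, a minimal sketch of a client session (assuming the cluster profile name 'aws_parallel_server' from the question; your profile name may differ):

```matlab
% Raise the transport heartbeat timeout BEFORE any pool is started,
% so the environment variable is picked up when the workers launch.
setenv('MW_PCT_TRANSPORT_HEARTBEAT_INTERVAL', '600');   % 600 s = 10 minutes

c = parcluster('aws_parallel_server');   % hypothetical profile name
pool = parpool(c);                       % open the pool with the longer timeout

% ... run parsim / parfor workloads as usual ...

delete(pool);                            % shut the pool down cleanly when done
```

Because setenv only affects the current MATLAB session, the variable must be set again after each client restart, which is why a permanent cluster-profile change via support is the better long-term fix.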

Asked: 11 Dec 2023
Answered: 26 Feb 2025
