Dear all,
Since the upgrade from Slurm 22.05 to 23.11.1 we have been having problems with the communication between the slurmctld and slurmd processes. We run a cluster with 183 nodes and almost 19,000 cores. Unfortunately, some nodes are on a different network, which prevents full internode communication. A network topology file and the setting TopologyParam=RouteTree are used to make sure no slurmd-to-slurmd communication happens between nodes on different networks.
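For completeness, a sketch of the kind of configuration meant here; the switch and node names are placeholders, not our real ones:

In slurm.conf:
TopologyPlugin=topology/tree
TopologyParam=RouteTree

In topology.conf:
SwitchName=net1 Nodes=node[001-100]
SwitchName=net2 Nodes=node[101-183]
SwitchName=top Switches=net1,net2

With RouteTree, forwarded slurmd traffic stays within each switch block, so nodes on the two networks never need to talk to each other directly.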
In the new Slurm version we see the following issues, which did not appear in 22.05:
1. slurmd processes accumulate many network connections in CLOSE-WAIT (or CLOSE_WAIT, depending on the tool used), causing the processes to hang when we try to restart slurmd.
When checking for CLOSE-WAIT connections we see the following behaviour:

Recv-Q Send-Q Local Address:Port Peer Address:Port Process
1 0 10.5.2.40:6818 10.5.0.43:58572 users:(("slurmd",pid=1930095,fd=72),("slurmd",pid=1930067,fd=72))
1 0 10.5.2.40:6818 10.5.0.43:58284 users:(("slurmd",pid=1930095,fd=8),("slurmd",pid=1930067,fd=8))
1 0 10.5.2.40:6818 10.5.0.43:58186 users:(("slurmd",pid=1930095,fd=22),("slurmd",pid=1930067,fd=22))
1 0 10.5.2.40:6818 10.5.0.43:58592 users:(("slurmd",pid=1930095,fd=76),("slurmd",pid=1930067,fd=76))
1 0 10.5.2.40:6818 10.5.0.43:58338 users:(("slurmd",pid=1930095,fd=19),("slurmd",pid=1930067,fd=19))
1 0 10.5.2.40:6818 10.5.0.43:58568 users:(("slurmd",pid=1930095,fd=68),("slurmd",pid=1930067,fd=68))
1 0 10.5.2.40:6818 10.5.0.43:58472 users:(("slurmd",pid=1930095,fd=69),("slurmd",pid=1930067,fd=69))
1 0 10.5.2.40:6818 10.5.0.43:58486 users:(("slurmd",pid=1930095,fd=38),("slurmd",pid=1930067,fd=38))
1 0 10.5.2.40:6818 10.5.0.43:58316 users:(("slurmd",pid=1930095,fd=29),("slurmd",pid=1930067,fd=29))
The first IP address is that of the compute node, the second that of the node running slurmctld. The nodes can communicate using these IP addresses just fine.
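For reference, the stuck connections above were gathered with ss (or a similar tool); something along these lines should show them, assuming slurmd listens on the default port 6818:

ss -tnp state close-wait '( sport = :6818 )'

The -p option needs root to show the owning slurmd PIDs.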
2. slurmd cannot be properly restarted:

[2024-01-18T10:45:26.589] slurmd version 23.11.1 started
[2024-01-18T10:45:26.593] error: Error binding slurm stream socket: Address already in use
[2024-01-18T10:45:26.593] fatal: Unable to bind listen port (6818): Address already in use
This is probably because of the old slurmd processes still being stuck in CLOSE-WAIT; they can only be killed with SIGKILL (kill -9).
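As a stop-gap, the stale slurmd holding the port can be found and killed by hand; a sketch, assuming the default slurmd port 6818:

ss -tlnp '( sport = :6818 )'   # find the PID still holding the listen socket
kill -9 <pid>                  # SIGKILL, since the stuck process ignores ordinary termination
systemctl restart slurmd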
3. We see jobs stuck in the completing (CG) state, probably due to communication issues between slurmctld and slurmd. The slurmctld sends repeated kill requests, but these do not seem to be acknowledged by the slurmd on the node. This happens more often with large job arrays, or generally when many jobs start at the same time. However, this could just be a biased observation (i.e., it is more noticeable on large job arrays because there are more jobs to fail in the first place).
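For anyone checking for the same symptom, the affected jobs show up in the COMPLETING state, e.g.:

squeue --states=COMPLETING
scontrol show job <jobid>   # shows which nodes the job is still waiting on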
4. Since the new version we also see messages like:

[2024-01-17T09:58:48.589] error: Failed to kill program loading user environment
[2024-01-17T09:58:48.590] error: Failed to load current user environment variables
[2024-01-17T09:58:48.590] error: _get_user_env: Unable to get user's local environment, running only with passed environment

The effect of this is that users run with the wrong environment and can't load the modules for the software needed by their jobs. This leads to many job failures.
The issue appears to be similar to the one described at https://bugs.schedmd.com/show_bug.cgi?id=18561. In that case the site downgraded the slurmd clients to 22.05, which got rid of the problems. We have now downgraded slurmd on the compute nodes to 23.02.7, which also seems to work around the issue.
Does anyone know of a better solution?
Kind regards,
Fokke Dijkstra
Do you have a firewall between the slurmd and the slurmctld daemons? If yes, do you know what kind of idle timeout that firewall has for expiring idle sessions? I ran into something somewhat similar, but for me it was between slurmctld and slurmdbd: a recent change left one direction between those two daemons idle unless certain operations occurred, and we had a firewall device between them that was expiring those idle sessions. In our case 23.11.1 brought a fix for that specific issue. I never had issues between slurmctld and slurmd, though our firewall does not sit between those two layers.
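If a firewall with an idle timeout does sit in the path, the TCP keepalive settings on both ends may be worth a look; stock Linux only sends the first keepalive probe after two hours of idleness, which is longer than many firewall session timeouts. A quick way to inspect them (run on both the slurmctld and slurmd hosts):

sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes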
-- Brian D. Haymore
University of Utah Center for High Performance Computing
155 South 1452 East RM 405, Salt Lake City, UT 84112
Phone: 801-558-1150
http://bit.ly/1HO1N2C
Dear Brian,
Thanks for the hints; I think you are correctly pointing at some network connection issue. I've disabled firewalld on the control host, but that unfortunately did not help. The processes stuck in CLOSE-WAIT do suggest that network connections are not properly terminated. I've tried adjusting some network settings based on the suggestions for high-throughput environments, but this didn't help either. I've also updated to 23.11.2, which did not fix the problem either. I am able to reproduce the issue by swamping a node with array jobs (128 jobs per node). This results in many of these jobs no longer being tracked correctly by the scheduler: output for these jobs is missing and hundreds of slurmd connections are stuck in CLOSE-WAIT.
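A minimal reproducer along these lines should be enough; the partition name and array size are placeholders, the point is simply to pack a 128-core node with single-core array tasks and keep more queued behind them:

sbatch --array=1-256 --partition=regular --ntasks=1 --wrap='hostname; sleep 30'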
With slurmd at 23.02.7 and slurmctld and slurmdbd at 23.11.2 the problem does not show up. By the way, I discovered that the job submission hosts must be at the same Slurm version as the compute nodes: having the submission hosts still at 23.11 resulted in startup failures for jobs using Intel MPI. For the record, the error message I got was:

[mpiexec@node24] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on node24 (pid 1909692, exit code 65280)
[mpiexec@node24] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node24] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node24] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
[mpiexec@node24] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@node24] Possible reasons:
[mpiexec@node24] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node24] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node24] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node24] 4. slurm bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
I spent a whole day debugging the issue and could not find any clue as to what was causing it. OpenMPI jobs work fine. Finally I discovered that making sure the job submission hosts and the compute nodes run the same Slurm version fixes the issue.
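A quick sketch of how to check which version each component is running, for anyone verifying a mixed setup:

srun --version                                 # on a submission host
slurmd -V                                      # on a compute node
scontrol show config | grep -i SLURM_VERSION   # version reported by slurmctld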
For now we'll stay in this configuration, to avoid having to get rid of all running and waiting jobs for a slurmctld downgrade. I'll test new Slurm 23.11 releases as they appear to see if they fix the issue.
Kind regards,
Fokke
This is a quick update on the status. Upgrading to Slurm 23.11.4 fixed the issue. It appears we were bitten by the following bug, fixed in 23.11.4:

-- Fix stuck processes and incorrect environment when using --get-user-env

This was triggered for us because we had set SBATCH_EXPORT=NONE for our users.
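For context, SBATCH_EXPORT=NONE makes every sbatch behave as if --export=NONE had been given, and in that case Slurm implicitly loads the user's environment on the node as if --get-user-env had been specified, so the affected code path was hit for every job. A sketch of the pattern involved (the job script name is just a placeholder):

export SBATCH_EXPORT=NONE   # equivalent to passing --export=NONE to every sbatch
sbatch job.sh               # --export=NONE implies --get-user-env, the code path fixed above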
Kind regards,
Fokke
-- Fokke Dijkstra <f.dijkstra@rug.nl>
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA Groningen, The Netherlands