[slurm-users] Issues with Slurm 23.11.1

Fokke Dijkstra f.dijkstra at rug.nl
Tue Jan 23 11:00:03 UTC 2024


Dear all,

Since the upgrade from Slurm 22.05 to 23.11.1 we have been having problems
with the communication between the slurmctld and slurmd processes.
We are running a cluster with 183 nodes and almost 19,000 cores.
Unfortunately, some nodes are on a different network, which prevents full
internode communication. A network topology file and TopologyParam=RouteTree
are used to make sure that no slurmd-to-slurmd communication happens
between nodes on different networks.

In the new Slurm version we see the following issues, which did not appear
in 22.05:

1. slurmd processes accumulate many network connections in the CLOSE-WAIT
state (or CLOSE_WAIT, depending on the tool used), which causes the
processes to hang when we try to restart slurmd.

When checking for connections in CLOSE-WAIT we see the following:
Recv-Q Send-Q Local Address:Port  Peer Address:Port Process

1      0          10.5.2.40:6818     10.5.0.43:58572
users:(("slurmd",pid=1930095,fd=72),("slurmd",pid=1930067,fd=72))
1      0          10.5.2.40:6818     10.5.0.43:58284
users:(("slurmd",pid=1930095,fd=8),("slurmd",pid=1930067,fd=8))
1      0          10.5.2.40:6818     10.5.0.43:58186
users:(("slurmd",pid=1930095,fd=22),("slurmd",pid=1930067,fd=22))
1      0          10.5.2.40:6818     10.5.0.43:58592
users:(("slurmd",pid=1930095,fd=76),("slurmd",pid=1930067,fd=76))
1      0          10.5.2.40:6818     10.5.0.43:58338
users:(("slurmd",pid=1930095,fd=19),("slurmd",pid=1930067,fd=19))
1      0          10.5.2.40:6818     10.5.0.43:58568
users:(("slurmd",pid=1930095,fd=68),("slurmd",pid=1930067,fd=68))
1      0          10.5.2.40:6818     10.5.0.43:58472
users:(("slurmd",pid=1930095,fd=69),("slurmd",pid=1930067,fd=69))
1      0          10.5.2.40:6818     10.5.0.43:58486
users:(("slurmd",pid=1930095,fd=38),("slurmd",pid=1930067,fd=38))
1      0          10.5.2.40:6818     10.5.0.43:58316
users:(("slurmd",pid=1930095,fd=29),("slurmd",pid=1930067,fd=29))

The first IP address is that of the compute node, the second that of the
node running slurmctld. The nodes can communicate using these IP addresses
just fine.
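The number of stuck sockets can be tracked over time. A minimal sketch,
assuming iproute2's `ss` and the default SlurmdPort of 6818 (adjust to your
slurm.conf); the counting is split into a filter function so the live `ss`
call stays separate:

```shell
# count_slurmd_close_wait: count slurmd entries in `ss -tnp`-style output.
count_slurmd_close_wait() {
    # stdin: lines from `ss -tnp state close-wait '( sport = :6818 )'`
    grep -c 'slurmd' || true   # grep -c exits nonzero when the count is 0
}

# Live invocation (port 6818 is the default SlurmdPort, an assumption):
#   ss -tnp state close-wait '( sport = :6818 )' | count_slurmd_close_wait
```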

2. slurmd cannot be properly restarted
[2024-01-18T10:45:26.589] slurmd version 23.11.1 started
[2024-01-18T10:45:26.593] error: Error binding slurm stream socket: Address
already in use
[2024-01-18T10:45:26.593] fatal: Unable to bind listen port (6818): Address
already in use

This is probably caused by the old slurmd processes stuck in CLOSE-WAIT,
which still hold the listen port and can only be killed with signal 9
(SIGKILL).
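Until the root cause is fixed, a cleanup step before the restart can locate
the stale PIDs. A sketch, assuming GNU grep and the `ss` process annotations
shown above; the actual `kill -9` and restart are deliberately left as
comments, since they should not be run blindly:

```shell
# extract_pids: pull the unique PIDs out of `ss -p` process annotations,
# e.g. users:(("slurmd",pid=1930095,fd=72),("slurmd",pid=1930067,fd=72)).
extract_pids() {
    grep -oP 'pid=\K[0-9]+' | sort -un
}

# Hypothetical cleanup before a restart (verify the PIDs first):
#   ss -tlnp '( sport = :6818 )' | extract_pids | xargs -r kill -9
#   systemctl restart slurmd
```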

3. We see jobs stuck in the completing (CG) state, probably due to
communication issues between slurmctld and slurmd. The slurmctld sends
repeated kill requests, but these do not seem to be acknowledged by the
slurmd on the node. This happens more often in large job arrays, or
generally when many jobs start at the same time. However, this could be a
biased observation (i.e., it is more noticeable on large job arrays simply
because there are more jobs that can fail in the first place).
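For monitoring, the stuck jobs can be listed from `squeue` output. A sketch;
the `--format` string is one possible choice, any "%i %T %N"-style layout
works:

```shell
# stuck_cg_jobs: filter `squeue --noheader --format="%i %T %N"` output
# down to jobs still in the COMPLETING (CG) state, printing jobid + nodes.
stuck_cg_jobs() {
    awk '$2 == "COMPLETING" { print $1, $3 }'
}

# Live invocation:
#   squeue --noheader --format="%i %T %N" | stuck_cg_jobs
```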

4. Since the upgrade we also see messages like:
[2024-01-17T09:58:48.589] error: Failed to kill program loading user
environment
[2024-01-17T09:58:48.590] error: Failed to load current user environment
variables
[2024-01-17T09:58:48.590] error: _get_user_env: Unable to get user's local
environment, running only with passed environment
The effect is that users run with the wrong environment and can’t load the
modules for the software needed by their jobs. This leads to many job
failures.
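As far as we understand it, slurmd bounds the time it spends loading the
user environment and kills the helper process when it overruns; the first
error above suggests that even this kill fails. The bounded-run pattern can
be sketched as follows (purely illustrative; the real code path inside
Slurm differs):

```shell
# run_with_deadline: run a command, killing it if it exceeds the deadline.
# Mirrors, very roughly, how slurmd bounds the environment-loading helper.
# `timeout` is from coreutils and returns 124 when the deadline is hit.
run_with_deadline() {
    secs="$1"; shift
    timeout "$secs" "$@"
}

# What slurmd effectively attempts (hypothetical invocation, needs root):
#   run_with_deadline 8 su - "$USER" -c env
```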

The issue appears to be somewhat similar to the one described at:
https://bugs.schedmd.com/show_bug.cgi?id=18561
In that case the site downgraded the slurmd clients to 22.05, which got rid
of the problems. We’ve now downgraded slurmd on the compute nodes to
23.02.7, which also seems to work around the issue.

Does anyone know of a better solution?

Kind regards,

Fokke Dijkstra

-- 
Fokke Dijkstra <f.dijkstra at rug.nl>
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA  Groningen, The Netherlands