[slurm-users] nodes reported as "not responding"
r.grosso at gsi.de
Mon Jul 3 13:05:22 UTC 2023
After adding nodes to nodes.conf and simultaneously removing them from
nodes_down.conf, where they had been marked "State=FUTURE", followed by
a reconfigure ("scontrol reconfigure") and a restart of slurmctld,
several of the added nodes were reported as "not responding" in a very
regular time pattern. This happened both for nodes added in 'drain'
state and for nodes added directly to active partitions: for a short
time sinfo showed them in 'partition*', then for about half an hour in
'partition', then in 'partition*' again, and so on. At times the
controller set them to 'down'. All network tests for those nodes were
fine at the very moments when the controller was marking the nodes as
unresponsive.
To better understand the problem, does anyone know how the controller
decides that a node is or is not responding? I would like, in case the
problem reappears, to be able to
reproduce on the command line the conditions which led the controller to
mark some nodes as not responding.
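
As a starting point for capturing the time pattern, here is a minimal
sketch of a polling script. It does not reproduce the controller's own
check; it only records which nodes sinfo flags at each poll. It assumes
that sinfo appends '*' to the state of a node that is not responding (as
described in the sinfo man page) and that "sinfo -N -h -o '%N %T'" prints
one "node state" pair per line. The function names, the 60-second
interval, and the node names in the usage note are hypothetical.

```python
import subprocess
import time


def not_responding(sinfo_output):
    """Return the sorted, de-duplicated node names whose state contains '*'.

    Expects lines of the form "<node> <state>", as assumed to be printed by
    `sinfo -N -h -o "%N %T"`. Lines with any other shape are skipped.
    """
    flagged = set()
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        node, state = parts
        # Assumption: '*' in the state marks a node as not responding.
        if "*" in state:
            flagged.add(node)
    return sorted(flagged)


def poll(interval=60):
    """Print a timestamped list of flagged nodes every `interval` seconds."""
    while True:
        out = subprocess.run(
            ["sinfo", "-N", "-h", "-o", "%N %T"],
            capture_output=True, text=True, check=True,
        ).stdout
        print(time.strftime("%Y-%m-%d %H:%M:%S"), not_responding(out), flush=True)
        time.sleep(interval)
```

Calling poll() on a host where sinfo works and redirecting the output to
a file would give a timeline to compare against the slurmctld log. The
parser can be checked on its own, e.g. not_responding("node1 idle\nnode2
idle*\n") returns ["node2"].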
Does anyone know what could cause the issue? Is it maybe bound to the
activation of 'FUTURE' nodes? In our case it was probably solved by
increasing the TreeWidth parameter (from its default of 50 to more than
the number of nodes) and, in one case, by undraining the nodes.
Our Slurm version: 21.08.8-2