[slurm-users] nodes reported as "not responding"

Mon Jul 3 13:05:22 UTC 2023

Hello,

After adding nodes to nodes.conf and simultaneously removing them from
nodes_down.conf, where they had been marked "State=FUTURE", followed by
"scontrol reconfigure" and a restart of slurmctld, several of the added
nodes were reported as "not responding" in a very regular time
pattern. This happened both for nodes added in 'drain' state and for
nodes added directly to active partitions: for a short time sinfo
showed them in 'partition*', then for about half an hour in
'partition', then in 'partition*' again, and so on. At times the
controller set them to 'down'. All network tests for those nodes
were fine at the very times when the controller was marking them
as unresponsive.

To better understand the problem: does anyone know how the controller
decides whether a node is responding? If the problem reappears, I would
like to be able to reproduce on the command line the conditions that
led the controller to mark some nodes as not responding.
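
As a starting point for such a check, here is a minimal sketch that only
lists which nodes the controller currently flags. It does not reproduce
the controller's own decision. It assumes the text comes from
sinfo -N -h -o "%N %T" and that a "*" in the state field marks a node as
not responding, as described in the sinfo man page. The function name
and the sample node names are hypothetical.

```python
from typing import List


def not_responding_nodes(sinfo_output: str) -> List[str]:
    """Return node names whose state carries the '*' (not responding) flag.

    Expects one "<node> <state>" pair per line, e.g. the output of
    sinfo -N -h -o "%N %T" (assumed format, not verified on 21.08.8-2).
    """
    nodes: List[str] = []
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        name, state = fields
        # With -N a node is listed once per partition, so de-duplicate.
        if "*" in state and name not in nodes:
            nodes.append(name)
    return nodes


# Hypothetical sample output, for illustration only.
SAMPLE = """\
node001 idle
node002 idle*
node003 down*
node002 idle*
node004 drained
"""

flagged = not_responding_nodes(SAMPLE)
```

Running this periodically and logging the result with a timestamp would
at least make the regular time pattern visible.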

Does anyone know what could cause the issue? Could it be related to the
activation of 'FUTURE' nodes? In our case it was probably solved by
increasing the TreeWidth parameter from its default of 50 to more than
the number of nodes, and in one case by undraining the nodes.
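
To illustrate why TreeWidth might matter, here is a rough sketch of how
many forwarding levels a message needs to reach all nodes. It assumes a
simplified model in which every hop fans out to at most TreeWidth nodes.
This is an idealization, not Slurm's actual forwarding code, and the
function name is hypothetical.

```python
def forwarding_levels(num_nodes: int, tree_width: int) -> int:
    """Levels needed to reach num_nodes with a fan-out of tree_width.

    Simplified model: level k can hold up to tree_width**k nodes.
    A result of 1 means every node is contacted directly.
    """
    if tree_width < 1:
        raise ValueError("tree_width must be at least 1")
    if num_nodes <= 0:
        return 0
    levels = 0
    covered = 0
    level_size = 1
    while covered < num_nodes:
        level_size *= tree_width  # capacity of the next level
        covered += level_size
        levels += 1
    return levels
```

In this model, forwarding_levels(51, 50) gives 2, so some nodes are
reached only through other nodes, while any TreeWidth at or above the
node count gives 1, meaning no message depends on an intermediate node.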

Our Slurm version: 21.08.8-2

Thanks, cheers,

     Raffaele