[slurm-users] how does slurmctld determine whether a compute node is not responding?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Mon Jul 11 07:56:34 UTC 2022
On 7/11/22 09:32, taleintervenor at sjtu.edu.cn wrote:
> Recently we found some strange log in slurmctld.log about node not
> responding, such as:
>
> [2022-07-09T03:23:10.692] error: Nodes node[128-168,170-178] not responding
>
> [2022-07-09T03:23:58.098] Node node171 now responding
>
> [2022-07-09T03:23:58.099] Node node165 now responding
>
> [2022-07-09T03:23:58.099] Node node163 now responding
>
> [2022-07-09T03:23:58.099] Node node172 now responding
>
> [2022-07-09T03:23:58.099] Node node170 now responding
>
> [2022-07-09T03:23:58.099] Node node175 now responding
>
> [2022-07-09T03:23:58.099] Node node164 now responding
>
> [2022-07-09T03:23:58.099] Node node178 now responding
>
> [2022-07-09T03:23:58.099] Node node177 now responding
>
> Meanwhile, checking slurmd.log and nhc.log on those nodes shows everything
> seemed to be OK at the reported time.
>
> So we guess slurmctld launched some probe towards those compute nodes and
> did not get a response, leading it to consider those nodes to be not
> responding.
Such node warnings could be caused by a broken network, or by your DNS
servers not responding to lookups so that a name such as "node177" cannot
be resolved, for example.
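To rule out those two failure modes by hand, a minimal diagnostic sketch to
run on the slurmctld host. The node name is just one from the log above, and
port 6818 is the Slurm default SlurmdPort, which may differ on your site
(check `scontrol show config | grep SlurmdPort`):

```shell
#!/bin/sh
# Diagnostics from the slurmctld host for one allegedly unresponsive node.
# NODE and PORT are example values; substitute your own.
NODE=node177
PORT=6818

# 1) Can the controller resolve the node's hostname (DNS or /etc/hosts)?
if getent hosts "$NODE" >/dev/null 2>&1; then
    echo "$NODE: hostname resolves"
else
    echo "$NODE: DNS lookup FAILED"
fi

# 2) Is the slurmd TCP port reachable from the controller?
if timeout 5 bash -c "exec 3<>/dev/tcp/$NODE/$PORT" 2>/dev/null; then
    echo "$NODE: slurmd port $PORT reachable"
else
    echo "$NODE: slurmd port $PORT NOT reachable"
fi
```

If name resolution works but the port check fails, look at firewalls or at
slurmd itself on the node; if resolution fails intermittently, look at your
DNS servers.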
> Then the question is: what probe does slurmctld launch? How does it
> determine whether a node is responsive or non-responsive?
>
> And is it possible to customize slurmctld's behavior for such detection,
> for example the wait timeout or retry count, before it determines that a
> node is not responding?
See the slurm.conf parameters displayed by:
# scontrol show config | grep Timeout
We normally use this:
SlurmctldTimeout = 600 sec
SlurmdTimeout = 300 sec
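As we understand the slurm.conf man page, slurmctld periodically pings each
slurmd, and a node whose slurmd has not responded within SlurmdTimeout
seconds is set DOWN; SlurmctldTimeout is the analogous interval a backup
controller waits for the primary before taking over. A slurm.conf fragment
with the values above (our site's settings, not universal recommendations):

```
# slurm.conf fragment -- values mirror the scontrol output above.
SlurmctldTimeout=600   # backup controller waits this long for the primary
SlurmdTimeout=300      # node set DOWN if slurmd is silent this long
```

Changes take effect after `scontrol reconfigure` or a slurmctld restart.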
/Ole