[slurm-users] how does slurmctld determine whether a compute node is not responding?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Mon Jul 11 07:56:34 UTC 2022
On 7/11/22 09:32, taleintervenor at sjtu.edu.cn wrote:
> Recently we found some strange log in slurmctld.log about node not
> responding, such as:
>
> [2022-07-09T03:23:10.692] error: Nodes node[128-168,170-178] not responding
>
> [2022-07-09T03:23:58.098] Node node171 now responding
>
> [2022-07-09T03:23:58.099] Node node165 now responding
>
> [2022-07-09T03:23:58.099] Node node163 now responding
>
> [2022-07-09T03:23:58.099] Node node172 now responding
>
> [2022-07-09T03:23:58.099] Node node170 now responding
>
> [2022-07-09T03:23:58.099] Node node175 now responding
>
> [2022-07-09T03:23:58.099] Node node164 now responding
>
> [2022-07-09T03:23:58.099] Node node178 now responding
>
> [2022-07-09T03:23:58.099] Node node177 now responding
>
> Meanwhile, checking slurmd.log and nhc.log on those nodes shows everything
> seemed to be OK at the reported time.
>
> So we guess slurmctld launched some probe towards those compute nodes and
> did not get a response, leading it to consider those nodes to be not
> responding.
Such node warnings could be caused by a broken network, or by your DNS
servers not responding to lookups so that a name such as "node177" cannot
be resolved, for example.
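To rule out those two failure modes by hand, a minimal diagnostic sketch to
run on the slurmctld host. The node name is just one from the log above, and
port 6818 is the Slurm default SlurmdPort, which may differ on your site
(check `scontrol show config | grep SlurmdPort`):

```shell
#!/bin/sh
# Diagnostics from the slurmctld host for one allegedly unresponsive node.
# NODE and PORT are example values; substitute your own.
NODE=node177
PORT=6818

# 1) Can the controller resolve the node's hostname (DNS or /etc/hosts)?
if getent hosts "$NODE" >/dev/null 2>&1; then
    echo "$NODE: hostname resolves"
else
    echo "$NODE: DNS lookup FAILED"
fi

# 2) Is the slurmd TCP port reachable from the controller?
if timeout 5 bash -c "exec 3<>/dev/tcp/$NODE/$PORT" 2>/dev/null; then
    echo "$NODE: slurmd port $PORT reachable"
else
    echo "$NODE: slurmd port $PORT NOT reachable"
fi
```

If name resolution works but the port check fails, look at firewalls or at
slurmd itself on the node; if resolution fails intermittently, look at your
DNS servers.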
> Then the question is: what probe does slurmctld launch? How does it
> determine whether a node is responsive or non-responsive?
>
> And is it possible to customize slurmctld's behavior for such detection,
> for example the wait timeout or retry count, before it determines that a
> node is not responding?
See the slurm.conf parameters displayed by:
# scontrol show config | grep Timeout
We normally use this:
SlurmctldTimeout = 600 sec
SlurmdTimeout = 300 sec
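As we understand the slurm.conf man page, slurmctld periodically pings each
slurmd, and a node whose slurmd has not responded within SlurmdTimeout
seconds is set DOWN; SlurmctldTimeout is the analogous interval a backup
controller waits for the primary before taking over. A slurm.conf fragment
with the values above (our site's settings, not universal recommendations):

```
# slurm.conf fragment -- values mirror the scontrol output above.
SlurmctldTimeout=600   # backup controller waits this long for the primary
SlurmdTimeout=300      # node set DOWN if slurmd is silent this long
```

Changes take effect after `scontrol reconfigure` or a slurmctld restart.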
/Ole