[slurm-users] how do slurmctld determine whether a compute node is not responding?

Kamil Wilczek kmwil at mimuw.edu.pl
Mon Jul 11 07:53:23 UTC 2022


I know this does not quite answer your question, but you could additionally
(and maybe you already did this :)) check whether this is a network issue:

* Are the nodes available outside of Slurm during that time? SSH, ping?
* If you have a monitoring system (Prometheus, Icinga, etc.), are
   there any issues reported?
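
A minimal sketch of the first check, in case it helps. It only tests whether
a TCP connection to slurmd can be opened, so it says nothing about slurmd's
internal state. Assumptions on my side: the nodes use the default SlurmdPort
(6818), the script is run from the slurmctld host, and the helper names
(expand_hostlist, port_open, unreachable) are made up for illustration. The
host expression parser handles only a single bracket group.

```python
import re
import socket


def expand_hostlist(expr):
    """Expand a simple host expression such as 'node[128-130,170]'.

    Handles a single bracket group only; not a full Slurm hostlist parser.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", expr)
    if match is None:
        return [expr]  # plain host name, nothing to expand
    prefix, ranges = match.groups()
    hosts = []
    for part in ranges.split(","):
        low, _, high = part.partition("-")
        high = high or low  # a single number such as '170'
        width = len(low)  # preserve zero padding, e.g. n[01-03]
        for number in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{number:0{width}d}")
    return hosts


def port_open(host, port=6818, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, DNS failures
        return False


def unreachable(hosts, port=6818, timeout=3.0):
    """Return the hosts whose port could not be reached."""
    return [h for h in hosts if not port_open(h, port, timeout)]
```

Calling unreachable(expand_hostlist("node[128-168,170-178]")) in a loop around
the time the errors usually appear would show whether the nodes drop off the
network or whether only the Slurm-level communication is affected.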

And lastly, did you try setting the log level to "debug" for both "slurmd"
and "slurmctld" (SlurmdDebug and SlurmctldDebug in slurm.conf)?

Kind Regards

On 11.07.2022 at 09:32, taleintervenor at sjtu.edu.cn wrote:
> Hi, all:
> Recently we found some strange entries in slurmctld.log about nodes not
> responding, such as:
> [2022-07-09T03:23:10.692] error: Nodes node[128-168,170-178] not responding
> [2022-07-09T03:23:58.098] Node node171 now responding
> [2022-07-09T03:23:58.099] Node node165 now responding
> [2022-07-09T03:23:58.099] Node node163 now responding
> [2022-07-09T03:23:58.099] Node node172 now responding
> [2022-07-09T03:23:58.099] Node node170 now responding
> [2022-07-09T03:23:58.099] Node node175 now responding
> [2022-07-09T03:23:58.099] Node node164 now responding
> [2022-07-09T03:23:58.099] Node node178 now responding
> [2022-07-09T03:23:58.099] Node node177 now responding
> Meanwhile, slurmd.log and nhc.log on those nodes all look fine at the
> reported time.
> So we guess that slurmctld ran some kind of check against those compute
> nodes and got no response, which led it to consider those nodes as
> not responding.
> The question, then, is what check slurmctld performs. How does it
> determine whether a node is responsive or not?
> And is it possible to customize slurmctld’s behavior for this check,
> for example the wait timeout or the retry count before a node is declared
> not responding?

Kamil Wilczek  [https://keys.openpgp.org/]
Laboratorium Komputerowe
Wydział Matematyki, Informatyki i Mechaniki
Uniwersytet Warszawski

ul. Banacha 2
02-097 Warszawa

Tel.: 22 55 44 392