[slurm-users] How does slurmctld determine whether a compute node is not responding?

Kamil Wilczek kmwil at mimuw.edu.pl
Mon Jul 11 07:53:23 UTC 2022


Hello,

This does not answer your question directly, but (if you have not
already done so :)) you could also check whether this is a network
problem:

* Are the nodes available outside of Slurm during that time? SSH, ping?
* If you have a monitoring system (Prometheus, Icinga, etc.), are
   there any issues reported?
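
As a rough illustration of the first check, here is a minimal Python
sketch that expands a node expression like the one in your log and
tries a plain TCP connection to each node. It assumes slurmd listens
on port 6818 (the default SlurmdPort; adjust it if your slurm.conf
says otherwise). The function names are made up for this example, and
the expansion only handles the simple "prefix[ranges]" form;
"scontrol show hostnames" is the proper tool for anything else.

```python
import re
import socket


def expand_hostlist(expr):
    """Expand a simple expression such as 'node[128-130,170]' into host names.

    Only a single 'prefix[ranges]' group is supported (illustrative helper,
    not part of Slurm).
    """
    match = re.fullmatch(r"([^\[\],]+)\[([0-9,\-]+)\]", expr)
    if match is None:
        return [expr]  # plain host name, nothing to expand
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        low, _, high = part.partition("-")
        high = high or low  # a single number is a range of one
        width = len(low)  # keep zero padding, e.g. node[001-003]
        for number in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{number:0{width}d}")
    return hosts


def is_reachable(host, port=6818, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refused connections and timeouts
        return False


def unreachable_hosts(expr, port=6818, timeout=3.0):
    """Return the hosts in expr that did not accept a TCP connection."""
    return [h for h in expand_hostlist(expr) if not is_reachable(h, port, timeout)]
```

Calling unreachable_hosts("node[128-168,170-178]") from the controller
host, ideally in a loop around the time the errors usually appear,
would show whether the nodes are unreachable at the TCP level as well.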

Lastly, have you tried setting the log level to "debug" for both
"slurmd" and "slurmctld"?
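
Debug logs get large quickly, so it may help to pull out just the
relevant events and compare their timestamps with slurmd.log and your
monitoring. Below is a small Python sketch that does this. It assumes
the log lines look exactly like the ones you quoted; the regular
expression and the function name are my own, not anything provided by
Slurm.

```python
import re
from datetime import datetime

# Matches the two message shapes quoted in the original post:
#   [ts] error: Nodes node[128-168,170-178] not responding
#   [ts] Node node171 now responding
EVENT = re.compile(
    r"\[(?P<ts>[^\]]+)\] (?:error: )?Nodes? (?P<nodes>\S+) "
    r"(?P<state>not|now) responding"
)


def responsiveness_events(lines):
    """Yield (timestamp, node expression, is_responding) for matching lines."""
    for line in lines:
        match = EVENT.search(line)
        if match is None:
            continue  # unrelated log line
        timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        yield timestamp, match.group("nodes"), match.group("state") == "now"
```

For example, list(responsiveness_events(open("slurmctld.log"))) gives
the events in file order, and subtracting the timestamp of a "not
responding" event from the following "now responding" one gives the
length of the gap as a timedelta.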

Kind Regards
-- 

On 11.07.2022 at 09:32, taleintervenor at sjtu.edu.cn wrote:
> Hi, all:
> 
> Recently we found some strange entries in slurmctld.log about nodes not
> responding, such as:
> 
> [2022-07-09T03:23:10.692] error: Nodes node[128-168,170-178] not responding
> [2022-07-09T03:23:58.098] Node node171 now responding
> [2022-07-09T03:23:58.099] Node node165 now responding
> [2022-07-09T03:23:58.099] Node node163 now responding
> [2022-07-09T03:23:58.099] Node node172 now responding
> [2022-07-09T03:23:58.099] Node node170 now responding
> [2022-07-09T03:23:58.099] Node node175 now responding
> [2022-07-09T03:23:58.099] Node node164 now responding
> [2022-07-09T03:23:58.099] Node node178 now responding
> [2022-07-09T03:23:58.099] Node node177 now responding
> 
> Meanwhile, slurmd.log and nhc.log on those nodes all look fine at the
> reported times.
> 
> So we guess that slurmctld sent some kind of probe to those compute
> nodes, got no response, and therefore considered them not responding.
> 
> The question is: what probe does slurmctld send? How does it determine
> whether a node is responsive or not?
> 
> And is it possible to customize this behavior, for example the timeout
> or the retry count used before a node is declared not responding?
> 

-- 
Kamil Wilczek  [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]
Laboratorium Komputerowe
Wydział Matematyki, Informatyki i Mechaniki
Uniwersytet Warszawski

ul. Banacha 2
02-097 Warszawa

Tel.: 22 55 44 392
https://www.mimuw.edu.pl/
https://www.uw.edu.pl/