[slurm-users] Re: how do slurmctld determine whether a compute node is not responding?

Mon Jul 11 08:27:10 UTC 2022

Hello, Kamil Wilczek:

I agree that the non-responding events may be caused by an unstable network, since our Slurm cluster consists of two groups of nodes in geographically distant locations, connected only by an Ethernet link. The reported nodes are all in one building, while the slurmctld node is in another building.
But we can do nothing about the network infrastructure, so we are more interested in adjusting Slurm so that it tolerates such short non-responding periods.
Or is it possible to tell slurmctld to perform this detection through a proxy node? For example, we have a backup slurmctld node in the same building as the reported compute nodes; if Slurm could use this backup controller to run the detection for that subset of compute nodes, the result might be more stable.

-----Original Message-----
From: Kamil Wilczek <>
Sent: July 11, 2022 15:53
To: Slurm User Community List <slurm-users at lists.schedmd.com>; taleintervenor at sjtu.edu.cn
Subject: Re: [slurm-users] how do slurmctld determine whether a compute node is not responding?

Hello,

I know that this is not quite the answer, but you could additionally check (and maybe you already did this :)) whether this is a network problem:

* Are the nodes available outside of Slurm during that time? SSH, ping?
* If you have a monitoring system (Prometheus, Icinga, etc.), are
   there any issues reported?
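
One way to check reachability outside of Slurm, along the lines of the questions above, is to probe the slurmd TCP port from the slurmctld host at regular intervals. The sketch below is only an illustration: it assumes slurmd listens on the default SlurmdPort 6818 (check slurm.conf if the port was changed), and the names tcp_reachable and probe and the 3-second timeout are arbitrary choices for this example, not anything Slurm provides.

```python
import socket


def tcp_reachable(host, port=6818, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    6818 is the default SlurmdPort; treating it as the right port is an
    assumption about the site configuration.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False


def probe(hosts, port=6818, timeout=3.0):
    """Return the hosts that did not accept a connection."""
    return [h for h in hosts if not tcp_reachable(h, port, timeout)]
```

Calling probe(["node128", "node171"]) from the slurmctld host every few seconds and logging the result with a timestamp would show whether connection failures line up with the "not responding" messages.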

And lastly, did you try setting the log level to "debug" for "slurmd"
and "slurmctld"?

Kind Regards
-- 

On 11.07.2022 at 09:32, taleintervenor at sjtu.edu.cn wrote:
> Hi, all:
> 
> Recently we found some strange entries in slurmctld.log about nodes not
> responding, such as:
>
> [2022-07-09T03:23:10.692] error: Nodes node[128-168,170-178] not responding
> 
> [2022-07-09T03:23:58.098] Node node171 now responding
> 
> [2022-07-09T03:23:58.099] Node node165 now responding
> 
> [2022-07-09T03:23:58.099] Node node163 now responding
> 
> [2022-07-09T03:23:58.099] Node node172 now responding
> 
> [2022-07-09T03:23:58.099] Node node170 now responding
> 
> [2022-07-09T03:23:58.099] Node node175 now responding
> 
> [2022-07-09T03:23:58.099] Node node164 now responding
> 
> [2022-07-09T03:23:58.099] Node node178 now responding
> 
> [2022-07-09T03:23:58.099] Node node177 now responding
> 
> Meanwhile, slurmd.log and nhc.log on those nodes all look fine at the
> reported time.
>
> So we guess that slurmctld ran some check against those compute nodes
> and got no response, which led it to consider those nodes not
> responding.
>
> Then the question is: what detection does slurmctld perform? How does
> it determine whether a node is responsive or non-responsive?
>
> And is it possible to customize slurmctld’s behavior for this
> detection, for example the wait timeout or the retry count before it
> declares a node not responding?
> 
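
To see how long such episodes last, the "not responding" and "now responding" entries quoted above can be paired up per node. The following is a rough sketch, not a Slurm tool: it assumes each log entry sits on a single line in the format shown in the excerpt, and expand_hostlist is a simplified, hypothetical helper that only handles one bracket group (for real hostlist expressions, "scontrol show hostnames" is the authoritative expander).

```python
import re
from datetime import datetime

# Patterns modelled on the two kinds of entries in the quoted excerpt.
DOWN_RE = re.compile(r"^\[(\S+)\] error: Nodes (\S+) not responding")
UP_RE = re.compile(r"^\[(\S+)\] Node (\S+) now responding")
TIME_FMT = "%Y-%m-%dT%H:%M:%S.%f"


def expand_hostlist(expr):
    """Expand e.g. 'node[128-168,170-178]' into individual node names.

    Simplified: one bracket group, no suffix; anything else is returned as is.
    """
    m = re.fullmatch(r"([^\[\]]+)\[([^\]]+)\]", expr)
    if not m:
        return [expr]
    prefix, body = m.groups()
    names = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo
        width = len(lo)  # keep zero padding such as node[001-010]
        for n in range(int(lo), int(hi) + 1):
            names.append(f"{prefix}{n:0{width}d}")
    return names


def outage_durations(lines):
    """Return (node, seconds) pairs for each not-responding/responding pair."""
    pending = {}    # node -> time it was first reported not responding
    durations = []
    for line in lines:
        m = DOWN_RE.match(line)
        if m:
            t = datetime.strptime(m.group(1), TIME_FMT)
            for node in expand_hostlist(m.group(2)):
                pending.setdefault(node, t)
            continue
        m = UP_RE.match(line)
        if m and m.group(2) in pending:
            t = datetime.strptime(m.group(1), TIME_FMT)
            start = pending.pop(m.group(2))
            durations.append((m.group(2), (t - start).total_seconds()))
    return durations
```

For the excerpt above, node171 is reported at 03:23:10.692 and responds again at 03:23:58.098, so its entry comes out at roughly 47.4 seconds. Durations measured this way could be compared against the timeout settings in slurm.conf, such as SlurmdTimeout (see the slurm.conf man page for the exact semantics).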

--
Kamil Wilczek  [https://keys.openpgp.org/] [D415917E84B8DA5A60E853B6E676ED061316B69B]
Laboratorium Komputerowe
Wydział Matematyki, Informatyki i Mechaniki Uniwersytet Warszawski

ul. Banacha 2
02-097 Warszawa

Tel.: 22 55 44 392
https://www.mimuw.edu.pl/
https://www.uw.edu.pl/