[slurm-users] Re: how do slurmctld determine whether a compute node is not responding?

Mon Jul 11 08:37:59 UTC 2022

I think the previous answer from Ole.H.Nielsen at fysik.dtu.dk
might be helpful in that case. But whether the nodes are in the same
building or in some more distant location, the timeout shouldn't
need to exceed a second or two. I do not understand why the timeouts
are set so high by default. Workloads messing up the network?

My defaults are (I did not change them):

BatchStartTimeout       = 10 sec
EioTimeout              = 60
GetEnvTimeout           = 2 sec
MessageTimeout          = 10 sec
PrologEpilogTimeout     = 65534
ResumeTimeout           = 60 sec
SlurmctldTimeout        = 120 sec
SlurmdTimeout           = 300 sec
SuspendTimeout          = 30 sec
TCPTimeout              = 2 sec
UnkillableStepTimeout   = 60 sec

Adjusting "TCPTimeout" might also be a viable option, because it is
the first limit to be reached, and nodes under heavy load might have
problems with network operations, especially if some resources
are not reserved for the OS.
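
To see at a glance which limit is reached first, the values above can
be sorted. This is only a minimal Python sketch: the sample text is
copied from my listing, the helper name parse_timeouts is made up for
illustration, and treating the bare numbers (no "sec" suffix) as
seconds is my assumption. On a live cluster the same text would come
from "scontrol show config".

```python
import re

# Sample copied from the listing above (not read from a live cluster).
SAMPLE = """\
BatchStartTimeout       = 10 sec
EioTimeout              = 60
GetEnvTimeout           = 2 sec
MessageTimeout          = 10 sec
PrologEpilogTimeout     = 65534
ResumeTimeout           = 60 sec
SlurmctldTimeout        = 120 sec
SlurmdTimeout           = 300 sec
SuspendTimeout          = 30 sec
TCPTimeout              = 2 sec
UnkillableStepTimeout   = 60 sec
"""

# Matches lines such as "TCPTimeout = 2 sec" or "EioTimeout = 60".
TIMEOUT_LINE = re.compile(
    r"^\s*(\w*Timeout)\s*=\s*(\d+)(?:\s*sec)?\s*$", re.MULTILINE
)


def parse_timeouts(text):
    """Return {name: value} for every '<Name>Timeout = <int> [sec]' line.

    Assumption: values without a 'sec' suffix are also in seconds.
    """
    return {name: int(value) for name, value in TIMEOUT_LINE.findall(text)}


timeouts = parse_timeouts(SAMPLE)

# Print the limits from shortest to longest.
for name, seconds in sorted(timeouts.items(), key=lambda item: item[1]):
    print(f"{name:<24}{seconds:>6}")
```

With my defaults, TCPTimeout and GetEnvTimeout (2 each) come out as
the shortest limits and SlurmdTimeout (300) as one of the longest.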

-- 

On 11.07.2022 at 10:27, taleintervenor at sjtu.edu.cn wrote:
> Hello, Kamil Wilczek:
> 
> Well, I agree that the non-responding case may be caused by an unstable network, since our Slurm cluster has two groups of nodes that are geographically distant, with only an Ethernet link between them. The reported nodes are all in one building, while the slurmctld node is in another building.
> But we can do nothing about the network infrastructure, so we are more interested in adjusting Slurm so that it tolerates such short periods of non-response.
> Or is it possible to tell slurmctld to do this detection through a certain proxy node? For example, we have a backup slurmctld node in the same building as the reported compute nodes; if Slurm could use this backup controller to run the detection for that subset of compute nodes, the result might be more stable.
> 
> -----Original Message-----
> From: Kamil Wilczek <>
> Sent: July 11, 2022 15:53
> To: Slurm User Community List <slurm-users at lists.schedmd.com>; taleintervenor at sjtu.edu.cn
> Subject: Re: [slurm-users] how do slurmctld determine whether a compute node is not responding?
> 
> Hello,
> 
> I know that this is not quite the answer, but you could additionally (and maybe you already did this :)) check if this is not a network
> problem:
> 
> * Are the nodes available outside of Slurm during that time? SSH, ping?
> * If you have a monitoring system (Prometheus, Icinga, etc.), are
>     there any issues reported?
> 
> And lastly, did you try to set log level to "debug" for "slurmd"
> and "slurmctld"?
> 
> Kind Regards

-- 
Kamil Wilczek  [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]
Laboratorium Komputerowe
Wydział Matematyki, Informatyki i Mechaniki
Uniwersytet Warszawski

ul. Banacha 2
02-097 Warszawa

Tel.: 22 55 44 392
https://www.mimuw.edu.pl/
https://www.uw.edu.pl/