[slurm-users] how do slurm schedule health check when setting "HealthCheckNodeState=CYCLE"

taleintervenor at sjtu.edu.cn taleintervenor at sjtu.edu.cn
Wed Dec 2 06:56:37 UTC 2020


Hello,

 

Our slurm cluster managed about 600+ nodes and I tested to set
HealthCheckNodeState=CYCLE in slurm.conf. According to conf manual, setting
this to CYCLE shall cause slurm to "cycle through running on all compute
nodes through the course of the HealthCheckInterval". So I set
"HealthCheckInterval = 600", and expected the health check time point can be
evenly distributed across the 600 seconds period.

But the test result showed that the earliest checked node is at about
14:19:35, while the latest checked node is at about 14:20:39. A round of the
health checks only distributed across 60+ seconds? And the previous checking
round distributed from 14:08:10 to 14:09:26, it seems the
HealthCheckInterval only control the time interval between two rounds, not
the time range distributed by one round checkings.

So did I mistake the description in conf's manual? And is there any method
can control the health check frequency in one round between different nodes?

 

Thanks.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20201202/9de21c02/attachment.htm>


More information about the slurm-users mailing list