[slurm-users] how do slurm schedule health check when setting "HealthCheckNodeState=CYCLE"

Yair Yarom irush at cs.huji.ac.il
Wed Dec 2 08:56:41 UTC 2020


Hi,

We also noticed this. We eventually placed the max time on the
HealthCheckInterval (65535), and created a systemd.timer which runs the
scripts externally of slurm, with proper intervals and randomized delays.

    Yair.

On Wed, Dec 2, 2020 at 9:03 AM <taleintervenor at sjtu.edu.cn> wrote:

> Hello,
>
>
>
> Our slurm cluster managed about 600+ nodes and I tested to set
> HealthCheckNodeState=CYCLE in slurm.conf. According to conf manual, setting
> this to CYCLE shall cause slurm to “cycle through running on all compute
> nodes through the course of the HealthCheckInterval”. So I set
> “HealthCheckInterval = 600”, and expected the health check time point can
> be evenly distributed across the 600 seconds period.
>
> But the test result showed that the earliest checked node is at about
> 14:19:35, while the latest checked node is at about 14:20:39. A round of
> the health checks only distributed across 60+ seconds? And the previous
> checking round distributed from 14:08:10 to 14:09:26, it seems the
> HealthCheckInterval only control the time interval between two rounds, not
> the time range distributed by one round checkings.
>
> So did I mistake the description in conf’s manual? And is there any method
> can control the health check frequency in one round between different nodes?
>
>
>
> Thanks.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20201202/a3769020/attachment.htm>


More information about the slurm-users mailing list