On Mon, 18 Aug 2025 at 14:01, Bjørn-Helge Mevik via slurm-users <slurm-users@lists.schedmd.com> wrote:

Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com> writes:

> On 8/18/25 13:56, Gerhard Strangar via slurm-users wrote:
>> John Hearns via slurm-users wrote:
>>
>>> I want to run a health check job on all nodes.
>> And using HealthCheckProgram in the slurm.conf would be too easy?
>
> But the HealthCheckProgram=/usr/sbin/nhc is executed only when slurmd
> is started, and possibly when a new job is started.

That depends on HealthCheckInterval and HealthCheckNodeState. If
HealthCheckInterval=N with N > 0, the HealthCheckProgram is run every N
seconds, given that the node is in one of the HealthCheckNodeState
states (default: any state).
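
As a minimal sketch of the relevant slurm.conf lines: the NHC path is the
one from Ole's message, while the 300-second interval is only an
illustrative value, not a recommendation. Note that the default
HealthCheckInterval is 0, which disables the periodic runs.

```
# slurm.conf (fragment, illustrative values)
HealthCheckProgram=/usr/sbin/nhc
# Run the health check every 300 seconds (assumed value; 0 disables it)
HealthCheckInterval=300
# Run regardless of node state (ANY is the default)
HealthCheckNodeState=ANY
```

HealthCheckNodeState can be narrowed (for example to IDLE) if the check
should not run on nodes that are busy with jobs.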

> I think John asked for a way to run NHC on a set of nodes whenever
> the system administrator wants to, and not at some arbitrary time,
> right? ClusterShell is the ideal tool for running such commands in
> parallel across the cluster.

Yes, for running manually, setting up the Slurm groups in clush is the
easiest way, IMO.
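
A rough sketch of such a group definition, modelled on the Slurm example
that ships with ClusterShell. The file location and the exact sinfo
options are assumptions here, so check them against the example file in
your own installation:

```
# /etc/clustershell/groups.conf.d/slurm.conf (assumed location)

# Groups by Slurm partition, e.g. @slurmpart:<partition>
[slurmpart]
map: sinfo -h -o "%N" -p $GROUP
all: sinfo -h -o "%N"
list: sinfo -h -o "%R"

# Groups by Slurm node state, e.g. @slurmstate:idle
[slurmstate]
map: sinfo -h -o "%N" -t $GROUP
all: sinfo -h -o "%N"
```

With something like that in place, running NHC by hand on all idle nodes
would be along the lines of: clush -bw @slurmstate:idle /usr/sbin/nhc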

--
Regards,
Bjørn-Helge Mevik
