Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com> writes:
> On 8/18/25 13:56, Gerhard Strangar via slurm-users wrote:
>> John Hearns via slurm-users wrote:
>>
>>> I want to run a healthcheck job on all nodes.
>> And using HealthCheckProgram in the slurm.conf would be too easy?
>
> But the HealthCheckProgram=/usr/sbin/nhc is executed only when slurmd
> is started, and possibly when a new job is started.
That depends on HealthCheckInterval and HealthCheckNodeState. If
HealthCheckInterval=N with N > 0, the HealthCheckProgram is run every N
seconds, given that the node is in one of the HealthCheckNodeState
states (default: any state).
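
For illustration, a slurm.conf fragment along these lines would enable
periodic runs. The 300-second interval is only an assumed example value,
not a recommendation; HealthCheckInterval defaults to 0, which disables
the periodic execution:

```
# slurm.conf fragment (sketch; the interval value is an assumption)
HealthCheckProgram=/usr/sbin/nhc
# Run the health check every 300 seconds (0, the default, disables it)
HealthCheckInterval=300
# Run regardless of node state (this is the default)
HealthCheckNodeState=ANY
```
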
> I think John asked for a way to run NHC on a set of nodes whenever
> desired by the system administrator, and not at any random time,
> right? ClusterShell is the ideal tool for making such parallel
> commands on the cluster.
Yes, for running manually, setting up the Slurm groups in clush is the
easiest way, IMO.
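
A rough sketch of what that could look like, assuming the Slurm group
bindings shipped as an example with ClusterShell are enabled, so that
"@sp:<partition>" resolves to the nodes of a partition. The source name
"sp" depends on your local groups configuration, and the partition name
"normal" and the helper nhc_cmd are purely hypothetical:

```shell
# Build the clush command line that runs NHC on all nodes of a partition.
# Assumes a ClusterShell group source "sp" that maps Slurm partitions to
# node sets; adjust the source name to match your groups configuration.
nhc_cmd() {
    # -b gathers identical output from several nodes, -w selects the nodes
    printf 'clush -b -w @sp:%s /usr/sbin/nhc\n' "$1"
}

# Dry run: only print the command for the (hypothetical) partition "normal".
nhc_cmd normal
```

Printing the command first makes it easy to check the node selection;
piping the output to sh would then actually run it.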
--
Regards,
Bjørn-Helge Mevik