Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com> writes:
On 8/18/25 13:56, Gerhard Strangar via slurm-users wrote:
John Hearns via slurm-users wrote:
I want to run a health check job on all nodes.
And using HealthCheckProgram in slurm.conf would be too easy?
But the HealthCheckProgram=/usr/sbin/nhc is executed only when slurmd is started, and possibly when a new job is started.
That depends on HealthCheckInterval and HealthCheckNodeState. If HealthCheckInterval=N with N > 0, the HealthCheckProgram is run every N seconds, given that the node is in one of the HealthCheckNodeState states (default: any state).
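For illustration, the three parameters mentioned above might be combined in slurm.conf roughly as follows. This is only a sketch: the 300-second interval is an assumed example value, not one taken from this thread, and ANY is written out explicitly even though it is the default.

```
# slurm.conf fragment (sketch; the interval value is an assumption)
HealthCheckProgram=/usr/sbin/nhc   # path as quoted earlier in this thread
HealthCheckInterval=300            # run every 300 seconds; 0 disables periodic runs
HealthCheckNodeState=ANY           # default: run regardless of node state
```

After changing these values, the configuration has to be re-read by the daemons, for example with `scontrol reconfigure`.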
I think John asked for a way to run NHC on a set of nodes whenever the system administrator wants to, and not at some arbitrary time, right? ClusterShell is the ideal tool for running such parallel commands on the cluster.
Yes, for running manually, setting up the Slurm groups in clush is the easiest way, IMO.
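A manual run along these lines could be wrapped in a small shell function. This is a minimal sketch, not a tested site setup: the function name `run_nhc` and the `CLUSH` override variable are hypothetical conveniences, and the NHC path is the one quoted earlier in the thread.

```shell
# Run NHC on demand on a node set via ClusterShell.
# run_nhc and the CLUSH override are hypothetical names for this sketch;
# setting CLUSH=echo prints the command arguments instead of running clush.
run_nhc() {
    nodes="$1"
    # -b gathers identical output from nodes, -w selects the target nodes
    ${CLUSH:-clush} -b -w "$nodes" /usr/sbin/nhc
}
```

It might then be called as `run_nhc 'node[01-04]'` with an explicit node set (the node names are made up), or as `run_nhc @sp:compute` if a Slurm partition group source with the alias `sp` is configured in clush; both the alias and the partition name `compute` are assumptions here.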