On 8/18/25 13:56, Gerhard Strangar via slurm-users wrote:
> John Hearns via slurm-users wrote:
>> I want to run a healthcheck job on all nodes.
> And using HealthCheckProgram in the slurm.conf would be too easy?
But the HealthCheckProgram=/usr/sbin/nhc is executed only when slurmd is started, and possibly when a new job is started.
I think John asked for a way to run NHC on a set of nodes whenever the system administrator wants to, and not at some arbitrary time, right? ClusterShell is the ideal tool for running such commands in parallel across the cluster.
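
As a rough sketch of what I mean, here is a small Python wrapper around ClusterShell's clush command. It assumes clush is installed on the admin host and that NHC lives at /usr/sbin/nhc as in the slurm.conf line above. The node range node[001-004] and the function names are made up for illustration, so substitute your own. The -w option selects the target nodes and -b gathers identical output from different nodes into one block.

```python
import shlex
import subprocess

NHC_PATH = "/usr/sbin/nhc"  # path taken from the HealthCheckProgram setting above


def build_clush_command(nodeset, program=NHC_PATH):
    """Return the argv list for running `program` on `nodeset` via clush.

    -b gathers identical output from different nodes,
    -w selects the target nodes (ClusterShell nodeset syntax).
    """
    return ["clush", "-b", "-w", nodeset, program]


def run_nhc(nodeset):
    """Run NHC on the given nodes and return clush's exit status.

    Requires clush on the PATH and SSH access to the nodes.
    """
    result = subprocess.run(build_clush_command(nodeset))
    return result.returncode


if __name__ == "__main__":
    # Hypothetical node range; only prints the command so it can be reviewed.
    print(shlex.join(build_clush_command("node[001-004]")))
```

Calling run_nhc("node[001-004]") would then actually run the check on those nodes, and of course you can just type the printed clush command directly in a shell instead.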
Best regards, Ole