I may have asked this already.
I want to run a healthcheck job on all nodes. I can select the nodes in a partition by hand, then write a bash script to get a list of nodes using nodeset -e, then submit to each node in the list using sbatch -w.
Is there a cleaner way of doing this?
John Hearns
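A minimal sketch of the per-node submission loop described above, assuming the node list is taken from sinfo rather than expanded with nodeset -e. The function names, the partition name "compute" and the script name "healthcheck.sh" are hypothetical placeholders, not anything from the original message.

```shell
#!/bin/sh
# Sketch only: submit one copy of a batch script to every node of a partition.
# The partition name and script path are supplied by the caller.

# Print each node of partition $1 on its own line, without duplicates.
list_partition_nodes() {
    sinfo --noheader --Node --partition="$1" --format="%N" | sort -u
}

# Submit batch script $2 once per node of partition $1.
submit_per_node() {
    list_partition_nodes "$1" | while read -r node; do
        [ -n "$node" ] || continue
        # --nodelist pins the job to exactly this node; stdin is redirected
        # so sbatch cannot consume the node list being read by the loop.
        sbatch --partition="$1" --nodelist="$node" \
               --job-name="healthcheck-$node" "$2" < /dev/null
    done
}

# Example (hypothetical names): submit_per_node compute healthcheck.sh
```

Because these are ordinary batch jobs, each one waits in the queue until its node can accept it, so a busy or down node will likely delay that node's check.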
Hi John,
Nice to hear from you again!
On 8/18/25 13:00, John Hearns via slurm-users wrote:
I want to run a healthcheck job on all nodes. I can select the nodes in a partition by hand, then write a bash script to get a list of nodes using nodeset -e, then submit to each node in the list using sbatch -w.
Is there a cleaner way of doing this?
IMHO the cleanest way is to use the great ClusterShell tool[1], where Slurm partitions and nodes can be configured as shown in the Wiki examples. For example, to run NHC on all nodes:
$ clush -ba nhc
Best regards, Ole
[1] https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_operations/#clustershell
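For readers who have not yet configured node groups in ClusterShell, here is a hedged sketch that asks Slurm for the node list and hands it to clush directly. The function names and the partition name "compute" are hypothetical; only sinfo and clush themselves come from the thread.

```shell
#!/bin/sh
# Sketch only: run a command in parallel on every node of a Slurm partition.

# Join the node ranges that sinfo reports for partition $1 into a single
# comma-separated node set (sinfo may print several lines, one per node state).
partition_nodeset() {
    sinfo --noheader --partition="$1" --format="%N" | paste -sd, -
}

# Usage: run_on_partition <partition> <command> [args...]
run_on_partition() {
    part="$1"
    shift
    nodes="$(partition_nodeset "$part")"
    if [ -z "$nodes" ]; then
        echo "no nodes found in partition $part" >&2
        return 1
    fi
    # -w selects the target nodes, -b gathers identical output together.
    clush -b -w "$nodes" "$@"
}

# Example (hypothetical partition name): run_on_partition compute nhc
```

The `clush -ba nhc` form above relies on ClusterShell's own group configuration to define what "all" means; this sketch avoids that dependency at the cost of calling sinfo on every run.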
On 8/18/25 13:56, Gerhard Strangar via slurm-users wrote:
John Hearns via slurm-users wrote:
I want to run a healthcheck job on all nodes.
And using HealthCheckProgram in the slurm.conf would be too easy?
But the HealthCheckProgram=/usr/sbin/nhc is executed only when slurmd is started, and possibly when a new job is started.
I think John asked for a way to run NHC on a set of nodes whenever desired by the system administrator, and not at any random time, right? ClusterShell is the ideal tool for running such parallel commands on the cluster.
Best regards, Ole
Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com> writes:
On 8/18/25 13:56, Gerhard Strangar via slurm-users wrote:
John Hearns via slurm-users wrote:
I want to run a healthcheck job on all nodes.
And using HealthCheckProgram in the slurm.conf would be too easy?
But the HealthCheckProgram=/usr/sbin/nhc is executed only when slurmd is started, and possibly when a new job is started.
That depends on HealthCheckInterval and HealthCheckNodeState. If HealthCheckInterval=N with N > 0, the HealthCheckProgram is run every N seconds, given that the node is in one of the HealthCheckNodeState states (default: any state).
I think John asked for a way to run NHC on a set of nodes whenever desired by the system administrator, and not at any random time, right? ClusterShell is the ideal tool for running such parallel commands on the cluster.
Yes, for running manually, setting up the Slurm groups in clush is the easiest way, IMO.
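The periodic mode described above comes down to three slurm.conf settings. The excerpt below is a sketch rather than a recommended configuration: the program path is the one quoted in the thread, and the 300-second interval is an arbitrary example value.

```
# slurm.conf excerpt (sketch; the interval is an example value)
HealthCheckProgram=/usr/sbin/nhc
# Run the program every 300 seconds; 0 (the default) disables periodic runs.
HealthCheckInterval=300
# Restrict by node state if desired; ANY is the default.
HealthCheckNodeState=ANY
```

After editing slurm.conf, the change normally has to be distributed to the nodes and picked up by the daemons (for example with scontrol reconfigure) before it takes effect.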
Thank you both. For interest, this is the health check:
https://github.com/amd/node-scraper/
On Mon, 18 Aug 2025 at 14:01, Bjørn-Helge Mevik via slurm-users <slurm-users@lists.schedmd.com> wrote:
-- Regards, Bjørn-Helge Mevik