Run a healthcheck job on all nodes


John Hearns

18 Aug 2025

4 a.m.

I may have asked this already.

I want to run a healthcheck job on all nodes. I can select the nodes in a partition by hand, then write a bash script to get a list of nodes using nodeset -e, then submit to each node in the list using sbatch -w.

Is there a cleaner way of doing this?

John Hearns
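For reference, the per-node approach described above could be sketched roughly as follows. This is only an illustration, not John's actual script: the function name `submit_healthchecks`, the `DRY_RUN` variable, and the script path are hypothetical, and the node list is assumed to arrive one name per line on stdin.

```shell
# Sketch only: read node names (one per line) from stdin and submit the
# health-check script to each node with `sbatch -w`.
# With DRY_RUN=1 (the default here) the commands are printed, not run.
submit_healthchecks() {
    script="$1"   # hypothetical path to the health-check job script
    while IFS= read -r node; do
        [ -n "$node" ] || continue          # skip empty lines
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sbatch -N1 -w $node $script"
        else
            # </dev/null keeps sbatch from consuming the node list on stdin
            sbatch -N1 -w "$node" "$script" </dev/null
        fi
    done
}
```

Since nodeset -e prints the expanded names separated by spaces, they would need to be split onto separate lines first (node names here are made up):

$ nodeset -e 'node[01-04]' | tr ' ' '\n' | submit_healthchecks ./healthcheck.sh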


Ole Holm Nielsen

18 Aug

4:33 a.m.

Hi John,

Nice to hear from you again!

On 8/18/25 13:00, John Hearns via slurm-users wrote:

...

I want to run a healthcheck job on all nodes. I can select the nodes in a partition by hand, then write a bash script to get a list of nodes using nodeset -e, then submit to each node in the list using sbatch -w.

Is there a cleaner way of doing this?

IMHO the cleanest way is to use the great ClusterShell tool[1], where Slurm partitions and nodes can be configured as shown in the Wiki examples. For example, to run NHC on all nodes:

$ clush -ba nhc

Best regards, Ole

[1] https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_operations/#clustershell
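In the command above, -b gathers identical output from different nodes and -a targets all configured nodes. To restrict the run to one node group instead, clush also accepts -g. A minimal wrapper might look like the sketch below; the function name `run_on_group`, the `CLUSH` override, and the group name in the usage line are hypothetical, and the group must already be defined in the local ClusterShell group configuration.

```shell
# Sketch only: run a command on every node of a ClusterShell group and
# gather identical output (-b). Set CLUSH=echo to preview the arguments
# instead of contacting any node.
run_on_group() {
    group="$1"   # ClusterShell group name (site-specific)
    shift
    ${CLUSH:-clush} -b -g "$group" "$@"
}
```

For example, with a group named compute:

$ run_on_group compute nhc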

Gerhard Strangar

4:56 a.m.

John Hearns via slurm-users wrote:

...

I want to run a healthcheck job on all nodes.

And using HealthCheckProgram in the slurm.conf would be too easy?

Ole Holm Nielsen

5:42 a.m.

On 8/18/25 13:56, Gerhard Strangar via slurm-users wrote:

...

John Hearns via slurm-users wrote:

...
I want to run a healthcheck job on all nodes.

And using HealthCheckProgram in the slurm.conf would be too easy?

But the HealthCheckProgram=/usr/sbin/nhc is executed only when slurmd is started, and possibly when a new job is started.

I think John asked for a way to run NHC on a set of nodes whenever desired by the system administrator, and not at any random time, right? ClusterShell is the ideal tool for running such parallel commands on the cluster.

Best regards, Ole

Bjørn-Helge Mevik

5:57 a.m.

Ole Holm Nielsen via slurm-users slurm-users@lists.schedmd.com writes:

...

On 8/18/25 13:56, Gerhard Strangar via slurm-users wrote:

...
John Hearns via slurm-users wrote:

...
I want to run a healthcheck job on all nodes.

And using HealthCheckProgram in the slurm.conf would be too easy?

But the HealthCheckProgram=/usr/sbin/nhc is executed only when slurmd is started, and possibly when a new job is started.

That depends on HealthCheckInterval and HealthCheckNodeState. If HealthCheckInterval=N with N > 0, the HealthCheckProgram is run every N seconds, given that the node is in one of the HealthCheckNodeState states (default: any state).

...

I think John asked for a way to run NHC on a set of nodes whenever desired by the system administrator, and not at any random time, right? ClusterShell is the ideal tool for running such parallel commands on the cluster.

Yes, for running manually, setting up the Slurm groups in clush is the easiest way, IMO.

-- Regards, Bjørn-Helge Mevik
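To see which of the settings mentioned above are active on a given cluster, the running configuration can be filtered for the HealthCheck parameters. The helper below is only a sketch (the function name `healthcheck_settings` is hypothetical); it assumes the `Name = value` line layout that `scontrol show config` prints. The commented slurm.conf lines use an illustrative interval of 300 seconds, not a recommendation.

```shell
# Sketch only: keep the HealthCheck* lines from configuration text on stdin.
#
# Illustrative slurm.conf entries for periodic checks (values are examples):
#   HealthCheckProgram=/usr/sbin/nhc
#   HealthCheckInterval=300
#   HealthCheckNodeState=ANY
healthcheck_settings() {
    grep -i '^HealthCheck'
}
```

Typical use:

$ scontrol show config | healthcheck_settings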

John Hearns

6:14 a.m.

Thank you both. For interest, this is the health check:

https://github.com/amd/node-scraper/

