Node Health Check Program

Paul Edmon

19 Aug 2025

4:20 p.m.

We've been using NHC (https://github.com/mej/nhc) for years with much success. However that project hasn't had a release in 2 years and the various Issues filed indicate that there might be problems with Rocky 9 (which we are looking to upgrade to). Do people that are at EL9 use NHC? Is there a fork? Is there a different code that people use for doing node health checks?

-Paul Edmon-

Valerio Bellizzomi

19 Aug

4:52 p.m.

I guess that checking for open ports with nmap should be sufficient to tell that the daemons are up and responding:

nmap -p <port> <ip-address range>
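As a concrete sketch of that idea for Slurm compute nodes, the snippet below wraps the nmap call in a small function. It assumes slurmd listens on its default port 6818 (SlurmdPort in slurm.conf); the function name scan_slurmd, the SLURMD_PORT override, and the example address range are hypothetical. An open port only shows that the daemon is listening, so this is a reachability check rather than a full node health check.

```shell
#!/bin/sh
# Sketch: list nodes on which the slurmd port answers.
# Assumption: slurmd uses its default port 6818 (SlurmdPort in slurm.conf);
# set SLURMD_PORT to override. scan_slurmd is a hypothetical helper name.
scan_slurmd() {
    # $1: address range in any form nmap accepts, e.g. 192.0.2.0/28
    # --open lists only hosts where the port is open;
    # -oG - writes grepable output to stdout.
    nmap -p "${SLURMD_PORT:-6818}" --open -oG - "$1"
}
```

Calling it as `scan_slurmd 192.0.2.0/28` (a placeholder range) would scan that range for listening slurmd daemons.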

John Hearns

5:05 p.m.

I haven't run this as a Node Health Check script. However, it does run under RHEL 9:

https://github.com/amd/node-scraper

-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com

Valerio Bellizzomi

5:24 p.m.

node-scraper is a Python script, so it should run on any system.

Otto, Frank

5:45 p.m.

Hi Paul,

the dev branch of NHC is more up to date (though also 7 months stale now) and we are running this on RHEL9.6 with Slurm 24.11. Admittedly, we haven't been running it very long yet, so there might be issues we just haven't encountered yet, but in general it seems to work.

Kind regards, Frank

-- Dr. Frank Otto Principal Research Infrastructure Developer UCL Advanced Research Computing Centre Tel: 020 7679 1506

Paul Edmon

6:06 p.m.

Great. I will check that out.

-Paul Edmon-

Jennings, Michael E

9:25 p.m.

New subject: [EXTERNAL] Node Health Check Program

Hi Paul!

Have you by chance given the `dev` branch a try? All our production servers currently run `lbnl-nhc-1.5-0.82.gf8dc.el8.noarch` built from the `dev` branch, have been for some time now, and it's been rock solid. Our RHEL-based clusters also use this version. Our HPE/Cray Shasta clusters, including our largest (classified) clusters Crossroads, Tycho, and Venado, use a variant. (Long story short, I've merged in all my changes into a separate branch, but the reverse is not yet true.) This variant is, at present, COS/SLES-specific, but it has quite a few useful additional checks (many of them Cray-centric) contributed by other LANL folks that I haven't had a chance to upstream yet.

Well, to be fair, that's not exactly true. I could just add them in en masse — no tidying, no unit tests, no code reviews — and believe me, I've come close to doing it several times! If enough folks out there would find it useful at their site(s), I'm sure I could be persuaded. :-)

For better or worse, RHEL9 was only recently approved by the security folks here, so very few servers and clusters are running it (or derivatives like Alma or Rocky). I do have a RHEL9-based VM running the same version I noted above (for servers); the only problem I ran into so far is updating the `sshd` service check due to the fact that, with OpenSSH 8.7, even the primary daemon process (the listener) rewrites its `argv[]` data. Here's what I'm using that works:

* || check_ps_service -VS -u root -fm 'sshd: /*/sshd* -D*' sshd
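To see why the match string has to change, here is a small sketch of the glob comparison in plain shell. The two process titles are assumptions for illustration (the second is modelled on what an OpenSSH 8.7 listener typically reports after rewriting its argv), and title_matches is a hypothetical helper, not NHC's own matching code.

```shell
#!/bin/sh
# title_matches TITLE PATTERN: succeed if TITLE matches the shell glob PATTERN.
title_matches() {
    # The unquoted $2 is deliberately treated as a pattern by 'case'.
    case $1 in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

# Match string taken from the check_ps_service line above.
pattern='sshd: /*/sshd* -D*'

# Assumed titles: a listener that keeps its original argv, and one that
# has rewritten it.
for title in '/usr/sbin/sshd -D' \
             'sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups'; do
    if title_matches "$title" "$pattern"; then
        echo "match:    $title"
    else
        echo "no match: $title"
    fi
done
```

The unrewritten title fails the glob because it lacks the "sshd: " prefix, while the rewritten one matches.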

I don't see any other mention of RHEL9-centric issue reports; did I miss one?

In any event, the project isn't dead, I swear! And for what it's worth, it won't be going away any time soon; LANL HPC (independently of myself) evaluated all the available options at the time, on at least 3 separate occasions, and consistently found `lbnl-nhc` to be the best choice. Since that time, it's been deployed on almost all services hosts (Quay servers, GitLab servers, OpenShift prime and worker nodes, our virtualization cluster, and numerous others) as well as all production clusters and supporting infrastructure. NHC feeds its results into Splunk (hence the recently added JSON support), and we also use things like telegraf and LDMS, but in terms of situational awareness at the OS, scheduler, and cluster hardware levels, NHC is everywhere, and we've invested quite a bit into it in terms of time, training, and ancillary efforts (like our NHC Ansible role).

I'm not able to spend 90+% of my time on NHC right now, as I was briefly able to do last year, but it is still being developed and deployed at scale.

As far as forks go, the only thing I'm aware of in that vein is work that comes from the great team over at the University of Ghent in Belgium. Their tree (github.com/hpcugent/nhc) was still undergoing development while I was stuck in legal limbo with Feynman, but I haven't checked recently to see if they've merged any of the recent features (and some pretty significant bugfixes, primarily around process management).

Hope that helps! Michael

-- Michael E. Jennings (he/him) mej@lanl.gov https://hpc.lanl.gov/ HPC Platform Integration Engineer - Platforms Design Team - HPC Design Group Ultra-Scale Research Center (USRC), 4200 W Jemez #301-25 +1 (505) 412-4151 Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545-0001

Paul Edmon

9:37 p.m.

Thanks, that explains why no EL9 release has appeared yet. I tried out the dev branch and it works great.

NHC has been awesome to use (we've been using it for years). Thanks for maintaining it!

-Paul Edmon-

Ole Holm Nielsen

22 Aug

1:17 p.m.

Following Michael's recommendation, I wanted to try out version 1.5 of NHC from the 'dev' branch and build the RPM package that Michael referred to.

Since I'm not a software developer, I had to figure out for myself the detailed building steps - perhaps trivial to some of you, and stumbling blocks to others. This is what I came up with:

$ git clone https://github.com/mej/nhc.git
$ cd nhc
$ git switch dev                  # Switch to the 'dev' branch
$ git status                      # Check the status
$ grep nhc_version configure.ac   # Verify the 'dev' version
m4_define([nhc_version], [1.5])
$ ./autogen.sh                    # Undocumented build requirement
$ cd ..
$ mv nhc lbnl-nhc-1.5             # Rename the source folder
$ tar czf lbnl-nhc-1.5.tar.gz lbnl-nhc-1.5
$ rpmbuild -ta lbnl-nhc-1.5.tar.gz

The resulting RPM package is:

~/rpmbuild/RPMS/noarch/lbnl-nhc-1.5-0.82.gf8dc.el8.noarch.rpm

I've added those steps to my Slurm Wiki page: https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#node-health-check

Any comments?

Thanks, Ole

Otto, Frank

2:20 p.m.

Hi Ole,

this looks similar to what I've been doing for building RPMs. (It's documented for our in-house branch at [1], if anyone wants to compare.) Happy to see I'm not doing something totally stupid. :)

[1] https://github.com/UCL-ARC/nhc/blob/ucl/README.md

Thanks, Frank

-- Dr. Frank Otto Principal Research Infrastructure Developer Advanced Research Computing Centre University College London, UK

Jennings, Michael E

25 Aug

9:50 p.m.

Hi guys!

NHC builds like any other GNU Autotools-based package: ./autogen.sh <configure-args> && make dist

That's all you need to do to generate the correct tarball. From there, rpmbuild -ta is one option. I use Mezzanine tools, so I just run mzbuild. So whenever I go to build new RPMs for the production teams, all I have to do is "./autogen.sh && make dist && mzbuild"[1], or the (mostly) equivalent "./autogen.sh && make dist && rpmbuild -ta lbnl-nhc-1.5.tar.gz", either of which spits out the RPM and SRPM for me.

Hope this helps! Michael

[1] Technically, this is a lie. What I actually run is this: "./autogen.sh && make distcheck && zbuild". The "check" part ensures that everything is set up correctly for out-of-tree builds, cross compiling, etc. But that's something only I really need to worry about. The zbuild command is actually from yet another project; it allows me to build for multiple distributions at once by leveraging containers. But again, not something the average person would care about.
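Putting the steps from this thread together, a minimal sketch of the whole build might look like the following. The run wrapper, the DRY_RUN variable, the build_nhc_rpm name, and the default work directory are hypothetical additions for illustration; the commands themselves are the ones quoted in this thread, and the tarball name assumes version 1.5 as reported by configure.ac.

```shell
#!/bin/sh
# Sketch: build an NHC RPM from the 'dev' branch.
# Set DRY_RUN=1 to print the commands instead of executing them.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_nhc_rpm() {
    # The work directory default is a hypothetical choice; pass another
    # directory as the first argument to override it.
    workdir=${1:-"$HOME/nhc-build"}
    run mkdir -p "$workdir" &&
    run cd "$workdir" &&
    run git clone https://github.com/mej/nhc.git &&
    run cd nhc &&
    run git switch dev &&      # the 'dev' branch carries version 1.5
    run ./autogen.sh &&        # generates configure and the Makefiles
    run make dist &&           # produces the release tarball
    run rpmbuild -ta lbnl-nhc-1.5.tar.gz
}
```

Running `DRY_RUN=1; build_nhc_rpm` first prints the command sequence for review; unsetting DRY_RUN and calling the function again performs the build and stops at the first step that fails.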

Ole Holm Nielsen

26 Aug

11:49 a.m.

Hi Michael,

Thanks a lot for the hints for building an NHC RPM. I confirm that this works for me.

As I wrote earlier, "perhaps trivial to some of you, and stumbling blocks to others" :-) Could you kindly add your instructions to https://github.com/mej/nhc?tab=readme-ov-file#installation ?

IMHO, the commands "git clone ... ; git switch dev" ought to be documented as well, again for the benefit of those of us not well versed in Git and GNU Autotools.

Best regards, Ole

Timony, Mick

2:13 p.m.

Hi,

I found Michael's presentation useful for helping me build the dev branch. There are brief instructions on page 15, which I modified to use the dev branch. I am willing to help test a new version, time permitting.

https://hpckp.org/wp-content/uploads/2022/10/13-HPCKP6-Michael-Jennings.pdf

Kind regards,

-- Mick Timony Senior DevOps Engineer LASER, Longwood, & O2 Cluster Admin Harvard Medical School

Hi Michael,

Thanks a lot for the hints for building an NHC RPM. I confirm that this works for me.

As I wrote below "perhaps trivial to some of you, and stumbling blocks to others" :-) Could you kindly add your instructions to https://github.com/mej/nhc?tab=readme-ov-file#installation ?

IMHO, the below commands "git clone ... ; git switch dev" ought to be documented as well - again for the benefit of those of us not well versed in Git and GNU Autotools.

Best regards, Ole

On 8/25/25 21:50, Jennings, Michael E via slurm-users wrote:

...

Hi guys!

NHC builds like any other GNU Autotools-based package: |./autogen.sh <configure-args> && make dist|

That's all you need to do to generate the correct tarball. From there, | rpmbuild -ta |is one option. I use Mezzanine tools, so I just run mzbuild. So whenever I go to build new RPMs for the production teams, all I have to do is "|./autogen.sh && make dist && mzbuild"||[1], or the (mostly) equivalent "||./autogen.sh && make dist && rpmbuild -ta lbnl- nhc-1.5.tar.gz||", either of which spits out the RPM and SRPM for me.|

|Hope this helps!| |Michael|

|1: Technically, this is a lie. What I |*|actually|* run is this: "|./ autogen.sh && make distcheck && zbuild|" The "check" part ensures that everything is set up correctly for out-of-tree builds, cross compiling, etc. But that's something only I really need to worry about. The | zbuild| command is actually from yet another project; it allows me to build for multiple distributions at once by leveraging containers. But again, not something the average person would care about.

-- Michael E. Jennings (he/him) mej@lanl.gov https://hpc.lanl.gov/ HPC Platform Integration Engineer - Platforms Design Team - HPC Design Group Ultra-Scale Research Center (USRC), 4200 W Jemez #301-25 +1 (505) 412-4151 Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545-0001

From: Otto, Frank via slurm-users <slurm-users@lists.schedmd.com>
Sent: Friday, August 22, 2025 06:20
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: [EXTERNAL] Node Health Check Program

Hi Ole,

this looks similar to what I've been doing for building RPMs. (It's documented for our in-house branch at [1], if anyone wants to compare.) Happy to see I'm not doing something totally stupid. :)

[1] https://github.com/UCL-ARC/nhc/blob/ucl/README.md

Thanks, Frank

-- Dr. Frank Otto Principal Research Infrastructure Developer Advanced Research Computing Centre University College London, UK

From: Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com>
Sent: 22 August 2025 12:17
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: [EXTERNAL] Node Health Check Program

On 8/19/25 21:25, Jennings, Michael E via slurm-users wrote:

...
Have you by chance given the `dev` branch a try? All our production servers currently run `lbnl-nhc-1.5-0.82.gf8dc.el8.noarch` built from the `dev` branch, have been for some time now, and it's been rock solid. Our RHEL-based clusters also use this version. Our HPE/Cray Shasta clusters, including our largest (classified) clusters Crossroads, Tycho, and Venado, use a variant. (Long story short, I've merged in all my changes into a separate branch, but the reverse is not yet true.) This variant is, at present, COS/SLES-specific, but it has quite a few useful additional checks (many of them Cray-centric) contributed by other LANL folks that I haven't had a chance to upstream yet.

Due to Michael's recommendation I wanted to try out the 'dev' branch version 1.5 of NHC and build an RPM package referred to by Michael.

Since I'm not a software developer, I had to figure out for myself the detailed building steps - perhaps trivial to some of you, and stumbling blocks to others. This is what I came up with:

$ git clone https://github.com/mej/nhc.git
$ cd nhc
$ git switch dev # Switch to the 'dev' branch
$ git status # Check the status
$ grep nhc_version configure.ac # Verify the 'dev' version
m4_define([nhc_version], [1.5])
$ ./autogen.sh # Undocumented build requirement
$ cd ..
$ mv nhc lbnl-nhc-1.5 # Rename the source folder
$ tar czf lbnl-nhc-1.5.tar.gz lbnl-nhc-1.5
$ rpmbuild -ta lbnl-nhc-1.5.tar.gz

The resulting RPM package is:

~/rpmbuild/RPMS/noarch/lbnl-nhc-1.5-0.82.gf8dc.el8.noarch.rpm
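The rename step in the recipe suggests that the tarball's top-level directory name matters to the RPM build, although that is an inference from the steps rather than something stated in the thread. As a rough sketch, the check below confirms that every entry in a tarball sits under a single expected directory. `check_tarball_prefix` is a hypothetical helper, and the demo uses a throwaway directory rather than a real NHC checkout.

```shell
#!/bin/bash
# Sketch: verify that all entries in a tarball live under one expected
# top-level directory. Helper name and demo contents are illustrative.
set -eu

check_tarball_prefix() {
    local tarball="$1" prefix="$2"
    local tops
    # List entries, strip a leading "./", keep the first path component, de-duplicate.
    tops=$(tar tzf "$tarball" | sed -e 's|^\./||' -e 's|/.*||' | sort -u)
    [ "$tops" = "$prefix" ]
}

# Self-contained demo with a placeholder tree instead of the NHC sources.
workdir=$(mktemp -d)
mkdir -p "$workdir/lbnl-nhc-1.5"
echo "placeholder" > "$workdir/lbnl-nhc-1.5/README"
tar czf "$workdir/lbnl-nhc-1.5.tar.gz" -C "$workdir" lbnl-nhc-1.5

if check_tarball_prefix "$workdir/lbnl-nhc-1.5.tar.gz" lbnl-nhc-1.5; then
    result="ok"
else
    result="mismatch"
fi
echo "top-level directory check: $result"
rm -rf "$workdir"
```

Against a real build, pass the tarball you are about to hand to `rpmbuild -ta` and the directory name you expect.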

I've added those steps to my Slurm Wiki page: https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#node-health-check

Any comments?


Ole Holm Nielsen

4:02 p.m.

New subject: [EXTERNAL] Node Health Check Program

Hi Mick,

Thanks for the info. I already discovered below how to build from the dev branch. I hope that Michael can add some build information to the NHC page https://github.com/mej/nhc?tab=readme-ov-file#installation

/Ole

On 8/26/25 14:13, Timony, Mick via slurm-users wrote:

...

Hi,

I found Michael's presentation useful for helping me build the dev branch. There are brief instructions on page 15, which I modified to use the dev branch. I am willing to help test a new version, time permitting.

https://hpckp.org/wp-content/uploads/2022/10/13-HPCKP6-Michael-Jennings.pdf

Kind regards,

-- Mick Timony Senior DevOps Engineer LASER, Longwood, & O2 Cluster Admin Harvard Medical School


Ole Holm Nielsen

27 Aug 27 Aug

1:52 p.m.

New subject: [EXTERNAL] Node Health Check Program

Related to the NHC "dev" branch (version 1.5) I've been looking at the issue https://github.com/mej/nhc/issues/165 which requires a workaround in nhc.conf where you have to explicitly define the Slurm resource manager by:

* || export NHC_RM=slurm

As it turns out, the nhc script correctly autodetects NHC_RM=slurm but fails to export this variable, so it's not available to the ONLINE_NODE/OFFLINE_NODE scripts.

The fix in the PR https://github.com/mej/nhc/pull/168 adds the required export statement to nhc. I hope this can get merged into the NHC "dev" branch sometime soon.
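The behaviour described here is ordinary shell semantics: a variable that is set but not exported is not inherited by child processes. The snippet below only illustrates that general rule. It is not code from NHC or from the PR, and the child `bash -c` call merely stands in for a helper script such as node-mark-online.

```shell
#!/bin/bash
# Illustration of shell export semantics only; this is not NHC code.
unset NHC_RM                 # start from a clean slate

NHC_RM=slurm                 # plain assignment: visible in this shell only
without_export=$(bash -c 'echo "NHC_RM=[${NHC_RM:-}]"')

export NHC_RM                # exported: inherited by child processes
with_export=$(bash -c 'echo "NHC_RM=[${NHC_RM:-}]"')

echo "child without export: $without_export"
echo "child with export:    $with_export"
```

The first child sees an empty value, which matches the empty `""` in the "Unsupported RM detected" error reported earlier in the thread; the second sees `slurm`.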

Best regards, Ole

On 8/25/25 21:50, Jennings, Michael E via slurm-users wrote:

...

NHC builds like any other GNU Autotools-based package: `./autogen.sh <configure-args> && make dist`

That's all you need to do to generate the correct tarball. From there, `rpmbuild -ta` is one option. I use Mezzanine tools, so I just run `mzbuild`. So whenever I go to build new RPMs for the production teams, all I have to do is `./autogen.sh && make dist && mzbuild` [1], or the (mostly) equivalent `./autogen.sh && make dist && rpmbuild -ta lbnl-nhc-1.5.tar.gz`, either of which spits out the RPM and SRPM for me.

Ole Holm Nielsen

22 Aug 22 Aug

2:13 p.m.

New subject: [EXTERNAL] Node Health Check Program

On 8/19/25 21:25, Jennings, Michael E via slurm-users wrote:

...

Have you by chance given the `dev` branch a try? All our production servers currently run `lbnl-nhc-1.5-0.82.gf8dc.el8.noarch` built from the `dev` branch, have been for some time now, and it's been rock solid. Our RHEL-based clusters also use this version. Our HPE/Cray Shasta clusters, including our largest (classified) clusters Crossroads, Tycho, and Venado, use a variant. (Long story short, I've merged in all my changes into a separate branch, but the reverse is not yet true.) This variant is, at present, COS/SLES-specific, but it has quite a few useful additional checks (many of them Cray-centric) contributed by other LANL folks that I haven't had a chance to upstream yet.

I've built and tested the lbnl-nhc-1.5-0.82.gf8dc.el8.noarch RPM which is built as described in my previous mail. Unfortunately LBNL NHC version 1.5-0.82.gf8dc fails to recognize NHC_RM=slurm in nhc.conf. We get this error:

/usr/libexec/nhc/node-mark-online: Unsupported RM detected in /usr/libexec/nhc/node-mark-online: ""

I created a new issue https://github.com/mej/nhc/issues/165

We need this issue to be fixed before we can deploy NHC 1.5 :-(

Any suggestions?

Thanks, Ole

Ryan Novosielski

23 Aug 23 Aug

4:47 a.m.

New subject: [EXTERNAL] Node Health Check Program

On Aug 22, 2025, at 08:13, Ole Holm Nielsen via slurm-users slurm-users@lists.schedmd.com wrote:

/usr/libexec/nhc/node-mark-online: Unsupported RM detected in /usr/libexec/nhc/node-mark-online: ""

I created a new issue https://github.com/mej/nhc/issues/165

We need this issue to be fixed before we can deploy NHC 1.5 :-(

Any suggestions?

Ole, I don’t know if this is a change from the 1.4.2 that we run, but we don’t have NHC_RM defined in our config. There’s an example of "NHC_RM=pbs", but we have it commented out. This rings the faintest of bells, so you might check the documentation and see whether it’s only required for certain resource managers, or whether the thing assumes Slurm to begin with?
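A quick way to see whether a config has an active (uncommented) NHC_RM setting is to filter out comment lines first. This is only a sketch: `find_nhc_rm` is a hypothetical helper, and the two sample lines are invented for the demo, written in the `match || check` form that NHC config files use. Point the function at your real nhc.conf instead of the temporary file.

```shell
#!/bin/bash
# Sketch: report uncommented lines in an NHC config that set NHC_RM.
# Helper name and sample config lines are illustrative only.
set -eu

find_nhc_rm() {
    # Drop lines whose first non-blank character is '#', then look for NHC_RM=.
    grep -v '^[[:space:]]*#' "$1" | grep 'NHC_RM=' || true
}

# Demo on a throwaway file with one commented and one active setting.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# * || export NHC_RM=pbs
* || export NHC_RM=slurm
EOF

active=$(find_nhc_rm "$conf")
echo "active NHC_RM settings: ${active:-none}"
rm -f "$conf"
```

An empty result means the config leaves the resource manager to autodetection.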

-- #BlackLivesMatter
 ____
|| \UTGERS,     |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski (he/him) - novosirj@rutgers.edu
|| \ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \ of NJ      | Office of Advanced Research Computing - MSB A555B, Newark
     `'

slurm-users@lists.schedmd.com

16 comments

8 participants

participants (8)

Jennings, Michael E
John Hearns
Ole Holm Nielsen
Otto, Frank
Paul Edmon
Ryan Novosielski
Timony, Mick
Valerio Bellizzomi