[slurm-users] NHC and slurm
mej at lanl.gov
Wed Apr 21 01:17:57 UTC 2021
On Thursday, 15 April 2021, at 10:58:31 (-0300),
> I'm trying to setup NHC for our Slurm cluster, but I'm not
> getting it to work properly.
Just for future reference, NHC has its own mailing lists, and even
though your question does relate to Slurm tangentially, it's really an
NHC question, not a Slurm question. :-)
So I've set the Reply-To header to redirect to "nhc at lbl.gov" instead.
It's an open list, but I would still encourage you to consider
subscribing at https://groups.google.com/a/lbl.gov/g/nhc
> $ sudo nhc
> ERROR: nhc: Health check failed: check_ps_service: Service sshd (process sshd) owned by root not running
> I know sshd is running because I logged in this machine with ssh.
By itself, this doesn't guarantee that there is still an sshd running
as root. When you connect, the main root-owned sshd process forks off
a separate sshd which is owned by you. It's entirely possible for the
root-owned sshd to exit or crash without impacting existing SSH
sessions. Just to be pedantic. ;-)
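To illustrate the distinction, here is a minimal sketch (not NHC's actual code) of checking for a process by both name and owner. The helper name has_proc_owned_by and the sample data are hypothetical, made up purely for illustration:

```shell
#!/bin/sh
# Hypothetical helper (not part of NHC): reads "USER COMMAND" pairs on stdin
# and succeeds only if some line has the given owner and command name.
has_proc_owned_by() {
    awk -v owner="$1" -v cmd="$2" '
        $1 == owner && $2 == cmd { found = 1 }
        END { exit !found }
    '
}

# Made-up sample: the root-owned listener is gone, but alice's session survives.
sample='alice sshd
alice bash'

if printf '%s\n' "$sample" | has_proc_owned_by root sshd; then
    echo "root-owned sshd is running"
else
    echo "no root-owned sshd found"
fi
```

On a live system, the same helper could be fed from "ps -e -o user= -o comm=" instead of the sample text.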
> And `systemctl status sshd` shows it is active.
Now *that* is a horse of a different color! ;-) So clearly you're
correct; sshd is definitely running.
> Here's a sample of my nhc.conf:
> * || check_ps_service munged
> * || check_ps_service -u root sshd
> * || check_ps_service -u root ssh
> * || check_ps_service ssh
> * || check_ps_service sshd
Only the first 2 lines are correct. The 3rd and 4th lines would look
for "ssh" processes instead of "sshd" processes, and the 5th one would
misinterpret user-owned sshd processes as the main listening sshd
that's owned by root. Not good. ;-) Your config should have:
    * || check_ps_service munged
    * || check_ps_service -u root sshd
You can also add "-S" to each of those checks if you'd like NHC to
attempt to start the service for you automatically if it's found to
not be running. First, though, we need to figure out why the 2nd
check isn't exhibiting the desired behavior!
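For reference, with "-S" added, the two checks would look something like this (a sketch only; placing "-S" ahead of "-u" is an assumption, so verify it against the check_ps_service documentation for your NHC version):

```
 * || check_ps_service -S munged
 * || check_ps_service -S -u root sshd
```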
> If I run `sudo nhc -a` to run all the tests, it gives 4 errors about
> NHC can find munge running, so what's the problem with ssh? What am I
Well, if systemd reports the service as being active, it's definitely
running. So the check should pass unless there's something weird
The next step I'd recommend is to run NHC in either Debug Mode (via -d) or
Trace Mode (via -x); either of those 2 options will show you
everything NHC is receiving back from the "ps" command it runs to
gather process data. In fact, when *I'm* troubleshooting a check,
I'll generally use *both*, and I also use "-e" so that I don't have to
wade through all the other stuff in the config. :-) So I'd do this:
    nhc -x -d -l - -e 'check_ps_service -u root sshd'
Or if you prefer, you can send the output to a file by changing the
"-l -" to "-l <file>" instead. That output will show you all the
lines of "ps" output NHC is parsing through and should help to
determine what's going awry.
I should also note that I have never personally run NHC on Debian or
Ubuntu, so it's possible there's a bug lurking somewhere that I just
haven't run across yet....
Hope that helps; let me know what you find (over on nhc at lbl.gov)! :-)
Michael E. Jennings <mej at lanl.gov> - [PGPH: he/him/his/Mr] -- hpc.lanl.gov
HPC Systems Engineer -- Platforms Team -- HPC Systems Group (HPC-SYS)
Strategic Computing Complex, Bldg. 03-2327, Rm. 2341 W: +1 (505) 606-0605
Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545-0001