[slurm-users] node health check
Brian Johanson
bjohanso at psc.edu
Tue Jan 31 14:38:10 UTC 2023
On 1/30/23 10:35 PM, Ratnasamy, Fritz wrote:
> Hi,
>
> Currently, some of our nodes are overloaded. The installed nhc used
> to check the load and drain a node when it was overloaded. However,
> for the past few days, it has not been showing the state of the node.
> When I run /usr/sbin/nhc manually, it says
> 20230130 21:25:14 [slurm] /usr/libexec/nhc/node-mark-online mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online: Not sure how to handle node state "" on mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online: Skipping node mcn26.chicagobooth.edu ( )
>
> It seems that it is not able to read the state of the node. I ran
> scontrol show node mcn26
> NodeName=mcn26 Arch=x86_64 CoresPerSocket=16
> NodeAddr=mcn26 NodeHostName=mcn26 Version=20.11.8
>
> Any idea what happened and why nhc is not reading the state of the
> node anymore?
nhc is using the FQDN and Slurm isn't (NodeHostName=mcn26), so the
query for the node state is failing.
We have the line 'export HOSTNAME=$(hostname -s)' in /etc/sysconfig/nhc
so that nhc uses the short hostname.
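
A minimal sketch of the name mismatch, assuming nhc looks the node up
under whatever HOSTNAME holds. short_name is a hypothetical helper for
illustration only; it is not part of nhc or Slurm.

```shell
#!/bin/sh
# Hypothetical helper: drop everything from the first dot onward so the
# name matches the short NodeName that Slurm knows (e.g. mcn26).
short_name() {
    printf '%s\n' "${1%%.*}"
}

# Example with the node from this thread:
short_name mcn26.chicagobooth.edu   # prints mcn26

# On a live node, the equivalent check would be something like
# (requires the Slurm client tools, so it is left commented out here):
#   scontrol show node "$(hostname -s)" | grep -o 'State=[^ ]*'
```

If the commented scontrol query returns a State= value for the short
name, setting HOSTNAME to that short name in /etc/sysconfig/nhc should
let node-mark-online read the state.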
-b