[slurm-users] node health check

Brian Johanson bjohanso at psc.edu
Tue Jan 31 14:38:10 UTC 2023


On 1/30/23 10:35 PM, Ratnasamy, Fritz wrote:
> Hi,
>
> Currently, some of our nodes are overloaded. The installed nhc used
> to check the load and drain a node when it was overloaded. However,
> for the past few days, it has not been showing the state of the node.
> When I run /usr/sbin/nhc manually, it says:
> 20230130 21:25:14 [slurm] /usr/libexec/nhc/node-mark-online mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online:  Not sure how to handle node state "" on mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online:  Skipping  node mcn26.chicagobooth.edu ( )
>
> It seems that it is not able to read the state of the node. I ran
> scontrol show node mcn26:
> NodeName=mcn26 Arch=x86_64 CoresPerSocket=16
>    NodeAddr=mcn26 NodeHostName=mcn26 Version=20.11.8
>
> Any idea what happened and why nhc is not reading the state of the 
> node anymore?


nhc is using the FQDN (mcn26.chicagobooth.edu), but slurm knows the node
by its short name (NodeHostName=mcn26), so the state query is failing.

We have the line 'export HOSTNAME=$(hostname -s)' in /etc/sysconfig/nhc
so that nhc uses the short hostname.
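
For illustration, here is a rough sketch of the name mismatch, assuming the
node's short name is simply the FQDN up to the first dot (which is what
`hostname -s` prints). The helper name short_hostname is made up for this
example and is not part of nhc or Slurm:

```shell
# Hypothetical helper, for illustration only (not part of nhc or Slurm).
# Strips everything from the first dot onward, giving the same short form
# that `hostname -s` prints on the node itself.
short_hostname() {
    printf '%s\n' "${1%%.*}"
}

fqdn="mcn26.chicagobooth.edu"        # name nhc passed to node-mark-online in the log
short="$(short_hostname "$fqdn")"    # name slurm uses (NodeHostName=mcn26)

echo "nhc asked about: $fqdn"
echo "slurm knows:     $short"

# On the cluster, query slurm with the short name:
#   scontrol show node "$short"
```

Setting HOSTNAME in /etc/sysconfig/nhc does the same thing at the source, so
nhc's helper scripts never see the FQDN in the first place.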


-b
