[slurm-users] node health check

Ratnasamy, Fritz fritz.ratnasamy at chicagobooth.edu
Tue Jan 31 03:35:55 UTC 2023


Hi,

Some of our nodes are currently overloaded. The installed nhc used to check
the load and drain a node when it was overloaded. For the past few days,
however, it has not been showing the state of the node. When I run
/usr/sbin/nhc manually, it prints:
20230130 21:25:14 [slurm] /usr/libexec/nhc/node-mark-online mcn26.chicagobooth.edu
/usr/libexec/nhc/node-mark-online:  Not sure how to handle node state "" on mcn26.chicagobooth.edu
/usr/libexec/nhc/node-mark-online:  Skipping  node mcn26.chicagobooth.edu ( )

It seems that nhc is not able to read the state of the node. Running
scontrol show node mcn26 gives:
NodeName=mcn26 Arch=x86_64 CoresPerSocket=16
   NodeAddr=mcn26 NodeHostName=mcn26 Version=20.11.8
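
One check that might narrow this down: the log above names the node as
mcn26.chicagobooth.edu, while scontrol reports NodeName=mcn26. The sketch
below compares the state Slurm returns for each form of the name. It
assumes sinfo is on the PATH and that the helper script looks the node up
by the hostname printed in the log; that is a guess, not something
confirmed from the script itself. node_state is a hypothetical helper
written for this example, not part of nhc or Slurm.

```shell
#!/bin/bash
# Sketch: compare the Slurm state reported for the fully qualified name
# and for the short node name. Names are taken from the output above.

fqdn="mcn26.chicagobooth.edu"
short="${fqdn%%.*}"   # strips the domain part, leaving "mcn26"

# Hypothetical helper: print the compact state for one node name,
# or "<empty>" when sinfo returns nothing for that name.
node_state() {
    local state
    state="$(sinfo -h -n "$1" -o '%t' 2>/dev/null | head -n 1)"
    printf '%s\n' "${state:-<empty>}"
}

if command -v sinfo >/dev/null 2>&1; then
    echo "state for $fqdn: $(node_state "$fqdn")"
    echo "state for $short: $(node_state "$short")"
else
    echo "sinfo not found on PATH" >&2
fi
```

If the short name returns a state and the fully qualified name returns
nothing, the empty state "" in the nhc output may come from the name
mismatch rather than from the node itself.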

Any idea what happened and why nhc is not reading the state of the node
anymore?
Best,


*Fritz Ratnasamy*

Data Scientist

Information Technology

