[slurm-users] node health check
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Jan 31 06:59:57 UTC 2023
On 1/31/23 04:35, Ratnasamy, Fritz wrote:
> Currently, some of our nodes are overloaded. Our NHC installation used to
> check the load and drain a node when it was overloaded. However, for the
> past few days, it has not been reporting the state of the node. When I run
> /usr/sbin/nhc manually, it prints:
> 20230130 21:25:14 [slurm] /usr/libexec/nhc/node-mark-online mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online: Not sure how to handle node state "" on mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online: Skipping node mcn26.chicagobooth.edu ( )
>
> It seems that NHC is not able to read the state of the node. I ran
> "scontrol show node mcn26" and got:
> NodeName=mcn26 Arch=x86_64 CoresPerSocket=16
> NodeAddr=mcn26 NodeHostName=mcn26 Version=20.11.8
>
> Any idea what happened and why nhc is not reading the state of the node
> anymore?
What's the complete output of "scontrol show node mcn26", especially the
State=... information?
Which version of NHC are you running?
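One possible way to collect both pieces of information is sketched below.
The helper name extract_state is made up for illustration, and the package
name lbnl-nhc assumes NHC was installed from the upstream RPM; adjust it if
your site packages NHC differently.

```shell
#!/bin/sh
# Hypothetical helper: read "scontrol show node" output on stdin and
# print the value of the State= field (empty output if there is none).
extract_state() {
    # Put each key=value token on its own line, then keep only State=.
    tr -s ' \t' '\n' | sed -n 's/^State=//p' | head -n 1
}

# Query Slurm only if scontrol is installed on this host.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show node mcn26 | extract_state
fi

# Assumption: NHC came from the lbnl-nhc RPM package.
if command -v rpm >/dev/null 2>&1; then
    rpm -q lbnl-nhc || echo "lbnl-nhc RPM not found"
fi
```

If the first command prints nothing, the State= field is missing from the
scontrol output itself, which would match the empty node state "" that
node-mark-online reports.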
/Ole