[slurm-users] node health check

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Tue Jan 31 06:59:57 UTC 2023


On 1/31/23 04:35, Ratnasamy, Fritz wrote:
>   Currently, some of our nodes are overloaded. The installed NHC used to 
> check the load and drain a node when it was overloaded. However, for the 
> past few days it has not been showing the state of the node. When I run 
> /usr/sbin/nhc manually, it says:
> 20230130 21:25:14 [slurm] /usr/libexec/nhc/node-mark-online mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online:  Not sure how to handle node state "" on mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online:  Skipping  node mcn26.chicagobooth.edu ( )
> 
> It seems that it is not able to read the state of the node. I ran
> "scontrol show node mcn26" and got:
> NodeName=mcn26 Arch=x86_64 CoresPerSocket=16
>     NodeAddr=mcn26 NodeHostName=mcn26 Version=20.11.8
> 
> Any idea what happened and why nhc is not reading the state of the node 
> anymore?

What's the complete output of "scontrol show node mcn26", especially the 
State=... information?
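
To pull out just the State= field, something like the following might help. It is only a sketch: node_state is a hypothetical helper of my own, not part of Slurm or NHC, and it assumes the usual Key=Value layout that "scontrol show node" prints.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm or NHC): read "scontrol show node"
# output on stdin and print the value of the State= field, if present.
node_state() {
    # Put every whitespace-separated Key=Value token on its own line,
    # then print whatever follows "State=" in the first token that
    # starts with it (anchored, so NextState= does not match).
    tr -s '[:space:]' '\n' | sed -n 's/^State=//p' | head -n 1
}
```

Usage would be "scontrol show node mcn26 | node_state". If that prints nothing, the State= field is missing from the output, which would at least be consistent with the empty node state "" that node-mark-online complains about.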

Which version of NHC are you running?

/Ole






