[slurm-users] nodes going to down* and getting stuck in that state

Thu May 20 04:15:11 UTC 2021

Hi,

We have a cluster on Google Cloud (GCP) with a few partitions set up to 
auto-scale and one partition that is deliberately not auto-scaled. The 
desired state is for all of the nodes in this non-autoscaled partition 
(SuspendExcParts=gpu-t4-4x-ondemand) to keep running uninterrupted. 
However, nodes in it periodically end up in the down* state, and we 
cannot get them back into a usable state. We are running Slurm 
19.05.7.

We have a script that runs periodically, checks the state of each node, 
and acts on it. If a node is in a down state, the script terminates it 
and, if the termination succeeds, sets the node's state to POWER_DOWN. 
After a short 1-second pause, every node that is in the POWERING_DOWN 
state and is not drained is set to RESUME.
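
A rough sketch of that logic in Python, to make the order of operations 
concrete. This is not the actual script: the three callbacks 
(get_states, terminate, run) are hypothetical stand-ins, and the state 
strings are assumed to be "+"-separated flags such as 
"IDLE+POWERING_DOWN", which may not match what your Slurm version 
prints. Only the "scontrol update NodeName=... State=..." calls are 
real Slurm commands.

```python
import time

# Flags treated as "drained" in this sketch (an assumption, adjust as needed).
DRAIN_FLAGS = {"DRAIN", "DRAINED", "DRAINING"}


def flags(state):
    """Split a state string like 'DOWN*+DRAIN' into a set of upper-case flags.

    The trailing '*' (node not responding) is dropped from each flag.
    """
    return {part.rstrip("*").upper() for part in state.split("+")}


def scontrol_update(node, state):
    """Build the argv for 'scontrol update NodeName=<node> State=<state>'."""
    return ["scontrol", "update", f"NodeName={node}", f"State={state}"]


def recovery_pass(get_states, terminate, run, pause=1.0):
    """One pass of the periodic check.

    get_states() -> dict of node name to state string   (hypothetical)
    terminate(node) -> True if the instance was removed  (hypothetical)
    run(argv) executes a command, e.g. via subprocess.run (hypothetical)
    """
    # Step 1: terminate down nodes and mark them POWER_DOWN on success.
    for node, state in get_states().items():
        if "DOWN" in flags(state) and terminate(node):
            run(scontrol_update(node, "POWER_DOWN"))

    # Step 2: short pause before re-reading the node states.
    time.sleep(pause)

    # Step 3: resume nodes that are powering down and not drained.
    for node, state in get_states().items():
        node_flags = flags(state)
        if "POWERING_DOWN" in node_flags and not node_flags & DRAIN_FLAGS:
            run(scontrol_update(node, "RESUME"))
```

Note that step 3 re-reads the states only one second after step 1, so 
the RESUME is issued while the node is still powering down.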

Sometimes, after we start a node back up and slurmd is running on it, 
we cannot get it back into a usable Slurm state, even after manually 
changing its state. The node seems to alternate between idle* and 
down*, even though it is up and we can log into it.
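
For context, the trailing "*" in sinfo output marks a node that is not 
responding to the controller. A minimal sketch of how the affected 
nodes could be listed, assuming the text comes from 
sinfo -h -N -o "%N %T" (one "name state" pair per line); the helper 
name is made up for illustration.

```python
def not_responding(sinfo_output):
    """Return the sorted names of nodes whose state ends in '*'.

    Expects one 'nodename state' pair per line. With -N a node appears
    once per partition it belongs to, so duplicates are collapsed.
    """
    nodes = set()
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].endswith("*"):
            nodes.add(parts[0])
    return sorted(nodes)
```

The input can be captured with subprocess.run(["sinfo", "-h", "-N", 
"-o", "%N %T"], capture_output=True, text=True).stdout.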

Does anyone have an idea of what might be going on?  And what we can do 
to get these nodes back into a usable (I guess "idle") state?

Thanks,

Herc