Dear slurm-user list,
I have a cloud node that is powered up and down on demand. Rarely, it can happen that Slurm's ResumeTimeout is reached and the node is therefore powered down. We have set ReturnToService=2 to avoid the node being marked DOWN, because the instance behind that node is created on demand, so after a failure nothing stops the system from starting the node again, as it will be a different instance.
I thought this would be enough, but apparently the node is still marked as "NOT_RESPONDING", which leads to Slurm not trying to schedule on it.
After a while the NOT_RESPONDING flag is removed, but I would like to clear it directly from within my fail script if possible, so that the node can return to service immediately and is not blocked by "NOT_RESPONDING".
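For context, the relevant pieces of our setup look roughly like this (node range and script paths are placeholders, not our actual configuration):

    # slurm.conf (sketch)
    ReturnToService=2
    ResumeTimeout=300
    ResumeProgram=/opt/slurm/resume.sh
    ResumeFailProgram=/opt/slurm/fail.sh    # the "fail script" mentioned above
    SuspendProgram=/opt/slurm/suspend.sh
    NodeName=cloud[001-010] State=CLOUD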
Best regards, Xaver
I am wondering why my original question didn't catch anyone's attention, just as feedback for me. Is it unclear where my problem lies, or is it clear but no solution is known? I looked through the documentation and have now searched the Slurm repository, but I am still unable to clearly identify how to handle "NOT_RESPONDING".
I would really like to improve my question if necessary.
Best regards, Xaver
Hi Xaver,
I found your thread while searching for a solution to the same issue with cloud nodes. In the past I have always used POWER_UP to get the node to register and clear the NOT_RESPONDING flag, but that necessarily creates an instance regardless of whether I need one. It turns out that updating the node with UNDRAIN accomplishes the same thing without booting an instance: setting UNDRAIN allows the node to be scheduled, which causes the resume program to run, and once the node has booted and registered, NOT_RESPONDING is cleared.
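Concretely, something along these lines (the node name is just a placeholder):

    # boots an instance whether or not one is needed
    scontrol update NodeName=cloud001 State=POWER_UP

    # lets the node be scheduled again without forcing a boot
    scontrol update NodeName=cloud001 State=UNDRAIN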
Unfortunately, the node state still displays NOT_RESPONDING, so it still shows up in sinfo --dead (see the example below), and as far as I can tell there is no way to separate "will boot" from "won't boot" nodes. Clearly there is still some internal state that does not appear to be user-visible, at least not via scontrol show node. And if there is a way to administratively clear NOT_RESPONDING entirely, I have not found it. But hopefully this helps.
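The kind of output I mean, with a placeholder node name and otherwise illustrative state flags:

    sinfo --dead
    scontrol show node cloud001 | grep -i State
    # State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING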
--nate
Hey Nate,
we actually fixed the underlying issue that caused the NOT_RESPONDING flag: on failures we terminated the node manually ourselves instead of letting Slurm call the terminate script. That led to Slurm believing the node should still be there when it had already been terminated.
Therefore, we no longer have the issue, as we no longer see nodes with NOT_RESPONDING.
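In practice, the idea is simply that the failure handling now asks Slurm to power the node down instead of deleting the instance out-of-band, roughly like this (a sketch, not exactly our script; placeholder node name, with the actual cloud API call living in the configured SuspendProgram):

    # let Slurm run its own terminate/suspend script instead of terminating manually
    scontrol update NodeName=cloud001 State=POWER_DOWN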
Nice to hear that you found a solution though.
Best, Xaver