[slurm-users] Re: Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

29 Feb 2024


I am wondering why my question (below) did not catch anyone's attention,
and I would appreciate feedback. Is it unclear where my problem lies, or
is the problem clear but no solution known? I have looked through the
documentation and searched the Slurm repository, but I am still unable
to identify how to handle "NOT_RESPONDING".
I would gladly improve my question if necessary.
Best regards,
Xaver
On 23.02.24 18:55, Xaver Stiensmeier wrote:
...
Dear slurm-user list,
I have a cloud node that is powered up and down on demand. Occasionally
Slurm's ResumeTimeout is reached and the node is therefore powered
down. We have set ReturnToService=2 to avoid the node being marked
down: the instance behind the node is created on demand, so after a
failure nothing stops the system from starting the node again as a
different instance.
I thought this would be enough, but apparently the node is still
marked with "NOT_RESPONDING", which leads to Slurm not trying to
schedule on it.
After a while NOT_RESPONDING is removed, but I would like to remove it
directly from within my fail script if possible, so that the node can
return to service immediately and not be blocked by "NOT_RESPONDING".
Best regards,
Xaver
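
For illustration, here is a minimal sketch of the kind of fail script
described above. It assumes the script is registered as
ResumeFailProgram in slurm.conf and receives the failed nodes as a
single hostlist expression in its first argument. The helper name
build_scontrol_cmd and the choice of State=RESUME are assumptions made
for this example. Whether such an update actually clears
NOT_RESPONDING is the open question of this thread, so treat it as a
starting point for experiments and not as a confirmed fix.

```python
#!/usr/bin/env python3
"""Hypothetical ResumeFailProgram sketch; not a confirmed fix."""
import subprocess
import sys


def build_scontrol_cmd(nodes: str, state: str = "RESUME") -> list[str]:
    """Build the scontrol call that requests a state change for `nodes`.

    `nodes` is a Slurm hostlist expression such as "cloud[1-3]".
    The default state is an assumption, not a verified remedy.
    """
    if not nodes:
        raise ValueError("no node names given")
    return ["scontrol", "update", f"NodeName={nodes}", f"State={state}"]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Run the update for the hostlist passed as the first argument and
    # pass scontrol's exit code through to the caller.
    result = subprocess.run(build_scontrol_cmd(sys.argv[1]), check=False)
    sys.exit(result.returncode)
```

Running "scontrol show node <name>" after the script has fired shows
whether the NOT_RESPONDING flag is still set.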
