Dear slurm-user list,
I have a cloud node that is powered up and down on demand. Rarely, it can happen that Slurm's ResumeTimeout is reached and the node is therefore powered down. We have set ReturnToService=2 to avoid the node being marked DOWN, because the instance behind that node is created on demand, so after a failure nothing stops the system from starting the node again, as it will be a different instance.
I thought this would be enough, but apparently the node is still marked as "NOT_RESPONDING", which leads to Slurm not trying to schedule on it.
After a while the NOT_RESPONDING flag is removed, but I would like to clear it directly from within my fail script if possible, so that the node can return to service immediately and is not blocked by "NOT_RESPONDING".
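For context, the relevant pieces of our setup look roughly like this (node range and script paths are placeholders, not our actual configuration):

    # slurm.conf (sketch)
    ReturnToService=2
    ResumeTimeout=300
    ResumeProgram=/opt/slurm/resume.sh
    ResumeFailProgram=/opt/slurm/fail.sh    # the "fail script" mentioned above
    SuspendProgram=/opt/slurm/suspend.sh
    NodeName=cloud[001-010] State=CLOUD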
Best regards, Xaver
I am wondering why my original question didn't catch anyone's attention, just as feedback for me. Is it unclear where my problem lies, or is it clear but no solution is known? I looked through the documentation and have now searched the Slurm repository, but I am still unable to clearly identify how to handle "NOT_RESPONDING".
I would really like to improve my question if necessary.
Best regards, Xaver
Hi Xaver,
I found your thread while searching for a solution to the same issue with cloud nodes. In the past I have always used POWER_UP to get the node to register and clear the NOT_RESPONDING flag, but that necessarily creates an instance regardless of whether I need one. It turns out that updating the node with UNDRAIN accomplishes the same thing without booting an instance: setting UNDRAIN allows the node to be scheduled, which causes the resume program to run, and once the node has booted and registered, NOT_RESPONDING is cleared.
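Concretely, something along these lines (the node name is just a placeholder):

    # boots an instance whether or not one is needed
    scontrol update NodeName=cloud001 State=POWER_UP

    # lets the node be scheduled again without forcing a boot
    scontrol update NodeName=cloud001 State=UNDRAIN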
Unfortunately, the node state still displays NOT_RESPONDING, so it still shows up in sinfo --dead (see the example below), and as far as I can tell there is no way to separate "will boot" from "won't boot" nodes. Clearly there is still some internal state that does not appear to be user-visible, at least not via scontrol show node. And if there is a way to administratively clear NOT_RESPONDING entirely, I have not found it. But hopefully this helps.
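The kind of output I mean, with a placeholder node name and otherwise illustrative state flags:

    sinfo --dead
    scontrol show node cloud001 | grep -i State
    # State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING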
--nate
Hey Nate,
we actually fixed the underlying issue that caused the NOT_RESPONDING flag: on failures we terminated the node manually ourselves instead of letting Slurm call the terminate script. That led to Slurm believing the node should still be there when it had already been terminated.
Therefore, we no longer have the issue, as we no longer see nodes with NOT_RESPONDING.
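In practice, the idea is simply that the failure handling now asks Slurm to power the node down instead of deleting the instance out-of-band, roughly like this (a sketch, not exactly our script; placeholder node name, with the actual cloud API call living in the configured SuspendProgram):

    # let Slurm run its own terminate/suspend script instead of terminating manually
    scontrol update NodeName=cloud001 State=POWER_DOWN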
Nice to hear that you found a solution though.
Best, Xaver