Hey Nate,
we actually fixed the underlying issue that caused the NOT_RESPONDING flag: on failures we terminated the node ourselves instead of letting Slurm call its terminate script. That led Slurm to believe the node should still be there when it had already been terminated.
Therefore the issue is gone for us; we no longer see nodes with NOT_RESPONDING.
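For context, the "terminate script" here is Slurm's SuspendProgram hook for power saving / cloud nodes; the relevant slurm.conf pieces look roughly like this (the script paths and timeouts below are just placeholders, not our actual values):

    ResumeProgram=/usr/local/sbin/slurm-resume.sh    # boots/creates the cloud instance
    SuspendProgram=/usr/local/sbin/slurm-suspend.sh  # terminates the instance
    SuspendTime=600        # idle seconds before slurmctld suspends a node
    ResumeTimeout=900
    SuspendTimeout=120

Terminating the instance outside of that path leaves slurmctld expecting a node that is already gone.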
Nice to hear that you found a solution though.
Best, Xaver
On 19.09.24 15:04, nate--- via slurm-users wrote:
Hi Xaver,
I found your thread while searching for a solution to the same issue with cloud nodes. In the past I have always used POWER_UP to get the node to register and clear the NOT_RESPONDING flag, but that necessarily boots an instance regardless of whether I need one. It turns out that updating the node with UNDRAIN accomplishes the same thing without booting an instance: UNDRAIN makes the node schedulable again, which causes the resume program to run when a job is placed on it, and once the node has booted and registered, NOT_RESPONDING is cleared.
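Concretely, with a placeholder node name:

    # old approach: forces an instance to boot just so the node re-registers
    scontrol update NodeName=cloud01 State=POWER_UP

    # new approach: just make the node schedulable again; the resume program
    # runs only when a job is actually placed there, and registration then
    # clears NOT_RESPONDING
    scontrol update NodeName=cloud01 State=UNDRAIN

(cloud01 is only an example name.)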
Unfortunately, the node state still displays NOT_RESPONDING in the meantime, so it still shows up in sinfo --dead, and as far as I can tell there is no way to distinguish "will boot" from "won't boot" nodes. There is clearly some internal state that does not appear to be user-visible, at least not in scontrol show node. And if there is a way to administratively clear NOT_RESPONDING outright, I have not found it. But hopefully this helps.
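For what it's worth, the flag is only visible in the raw state output, e.g. something like:

    sinfo --dead -o "%N %T"
    scontrol show node cloud01 | grep -i NOT_RESPONDING

but neither tells you whether the node will actually come back on its own.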
--nate