[slurm-users] Slurm: Handling nodes that fail to POWER_UP in a cloud scheduling system
xaverstiensmeier at gmx.de
Wed Nov 23 13:11:09 UTC 2022
A similar version of this question can be found here:
Current behavior and problem description
When a node fails to |POWER_UP|, it is marked |DOWN|. While this is a
good idea in general, it is not useful when working with |CLOUD| nodes,
because such a |CLOUD| node will likely be started on a different
machine next time and therefore |POWER_UP| without issues. But since the
node is marked |DOWN|, that cloud resource is no longer used and is
never started again until it is freed manually.
Ideally, Slurm would not mark the node as |DOWN| but would simply
attempt to start another one. If that is not possible, automatically
resuming |DOWN| nodes would also be an option.
How can I prevent Slurm from marking nodes that fail to |POWER_UP| as
|DOWN|, or make Slurm restore |DOWN| nodes automatically, so that it
does not forget cloud resources?
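To illustrate what I mean by restoring |DOWN| nodes automatically: a
periodic job along these lines would probably work around the problem,
though it feels like a hack (the partition name "cloud" is just a
placeholder):

    #!/bin/bash
    # Workaround sketch: resume every DOWN node in the cloud partition
    # so Slurm can retry the power-up on a new instance.
    down_nodes=$(sinfo -h -p cloud -t down -o '%N')
    if [ -n "$down_nodes" ]; then
        scontrol update NodeName="$down_nodes" State=RESUME
    fi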
Attempts and Thoughts
I tried solving this using |ReturnToService|
<https://slurm.schedmd.com/slurm.conf.html#OPT_ReturnToService>, but
that didn't seem to solve my issue. If I understand it correctly, it
only returns |DOWN| nodes to service once they register with the
controller again, whether they boot by themselves or are started
manually; until then they are not considered when scheduling jobs.
While this is great and definitely helpful, it doesn't solve the issue
at hand, since a node that failed during power up is not suspended and,
because its instance was terminated, never registers again.
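For reference, this is roughly the setting I experimented with in
slurm.conf (the value 2 is just an example; as far as I understand the
documentation, it only takes effect once the node registers with a
valid configuration, which a terminated cloud instance never does):

    # slurm.conf excerpt (example value)
    # Return a DOWN node to service as soon as it registers with a
    # valid configuration, regardless of why it was set DOWN.
    ReturnToService=2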
I also considered using |ResumeFailedProgram|, but it sounds odd that
you have to write a script yourself to return your nodes to service
when they fail on startup. This case sounds too common not to be
handled by Slurm itself. However, this will be my next attempt:
implement a script that, for every given node, calls
sudo scontrol update NodeName=$NODE_NAME state=RESUME
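Something like the following sketch is what I have in mind (untested;
the script path is a placeholder). Slurm passes the nodes that failed
to power up as a hostlist expression in the first argument, so the
script expands it and resumes each node:

    #!/bin/bash
    # Sketch of a ResumeFailedProgram: $1 is a hostlist expression such
    # as "cloud[1-3]" naming the nodes whose power-up failed. Resuming
    # them should let Slurm retry the power-up on a fresh instance.
    for node in $(scontrol show hostnames "$1"); do
        scontrol update NodeName="$node" State=RESUME
    done

It would then be wired up in slurm.conf with something like
|ResumeFailedProgram=/etc/slurm/resume_failed.sh|.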
In the |POWER_UP| script, I terminate the server if the setup fails
for any reason and return a non-zero exit code.
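In outline, that failure path looks like this (|create_instance|,
|setup_node| and |delete_instance| are hypothetical helpers standing in
for our site-specific provisioning logic):

    #!/bin/bash
    # Outline of the POWER_UP (ResumeProgram) failure handling described
    # above; the three helper functions are placeholders.
    for node in $(scontrol show hostnames "$1"); do
        if ! create_instance "$node" || ! setup_node "$node"; then
            delete_instance "$node"  # terminate the half-started server
            exit 1                   # non-zero exit: power-up failed
        fi
    done
    exit 0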
In our Cloud Scheduling
<https://slurm.schedmd.com/elastic_computing.html> setup, instances are
created once they are needed and deleted once they are no longer
needed. This means that Slurm stores a node as |DOWN| even though no
real instance exists behind it anymore. If that node were not marked
|DOWN| and a job were scheduled to it at a later time, it would simply
start a new instance and run on it. I am just stating this to make our
setup clear.
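For context, our configuration follows the usual elastic computing
pattern, roughly like this excerpt (all names, paths and sizes are
placeholders, not our real values):

    # slurm.conf excerpt (placeholders)
    ResumeProgram=/etc/slurm/resume.sh    # creates and sets up an instance
    SuspendProgram=/etc/slurm/suspend.sh  # deletes the instance
    SuspendTime=300
    ResumeTimeout=600
    NodeName=cloud[1-8] State=CLOUD CPUs=4 RealMemory=8000
    PartitionName=cloud Nodes=cloud[1-8]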
PS: This is the first time I am using the slurm-users list, and I hope
I am not violating any rules with this question. Please let me know if
I do.