[slurm-users] Slurm: Handling nodes that fail to POWER_UP in a cloud scheduling system

Xaver Stiensmeier xaverstiensmeier at gmx.de
Wed Nov 23 13:11:09 UTC 2022


Hello slurm-users,
The question can be found in a similar fashion here:
https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a-cloud-scheduling-system


  Issue


    Current behavior and problem description

When a node fails to |POWER_UP|, it is marked |DOWN|. While this is a
sensible default in general, it is not useful when working with |CLOUD|
nodes, because the next attempt would likely start said |CLOUD| node on
a different machine, where it would |POWER_UP| without issues. But since
the node is marked |DOWN|, that cloud resource is no longer used and is
never started again until it is returned to service manually.


    Wanted behavior

Ideally, slurm would not mark the node as |DOWN| but would simply
attempt to start another one. If that's not possible, automatically
resuming |DOWN| nodes would also be an option.


    Question

How can I prevent slurm from marking nodes that fail to |POWER_UP| as
|DOWN|, or make slurm restore |DOWN| nodes automatically, so that slurm
does not forget cloud resources?


  Attempts and Thoughts


    ReturnToService

I tried solving this using |ReturnToService|
<https://slurm.schedmd.com/slurm.conf.html#OPT_ReturnToService>, but
that didn't seem to solve my issue since, if I understand it correctly,
it only returns a |DOWN| node to service once its slurmd registers
again (whether the node came up by itself or was started manually), and
until then the node is not considered when scheduling jobs. A cloud
node that failed to |POWER_UP| never registers, because no instance
exists behind it anymore.
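
For reference, a minimal slurm.conf sketch of what was tried (the value
2 is the most permissive setting per the documentation):

    # slurm.conf excerpt
    ReturnToService=2   # return a DOWN node to service once its slurmd
                        # registers with a valid configuration -- which
                        # never happens for a cloud node whose instance
                        # no longer exists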


    SlurmctldParameters=idle_on_node_suspend

While this is great and definitely helpful, it doesn't solve the issue
at hand, since a node that failed during power-up is not suspended.
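
For completeness, this is the corresponding setting (a slurm.conf
excerpt):

    # slurm.conf excerpt
    SlurmctldParameters=idle_on_node_suspend   # suspended nodes return
                                               # to IDLE, but a failed
                                               # POWER_UP never reaches
                                               # the suspend path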


    ResumeFailProgram

I considered using |ResumeFailProgram|
<https://slurm.schedmd.com/slurm.conf.html#OPT_ResumeFailProgram>, but
it seems odd to have to write a script yourself just to return nodes to
service when they fail on startup; this case seems too common not to be
handled by slurm itself. However, this will be my next attempt:
implement a script that, for every given node, calls

    sudo scontrol update NodeName=$NODE_NAME state=RESUME reason=FailedShutdown
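
A minimal sketch of such a script, assuming slurm passes the failed
nodes to |ResumeFailProgram| as a single hostlist expression in the
first argument (e.g. "cloud-[1-3]") and that the script runs with
sufficient privileges (as far as I understand it is executed as
SlurmUser, so |sudo| may not be needed; "FailedStartup" is just an
illustrative reason string):

    #!/bin/bash
    # Expand the hostlist expression into individual node names and
    # return each failed node to service.
    for NODE_NAME in $(scontrol show hostnames "$1"); do
        scontrol update NodeName="$NODE_NAME" state=RESUME \
            reason=FailedStartup
    done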


  Additional Information

In the |POWER_UP| script, I terminate the server if the setup fails for
any reason and return a non-zero exit code.
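
For illustration, the end of such a resume script might look like the
following (|create_instance| and |delete_instance| are hypothetical
placeholders for the actual cloud API calls):

    #!/bin/bash
    # $1 is the node to power up. If the instance cannot be set up,
    # tear it down and exit non-zero so slurm sees the POWER_UP as
    # failed.
    if ! create_instance "$1"; then
        delete_instance "$1"
        exit 1
    fi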

In our Cloud Scheduling
<https://slurm.schedmd.com/elastic_computing.html> setup, instances are
created once they are needed and deleted once they are no longer
needed. This means that slurm records a node as |DOWN| while no real
instance exists behind it anymore. If that node were not marked |DOWN|
and a job were scheduled to it at a later time, slurm would simply
start a new instance and run the job on it. I am just stating this to
be maximally explicit.

Best regards,
Xaver Stiensmeier

PS: This is the first time I am using the slurm-users list, and I hope
I am not violating any rules with this question. Please let me know if
I am.