[slurm-users] Slurm: Handling nodes that fail to POWER_UP in a cloud scheduling system

Wed Nov 23 13:38:55 UTC 2022

Xaver,

You want to use the ResumeFailProgram script.

We run a fully cloud-based cluster, and that is where we deal with things 
like this. Slurm calls it when your ResumeProgram does not result in 
slurmd being available on the node within ResumeTimeout (whatever the 
reason). Having to write it yourself makes complete sense when you think 
about the uses. Originally, it would be called because a node had a 
hardware issue and would not start. In the ResumeFailProgram you could 
send an email letting an admin know about it.
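
A minimal sketch of such a script, not a definitive implementation. It 
relies on Slurm passing the failed nodes to ResumeFailProgram as a single 
hostlist expression in the first argument. The log path and the function 
name handle_failed_nodes are assumptions made for illustration, and 
State=RESUME is simply the reset proposed in the original question.

```shell
#!/bin/bash
# Sketch of a ResumeFailProgram. Slurm passes the failed nodes as one
# hostlist expression (for example "cloud[1-3]") in the first argument.

# Hypothetical log location (an assumption); adjust it to your site.
LOG_FILE="${LOG_FILE:-/var/log/slurm/resume_fail.log}"

# Expand the hostlist, record each failure, and clear the DOWN state so
# the scheduler may try to power the node up again later.
handle_failed_nodes() {
    local hostlist="$1" node
    for node in $(scontrol show hostnames "$hostlist"); do
        echo "$(date -u +%FT%TZ) resume failed: $node" >> "$LOG_FILE"
        # An email to an admin could be sent here instead of, or in
        # addition to, the log entry.
        scontrol update NodeName="$node" State=RESUME
    done
}

# Run only when invoked with a hostlist argument, as Slurm does.
if [ "$#" -gt 0 ]; then
    handle_failed_nodes "$1"
fi
```

To wire it up, slurm.conf would point ResumeFailProgram at the path of 
the script, and ResumeTimeout controls how long Slurm waits before 
calling it.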

For me, I completely delete the node's resources and reset/recreate it. 
That fixes even a botched software change.
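
A rough sketch of that delete-and-reset idea, under stated assumptions 
rather than a copy of any real setup. delete_cloud_resources is a 
hypothetical placeholder for whatever your cloud provider's tooling 
offers, and reset_failed_nodes is an illustrative name. The sketch leaves 
a node DOWN if the teardown fails, so that an admin still sees it.

```shell
#!/bin/bash
# Sketch: tear down whatever is left of a failed cloud node, then let
# Slurm reuse the node name.

# Hypothetical placeholder: replace the body with your cloud provider's
# delete call for the instance, disks, and addresses behind node "$1".
delete_cloud_resources() {
    echo "would delete cloud resources for $1"
}

# Expand the hostlist passed by Slurm and reset each node whose cloud
# resources were removed successfully.
reset_failed_nodes() {
    local hostlist="$1" node
    for node in $(scontrol show hostnames "$hostlist"); do
        if delete_cloud_resources "$node"; then
            scontrol update NodeName="$node" State=RESUME
        else
            # Keep the node DOWN so the failure stays visible.
            echo "teardown failed for $node, leaving it DOWN" >&2
        fi
    done
}

# Run only when invoked with a hostlist argument, as Slurm does.
if [ "$#" -gt 0 ]; then
    reset_failed_nodes "$1"
fi
```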

Brian Andrus

On 11/23/2022 5:11 AM, Xaver Stiensmeier wrote:
> Hello slurm-users,
> The question can be found in a similar fashion here: 
> https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a-cloud-scheduling-system
>
>
>   Issue
>
>
>     Current behavior and problem description
>
> When a node fails to |POWER_UP|, it is marked |DOWN|. While this is a 
> great idea in general, this is not useful when working with |CLOUD| 
> nodes, because said |CLOUD| node is likely to be started on a 
> different machine and therefore to |POWER_UP| without issues. But 
> since the node is marked as down, that cloud resource is no longer 
> used and never started again until freed manually.
>
>
>     Wanted behavior
>
> Ideally slurm would not mark the node as |DOWN|, but just attempt to 
> start another. If that's not possible, automatically resuming |DOWN| 
> nodes would also be an option.
>
>
>     Question
>
> How can I prevent slurm from marking nodes that fail to |POWER_UP| as 
> |DOWN| or make slurm restore |DOWN| nodes automatically to prevent 
> slurm from forgetting cloud resources?
>
>
>   Attempts and Thoughts
>
>
>     ReturnToService
>
> I tried solving this using |ReturnToService| 
> <https://slurm.schedmd.com/slurm.conf.html#OPT_ReturnToService>, but 
> that didn't seem to solve my issue. If I understand it correctly, it 
> only accepts slurm nodes that start up by themselves or are started 
> manually, and does not take them into consideration when scheduling 
> jobs until they have been started.
>
>
>     SlurmctldParameters=idle_on_node_suspend
>
> While this is great and definitely helpful, it doesn't solve the issue 
> at hand, since a node that failed during power up is not suspended.
>
>
>     ResumeFailProgram
>
> I considered using |ResumeFailProgram| 
> <https://slurm.schedmd.com/slurm.conf.html#OPT_ResumeFailProgram>, but 
> it seems odd that you have to write a script yourself to return your 
> nodes to service if they fail on startup. This case seems too common 
> not to be handled by slurm itself. However, this will be my next 
> attempt: implement a script that, for every given node, calls
>
>     sudo scontrol update NodeName=$NODE_NAME state=RESUME reason=FailedShutdown
>
>
>   Additional Information
>
> In the |POWER_UP| script I terminate the server if the setup fails 
> for any reason and return a nonzero exit code.
>
> In our Cloud Scheduling 
> <https://slurm.schedmd.com/elastic_computing.html> instances are 
> created once they are needed and deleted once they are no longer 
> needed. This means that slurm records a node as |DOWN| while no 
> real instance behind it exists anymore. If that node weren't 
> marked |DOWN| and a job were scheduled towards it at a later time, 
> slurm would simply start an instance and run the job on that new 
> instance. I am just stating this to be as explicit as possible.
>
> Best regards,
> Xaver Stiensmeier
>
> PS: This is the first time I am using the slurm-users list and I hope 
> I am not violating any rules with this question. Please let me know if 
> I am.
>