[slurm-users] Elastic Compute on Cloud - Error Handling

Felix Wolfheimer f.wolfheimer at googlemail.com
Sat Jul 28 12:32:28 MDT 2018


I'm experimenting with SLURM Elastic Compute on a cloud platform and have
run into the following situation. Say SLURM requests that a compute
instance be started. The ResumeProgram tries to create the instance but
fails because the cloud provider can't supply the instance type at that
moment (this happens, for example, when a GPU instance is requested but
the datacenter simply doesn't have the capacity for it).

SLURM then marks the node as "DOWN" and does not try to request it again.
For this scenario that behavior is not optimal. Instead of marking the
node DOWN and never retrying, I'd like slurmctld to simply forget about
the failure and try again later to start the node. Is there any knob that
can be used to achieve this behavior? Ideally, the behavior would be
triggered by the return code of the ResumeProgram, e.g.:

return code=0 - Node is starting up
return code=1 - A permanent error has occurred, don't try again
return code=2 - A temporary failure has occurred. Try again later.
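
To make the proposed convention concrete, here is a minimal sketch of how
a ResumeProgram could map provisioning outcomes to those three return
codes. It only illustrates the proposal: it does not assume that slurmctld
interprets these codes today. The exception classes and the
create_instance callable are hypothetical placeholders for whatever the
cloud provider's SDK actually offers.

```python
from enum import IntEnum


class ResumeStatus(IntEnum):
    """Return codes proposed above (not an existing Slurm convention)."""
    STARTING = 0           # node is starting up
    PERMANENT_ERROR = 1    # don't try again
    TEMPORARY_FAILURE = 2  # try again later


class CapacityUnavailable(Exception):
    """Hypothetical: the provider has no capacity for this type right now."""


class ProvisioningError(Exception):
    """Hypothetical: the request can never succeed (e.g. invalid type)."""


def resume_nodes(node_names, create_instance):
    """Try to start every node and summarize the result as one status.

    create_instance is a hypothetical callable taking a node name; it
    stands in for the real cloud API call and raises one of the two
    exceptions above on failure.
    """
    saw_permanent = False
    saw_temporary = False
    for name in node_names:
        try:
            create_instance(name)
        except ProvisioningError:
            saw_permanent = True
        except CapacityUnavailable:
            saw_temporary = True
    # A permanent error takes precedence, since retrying cannot fix it.
    if saw_permanent:
        return ResumeStatus.PERMANENT_ERROR
    if saw_temporary:
        return ResumeStatus.TEMPORARY_FAILURE
    return ResumeStatus.STARTING
```

A real ResumeProgram would pass the result to sys.exit() after expanding
the node list it receives as its argument. Reporting a single code for the
whole batch is a simplification; per-node reporting would need a different
mechanism.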