[slurm-users] Elastic Compute on Cloud - Error Handling

Lachlan Musicman datakid at gmail.com
Sat Jul 28 19:58:54 MDT 2018


On 29 July 2018 at 04:32, Felix Wolfheimer <f.wolfheimer at googlemail.com>
wrote:

> I'm experimenting with SLURM Elastic Compute on a cloud platform. I'm
> facing the following situation: Let's say, SLURM requests that a compute
> instance is started. The ResumeProgram tries to create the instance, but
> doesn't succeed because the cloud provider can't provide the instance type
> at this point in time (happens for example if a GPU instance is
> requested, but the datacenter simply doesn't have the capacity to provide
> this instance).
> SLURM will mark the instance as "DOWN" and will not try to request it
> again. For this scenario this behavior is not optimal. Instead of marking
> the node DOWN and never requesting it again, I'd like slurmctld to just
> forget about the failure and try again to start the node after some time.
> Is there any knob which can be used to achieve this behavior?
> Optimally, the behavior might be triggered by the return code of the
> ResumeProgram, e.g.,
>
> return code=0 - Node is starting up
> return code=1 - A permanent error has occurred, don't try again
> return code=2 - A temporary failure has occurred. Try again later.
>
>
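
To make the proposed convention concrete, here is a rough, untested sketch
of a ResumeProgram that maps outcomes to those three return codes. As far
as I know slurmctld does not act on these codes, which is exactly the gap
described above. CapacityError and create_instance are hypothetical
placeholders for whatever the cloud provider's SDK offers, and the sketch
assumes a single node name rather than the hostlist expression a real
ResumeProgram receives.

```python
#!/usr/bin/env python3
"""Sketch of a ResumeProgram using the exit-code convention proposed above."""
import sys

EXIT_STARTING = 0   # node is starting up
EXIT_PERMANENT = 1  # a permanent error has occurred, don't try again
EXIT_TEMPORARY = 2  # a temporary failure has occurred, try again later


class CapacityError(Exception):
    """Hypothetical: the provider cannot supply the instance type right now."""


def create_instance(node_name):
    """Hypothetical placeholder for the cloud provider's API call."""
    raise NotImplementedError("replace with the provider SDK call")


def resume(node_name, create=create_instance):
    """Try to start one node and map the outcome to the proposed exit codes."""
    try:
        create(node_name)
    except CapacityError:
        # Capacity shortages are expected to clear up, so report "temporary".
        return EXIT_TEMPORARY
    except Exception:
        # Anything else is treated as permanent in this sketch.
        return EXIT_PERMANENT
    return EXIT_STARTING


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: argv[1] is a single node name. A real ResumeProgram gets a
    # hostlist expression, which would need expanding first (for example with
    # `scontrol show hostnames`).
    sys.exit(resume(sys.argv[1]))
```

Passing the provider call in as the `create` argument keeps the
classification logic testable without touching a real cloud account.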

I don't have an answer to your question, but I would like to know how you
manage injecting the hostname and/or IP address into slurm.conf and then
distributing it in this situation.

I have read the documentation, but it doesn't indicate a best practice for
this scenario, IIRC.

Is it as simple as these steps: wait for boot, grab the hostname, inject it
into slurm.conf, distribute slurm.conf to the nodes, and restart Slurm?
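
One alternative that might avoid rewriting slurm.conf per instance: scontrol
can set NodeAddr and NodeHostname on an existing node entry at runtime, so
the ResumeProgram could point a predefined node name at the freshly booted
instance. A rough, untested sketch follows; the node name, address, and
hostname in the test values are made up, and whether this fits a given
setup is an assumption on my part.

```python
import subprocess


def build_update_cmd(node_name, address, hostname):
    """Build the scontrol command that points a Slurm node name at an instance."""
    return [
        "scontrol", "update",
        f"NodeName={node_name}",
        f"NodeAddr={address}",
        f"NodeHostname={hostname}",
    ]


def register_node(node_name, address, hostname, run=subprocess.run):
    """Run the update; `run` is injectable so this can be tried without Slurm."""
    return run(build_update_cmd(node_name, address, hostname), check=True)
```

If that works as hoped, slurm.conf would only need the generic node
definitions up front, with the address filled in once the instance has
booted.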

Cheers
L.