<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="s-prose js-post-body" itemprop="text"> Hello
slurm-users,</div>
<div class="s-prose js-post-body" itemprop="text">A similar version
of this question has been posted here:
<a class="moz-txt-link-freetext" href="https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a-cloud-scheduling-system">https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a-cloud-scheduling-system</a><br>
<h1>Issue</h1>
<h2>Current behavior and problem description</h2>
<p>When a node fails to <code>POWER_UP</code>, it is marked <code>DOWN</code>.
While this is sensible in general, it is not useful when
working with <code>CLOUD</code> nodes, because on the next attempt
said <code>CLOUD</code> node is likely to be started on a different
machine and therefore to <code>POWER_UP</code> without issues. But since
the node is marked <code>DOWN</code>, that cloud resource is no longer
used and is never started again until it is freed manually.</p>
<h2>Wanted behavior</h2>
<p>Ideally slurm would not mark the node as <code>DOWN</code>,
but just attempt to start another. If that's not possible,
automatically resuming <code>DOWN</code> nodes would also be an
option.</p>
<h2>Question</h2>
<p>How can I prevent slurm from marking nodes that fail to <code>POWER_UP</code>
as <code>DOWN</code> or make slurm restore <code>DOWN</code>
nodes automatically to prevent slurm from forgetting cloud
resources?</p>
<h1>Attempts and Thoughts</h1>
<h2>ReturnToService</h2>
<p>I tried solving this using <a
href="https://slurm.schedmd.com/slurm.conf.html#OPT_ReturnToService"
rel="nofollow noreferrer"><code>ReturnToService</code></a> but
that didn't seem to solve my issue. If I understand it
correctly, it only returns <code>DOWN</code> nodes to service once
they start up by themselves or are started manually, and does not
consider them when scheduling jobs until then.</p>
<h2>SlurmctldParameters=idle_on_node_suspend</h2>
<p>While this is great and definitely helpful, it doesn't solve
the issue at hand, since a node that failed during power-up is
not suspended.</p>
<h2>ResumeFailProgram</h2>
<p>I considered using <a
href="https://slurm.schedmd.com/slurm.conf.html#OPT_ResumeFailProgram"
rel="nofollow noreferrer"><code>ResumeFailProgram</code></a>,
but it seems odd to have to write a script yourself just to
return nodes to service when they fail on startup. This
case seems too common not to be handled by slurm itself. However,
this will be my next attempt: implement a script that calls the
following for every given node:</p>
<blockquote>
<p>sudo scontrol update NodeName=$NODE_NAME state=RESUME
reason=FailedShutdown</p>
</blockquote>
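<p>A minimal sketch of such a script, under a few assumptions: the
slurm.conf documentation describes <code>ResumeFailProgram</code> as
receiving the failed nodes as a single hostlist expression in its
first argument, the program runs as a user that is allowed to update
node state (so <code>sudo</code> is left out), and the
<code>SCONTROL</code> variable and the function name
<code>resume_failed_nodes</code> are hypothetical names introduced
here only so the script can be exercised without a real cluster. It
is untested against a live cloud setup.</p>

```shell
#!/bin/bash
# Hypothetical ResumeFailProgram sketch. Assumption: slurmctld passes the
# nodes that failed to resume as one hostlist expression, e.g. "cloud[1-3]".

# SCONTROL is an illustrative override hook; it defaults to the real binary.
SCONTROL="${SCONTROL:-scontrol}"

resume_failed_nodes() {
    # Expand the hostlist expression into one node name per line.
    "$SCONTROL" show hostnames "$1" | while read -r node; do
        # Clear the DOWN state so the node can be scheduled (and powered
        # up) again on a later attempt.
        "$SCONTROL" update NodeName="$node" State=RESUME
    done
}

# Only act when a hostlist argument was supplied.
if [ "$#" -ge 1 ]; then
    resume_failed_nodes "$1"
fi
```

<p>The script would be referenced from slurm.conf through
<code>ResumeFailProgram</code> with its absolute path. Whether
resuming unconditionally is wise depends on the setup: if the cloud
provider itself is unavailable, the same node may fail and be resumed
repeatedly.</p>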
<h1>Additional Information</h1>
<p>In the <code>POWER_UP</code> script, I terminate the
server if the setup fails for any reason and return a nonzero
exit code.</p>
<p>In our <a
href="https://slurm.schedmd.com/elastic_computing.html"
rel="nofollow noreferrer">Cloud Scheduling</a> setup, instances are
created once they are needed and deleted once they are no longer
needed. This means that slurm records a node as <code>DOWN</code>
while no real instance exists behind it anymore. If that node
were not marked <code>DOWN</code> and a job were
scheduled to it at a later time, slurm would simply start a new
instance and run the job there. I am stating this just to
be as explicit as possible.</p>
<p>Best regards,<br>
Xaver Stiensmeier</p>
<p>PS: This is the first time I have used the slurm-users list, and
I hope I am not violating any rules with this question. Please let me
know if I am.<br>
</p>
</div>
</body>
</html>