<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Xavier,</p>
<p>You want to use the ResumeFailProgram option.</p>
<p>We run a full cloud cluster, and that is where we deal with things
like this. ResumeFailProgram gets called if your ResumeProgram does
not result in slurmd being available on the node in a timely manner
(whatever the reason). Writing it yourself makes complete sense when
you consider the possible uses. Originally, it was intended for cases
where a node has a hardware issue and will not start; in your
ResumeFailProgram you could then send an email letting an admin know
about it.<br>
</p>
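<p>A minimal sketch of such a ResumeFailProgram, assuming a bash
environment with <code>mail</code> available (the message text and the
<code>ADMIN_EMAIL</code> address are placeholders, not anything slurm
prescribes):</p>

```shell
#!/bin/bash
# Hypothetical ResumeFailProgram sketch: slurmctld invokes this script
# with a hostlist of the nodes whose ResumeProgram did not bring slurmd
# up in time. ADMIN_EMAIL is an assumed placeholder; adjust it for your site.
ADMIN_EMAIL="root@localhost"

# Build the notification body for one failed node.
notify_body() {
    echo "Node $1 failed to power up; please investigate."
}

# Expand the hostlist (e.g. "node[01-03]") and mail the admin per node.
notify_failed() {
    for node in $(scontrol show hostnames "$1" 2>/dev/null); do
        notify_body "$node" | mail -s "Slurm resume failure: $node" "$ADMIN_EMAIL"
    done
}

notify_failed "$1"
```

<p>Point slurm.conf at it with
<code>ResumeFailProgram=/path/to/script</code> and make sure it is
executable by SlurmUser.</p>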
<p>For me, I completely delete the node's resources and reset/recreate
it. That addresses even a botched software change.</p>
<p>Brian Andrus<br>
</p>
<div class="moz-cite-prefix">On 11/23/2022 5:11 AM, Xaver
Stiensmeier wrote:<br>
</div>
<blockquote type="cite"
cite="mid:ff743cc1-042a-ff4d-e5ba-3fd0d8ffef5b@gmx.de">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div class="s-prose js-post-body" itemprop="text"> Hello
slurm-users,</div>
<div class="s-prose js-post-body" itemprop="text">The question can
be found in a similar fashion here:
<a class="moz-txt-link-freetext"
href="https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a-cloud-scheduling-system"
moz-do-not-send="true">https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a-cloud-scheduling-system</a><br>
<h1>Issue</h1>
<h2>Current behavior and problem description</h2>
<p>When a node fails to <code>POWER_UP</code>, it is marked <code>DOWN</code>.
While this is a great idea in general, it is not useful when
working with <code>CLOUD</code> nodes, because said <code>CLOUD</code>
node is likely to be started on a different machine next time and
therefore to <code>POWER_UP</code> without issues. But since
the node is marked <code>DOWN</code>, that cloud resource is no
longer used and is never started again until it is freed manually.</p>
<h2>Wanted behavior</h2>
<p>Ideally, slurm would not mark the node as <code>DOWN</code>
but would simply attempt to start another one. If that is not
possible, automatically resuming <code>DOWN</code> nodes would
also be an option.</p>
<h2>Question</h2>
<p>How can I prevent slurm from marking nodes that fail to <code>POWER_UP</code>
as <code>DOWN</code>, or make slurm restore <code>DOWN</code>
nodes automatically, so that slurm does not forget cloud
resources?</p>
<h1>Attempts and Thoughts</h1>
<h2>ReturnToService</h2>
<p>I tried solving this using <a
href="https://slurm.schedmd.com/slurm.conf.html#OPT_ReturnToService"
rel="nofollow noreferrer" moz-do-not-send="true"><code>ReturnToService</code></a>,
but that did not seem to solve my issue. If I understand it
correctly, it only determines whether a <code>DOWN</code> node is
returned to service once it registers again, either by starting up
on its own or by being started manually; until then, the node is
not considered when scheduling jobs, and slurm never tries to
start it itself.</p>
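<p>For reference, this is a single line in slurm.conf; the value
shown is only illustrative:</p>

```
# Illustrative slurm.conf fragment: with ReturnToService=2, a DOWN node
# is returned to service as soon as its slurmd registers with a valid
# configuration, regardless of why it was marked DOWN. The node is still
# not resumed by slurmctld itself.
ReturnToService=2
```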
<h2>SlurmctldParameters=idle_on_node_suspend</h2>
<p>While this is great and definitely helpful, it does not solve
the issue at hand, since a node that failed during power-up is
not suspended.</p>
<h2>ResumeFailProgram</h2>
<p>I considered using <a
href="https://slurm.schedmd.com/slurm.conf.html#OPT_ResumeFailProgram"
rel="nofollow noreferrer" moz-do-not-send="true"><code>ResumeFailProgram</code></a>,
but it seems odd that you have to write a script yourself to
return your nodes to service if they fail on startup. This case
sounds too common not to be implemented in slurm. However, this
will be my next attempt: implement a script that calls, for every
given node,</p>
<blockquote>
<p><code>sudo scontrol update NodeName=$NODE_NAME state=RESUME
reason=FailedShutdown</code></p>
</blockquote>
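<p>A minimal sketch of that script, assuming it is installed as the
<code>ResumeFailProgram</code> (the <code>Reason</code> string and
the dry-run hook are illustrative, not required by slurm; since
slurmctld invokes the program itself, the <code>sudo</code> from the
command above should not be needed):</p>

```shell
#!/bin/bash
# Sketch of the idea above: slurmctld passes the failed nodes as a
# hostlist in $1; each one is returned to service so the cloud
# resource is not lost.
resume_node() {
    # The Reason string is free-form; "FailedPowerUp" is only an example.
    # DRY_RUN is a testing hook: when set, print the command instead of
    # running it.
    ${DRY_RUN:+echo} scontrol update NodeName="$1" State=RESUME Reason="FailedPowerUp"
}

# Expand the hostlist (e.g. "node[01-03]") into individual node names.
for node in $(scontrol show hostnames "$1" 2>/dev/null); do
    resume_node "$node"
done
```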
<h1>Additional Information</h1>
<p>In the <code>POWER_UP</code> script I terminate the server if
the setup fails for any reason and return a non-zero exit
code.</p>
<p>In our <a
href="https://slurm.schedmd.com/elastic_computing.html"
rel="nofollow noreferrer" moz-do-not-send="true">Cloud
Scheduling</a> setup, instances are created once they are needed
and deleted once they are no longer needed. This means that
slurm records a node as <code>DOWN</code> while no real
instance exists behind it anymore. If that node were not
marked <code>DOWN</code> and a job were scheduled to it at a
later time, it would simply start a new instance and run on
that new instance. I am just stating this to be maximally
explicit.</p>
<p>Best regards,<br>
Xaver Stiensmeier</p>
<p>PS: This is the first time I am using the slurm-users list, and
I hope I am not violating any rules with this question. Please
let me know if I am.<br>
</p>
</div>
</blockquote>
</body>
</html>