<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Xaver,</p>

    <p>You want to use the ResumeFailProgram script.</p>

    <p>We run a fully cloud-based cluster, and that is where we handle
      situations like this. The script is called whenever your ResumeProgram
      does not result in slurmd being available on the node in a timely manner
      (that is, within ResumeTimeout), whatever the reason. Having to write it
      yourself makes sense once you consider how it is used. Originally, it
      would have been called because a node had a hardware issue and would
      not start; in the ResumeFailProgram you could, for example, send an
      email to let an admin know about it. <br>

    </p>

    <p>In my case, I completely delete the node's resources and
      reset/recreate it. That covers even a botched software change.</p>
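    <p>As a rough sketch of that approach (not our actual script), a
      ResumeFailProgram might look like the following. It relies on slurmctld
      passing the failed nodes as one hostlist expression in the first
      argument. <code>delete_instance</code> is a hypothetical placeholder for
      whatever your cloud provider needs, the <code>SCONTROL</code> override
      exists only to allow dry runs, and whether <code>State=RESUME</code>
      alone is enough may depend on your Slurm version.</p>

```shell
#!/usr/bin/env bash
# Sketch of a ResumeFailProgram. slurmctld passes the failed nodes as a
# single hostlist expression in $1, e.g. "cloud[01-03]".

# Command used to talk to Slurm; overridable so the script can be dry-run.
SCONTROL="${SCONTROL:-scontrol}"

# HYPOTHETICAL placeholder: replace with your cloud provider's delete call.
delete_instance() {
    echo "would delete cloud instance for $1"
}

# Clean up one node and hand it back to the scheduler.
recover_node() {
    local node="$1"
    delete_instance "$node" || return 1
    # Clear the DOWN state so the node can be scheduled and powered up again.
    "$SCONTROL" update NodeName="$node" State=RESUME
}

# Expand the hostlist and recover each node; fail if any node could not
# be recovered.
main() {
    local failed=0 node
    for node in $("$SCONTROL" show hostnames "$1"); do
        recover_node "$node" || failed=$((failed + 1))
    done
    [ "$failed" -eq 0 ]
}

# Only run when called with a hostlist argument.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

    <p>The script would be registered through <code>ResumeFailProgram</code>
      in slurm.conf. Sending the admin email mentioned above could be added to
      <code>recover_node</code> if you also want a notification.</p>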

    <p>Brian Andrus<br>

    </p>

    <div class="moz-cite-prefix">On 11/23/2022 5:11 AM, Xaver

      Stiensmeier wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:ff743cc1-042a-ff4d-e5ba-3fd0d8ffef5b@gmx.de">


      <div class="s-prose js-post-body" itemprop="text"> Hello

        slurm-users,</div>

      <div class="s-prose js-post-body" itemprop="text">The question can

        be found in a similar fashion here:

        <a class="moz-txt-link-freetext"

href="https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a-cloud-scheduling-system"

          moz-do-not-send="true">https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a-cloud-scheduling-system</a><br>

        <h1>Issue</h1>

        <h2>Current behavior and problem description</h2>

        <p>When a node fails to <code>POWER_UP</code>, it is marked <code>DOWN</code>.

          While this is a great idea in general, this is not useful when

          working with <code>CLOUD</code> nodes, because said <code>CLOUD</code>

          node is likely to be started on a different machine and

          therefore to <code>POWER_UP</code> without issues. But since

          the node is marked as down, that cloud resource is no longer

          used and never started again until freed manually.</p>

        <h2>Wanted behavior</h2>

        <p>Ideally slurm would not mark the node as <code>DOWN</code>,

          but just attempt to start another. If that's not possible,

          automatically resuming <code>DOWN</code> nodes would also be

          an option.</p>

        <h2>Question</h2>

        <p>How can I prevent slurm from marking nodes that fail to <code>POWER_UP</code>

          as <code>DOWN</code> or make slurm restore <code>DOWN</code>

          nodes automatically to prevent slurm from forgetting cloud

          resources?</p>

        <h1>Attempts and Thoughts</h1>

        <h2>ReturnToService</h2>

        <p>I tried solving this using <a

            href="https://slurm.schedmd.com/slurm.conf.html#OPT_ReturnToService"

            rel="nofollow noreferrer" moz-do-not-send="true"><code>ReturnToService</code></a>

          but that didn't seem to solve my issue. If I understand it
          correctly, it only lets nodes return to service once they start up
          on their own or are started manually, and does not consider them
          when scheduling jobs until they have been started.</p>

        <h2>SlurmctldParameters=idle_on_node_suspend</h2>

        <p>While this is great and definitely helpful, it doesn't solve
          the issue at hand, since a node that failed during power-up is
          not suspended.</p>

        <h2>ResumeFailProgram</h2>

        <p>I considered using <a

            href="https://slurm.schedmd.com/slurm.conf.html#OPT_ResumeFailProgram"

            rel="nofollow noreferrer" moz-do-not-send="true"><code>ResumeFailProgram</code></a>,

          but it seems odd that you have to write a script yourself to
          return your nodes to service when they fail on startup. This
          case seems too common not to be handled by Slurm itself. However,
          this will be my next attempt: implement a script that calls the
          following for every given node:</p>

        <blockquote>

          <p><code>sudo scontrol update NodeName=$NODE_NAME state=RESUME reason=FailedShutdown</code></p>

        </blockquote>

        <h1>Additional Information</h1>

        <p>In the <code>POWER_UP</code> script, I terminate the
          server if the setup fails for any reason and return a nonzero
          exit code.</p>
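        <p>To make that behavior concrete, a resume script along these
          lines could be sketched as follows. This is an illustration only:
          <code>create_instance</code>, <code>configure_instance</code>, and
          <code>delete_instance</code> are hypothetical placeholders for the
          provider-specific steps, and the <code>SCONTROL</code> override is
          there only for dry runs.</p>

```shell
#!/usr/bin/env bash
# Sketch of a ResumeProgram that never leaves a half-built instance behind.
# slurmctld passes the nodes to start as a hostlist expression in $1.

# Command used to talk to Slurm; overridable so the script can be dry-run.
SCONTROL="${SCONTROL:-scontrol}"

# HYPOTHETICAL placeholders for the cloud provider and provisioning steps.
create_instance() { echo "creating $1"; }
configure_instance() { echo "configuring $1"; }
delete_instance() { echo "deleting $1"; }

# Start one node; on any failure, delete the instance and report an error.
start_node() {
    local node="$1"
    if ! create_instance "$node" || ! configure_instance "$node"; then
        delete_instance "$node"
        return 1
    fi
    return 0
}

# Expand the hostlist, start every node, and exit nonzero if any failed.
main() {
    local status=0 node
    for node in $("$SCONTROL" show hostnames "$1"); do
        start_node "$node" || status=1
    done
    return "$status"
}

# Only run when called with a hostlist argument.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```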

        <p>In our <a

            href="https://slurm.schedmd.com/elastic_computing.html"

            rel="nofollow noreferrer" moz-do-not-send="true">Cloud

            Scheduling</a> instances are created once they are needed

          and deleted once they are no longer needed. This means that

          slurm stores that a node is <code>DOWN</code> while no real

          instance behind it exists anymore. If that node were not
          marked <code>DOWN</code> and a job were scheduled to
          it at a later time, Slurm would simply start a new instance and
          run the job there. I am stating this just to be as explicit as
          possible.</p>

        <p>Best regards,<br>

          Xaver Stiensmeier</p>

        <p>PS: This is my first time using the slurm-users list, and I
          hope I am not violating any rules with this question. Please
          let me know if I am.<br>

        </p>

      </div>

    </blockquote>

  </body>

</html>