<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>


    <p>So, an example of using Slurm to reboot all nodes, 3 at a time:<br>

    </p>

    <p>    sinfo -h -o %n | xargs -I{} --max-procs=3 scontrol reboot {}</p>

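    <p>If you want to get fancy, make a script that does the reboot and
      waits for the node to be back up before exiting, and use that
      instead of the 'scontrol reboot' part. 'scontrol reboot' returns
      as soon as the request is queued, so without the wait xargs only
      limits how many requests are in flight, not how many nodes are
      down at once.</p>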
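    <p>A wrapper script along those lines could look roughly like the
      sketch below. Treat it as an illustration rather than a tested
      tool: the file name reboot_and_wait.sh, the POLL_SECONDS variable
      and both function names are made up for this example, and it
      assumes that sinfo's compact state (%t) marks a pending reboot
      with '@' and an issued reboot with '^', as recent Slurm versions
      document. Check what your version prints before relying on it.</p>
    <pre>
```shell
#!/bin/sh
# reboot_and_wait.sh NODE  (hypothetical helper, not shipped with Slurm)
# Ask Slurm to reboot NODE, then block until the reboot has completed.

# Seconds between state checks; an assumed default, override via environment.
POLL_SECONDS="${POLL_SECONDS:-30}"

# Print the compact state of one node, e.g. "idle", "mixed@", "down*".
node_state() {
    sinfo -h -N -n "$1" -o %t | head -n 1
}

reboot_and_wait() {
    node="$1"
    scontrol reboot "$node" || return 1
    while true; do
        state="$(node_state "$node")"
        case "$state" in
            *@*|*^*) sleep "$POLL_SECONDS" ;;  # reboot pending (@) or issued (^)
            ""|down*) return 1 ;;              # lookup failed or node stayed down
            *) return 0 ;;                     # reboot flags cleared: node is back
        esac
    done
}

# Run only when a node name is given, so the file can also be sourced.
if [ "$#" -ge 1 ]; then
    reboot_and_wait "$1"
fi
```
    </pre>
    <p>Saved as an executable file, it would take the place of
      'scontrol reboot' in the pipeline: sinfo -h -o %n | xargs -I{}
      --max-procs=3 ./reboot_and_wait.sh {}</p>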

    <p>Brian Andrus<br>

    </p>

    <div class="moz-cite-prefix">On 8/3/2022 11:47 AM, Benjamin Arntzen

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:A3EFA1B7-4B25-8349-A998-AF50A2BA2BAB@hxcore.ol">


      <div style="color: rgb(33, 33, 33); background-color: rgb(255,

        255, 255);" dir="auto">At risk of being a heretic, why not

        something like Ansible to handle this? Slurm "should" be able to

        do it but feels like a bit of a weird fit for the job.</div>

      <div id="mail-editor-reference-message-container" dir="auto"><br>

        <hr style="display:inline-block;width:98%" tabindex="-1">

        <div id="divRplyFwdMsg" style="font-size: 11pt;"><strong>From:</strong>

          slurm-users <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users-bounces@lists.schedmd.com">&lt;slurm-users-bounces@lists.schedmd.com&gt;</a> on

          behalf of Phil Chiu <a class="moz-txt-link-rfc2396E" href="mailto:whophilchiu@gmail.com">&lt;whophilchiu@gmail.com&gt;</a><br>

          <strong>Sent:</strong> Wednesday, 3 August 2022, 5:51 pm<br>

          <strong>To:</strong> <a class="moz-txt-link-abbreviated" href="mailto:slurm-users@schedmd.com">slurm-users@schedmd.com</a>

          <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@schedmd.com">&lt;slurm-users@schedmd.com&gt;</a><br>

          <strong>Subject:</strong> [slurm-users] Rolling reboot with at

          most N machines down simultaneously?<br>

        </div>

        <br>

        <div dir="ltr">Occasionally I need to reboot all the compute nodes in

          my system. However, I have a parallel file system which is <i>converged</i>,

          i.e., each compute node contributes a disk to the file system.

          The file system can tolerate having N nodes down

          simultaneously.

          <div><br>

          </div>

          <div>Therefore my problem is this - "Reboot all nodes,

            permitting N nodes to be rebooting simultaneously."</div>

          <div><br>

          </div>

          <div>I have thought about the following options:</div>

          <div>

            <ul>

              <li>A mass scontrol reboot - It doesn't seem like there is

                the ability to control how many nodes are being rebooted

                at once.</li>

              <li>A job array - Job arrays can be easily configured to

                allow at most N jobs to be running simultaneously.

                However, I would need each array task to execute on a

                specific node, which does not appear to be possible.</li>

              <li>Individual slurm jobs which reboot nodes - With a for

                loop, I could submit a reboot job for each node. But I'm

                not sure how to limit this so at most N jobs are running

                simultaneously. Perhaps a special partition is needed

                for this?</li>

            </ul>

            <div>Open to hearing any other ideas.</div>

            <div><br>

            </div>

            <div>Thanks!</div>

          </div>

          <div>

            <div dir="ltr" class="gmail_signature">Phil</div>

          </div>

        </div>

        <br>

      </div>

    </blockquote>

  </body>

</html>