<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Here is an example of using Slurm to reboot all nodes, 3 at a time:<br>
</p>
<p> sinfo -h -o %n | xargs -I {} --max-procs=3 scontrol reboot {}</p>
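<p>Note that 'scontrol reboot' returns as soon as the reboot request
  is queued, so on its own it does not hold a slot until the node is
  back. If you want to get fancy, make a script that does the reboot
  and waits for the node to be back up before exiting, and use that
  instead of the 'scontrol reboot' part.</p>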
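<p>A minimal sketch of such a wrapper, not a tested production
  script. It assumes that sinfo marks a node with a trailing '@' while
  a reboot is pending and '^' once the reboot has been issued (as in
  recent Slurm releases), and that the node is back in service once
  neither suffix is shown. The names POLL_SECONDS, node_state,
  is_rebooting and reboot_and_wait are made up for this example, and
  there is no timeout, so a node that never comes back will block its
  slot.</p>

```shell
#!/bin/sh
# Sketch: reboot one node through Slurm and block until the reboot flag clears.
# POLL_SECONDS is a hypothetical knob; the state suffixes are an assumption
# based on the node state codes documented for sinfo.

POLL_SECONDS="${POLL_SECONDS:-30}"

# Print the compact state of a node, e.g. "idle", "mix@" or "idle^".
# A node in several partitions is listed once per partition; keep one line.
node_state() {
    sinfo -h -n "$1" -o %t | head -n 1
}

# Succeed while the state string still carries a reboot marker.
is_rebooting() {
    case "$1" in
        *@|*^) return 0 ;;   # '@' reboot pending, '^' reboot issued
        *)     return 1 ;;
    esac
}

# Ask Slurm to reboot the node, then poll until the marker is gone.
reboot_and_wait() {
    scontrol reboot "$1" || return 1
    while is_rebooting "$(node_state "$1")"; do
        sleep "$POLL_SECONDS"
    done
}

# When called with a node name, do the work; with no arguments, only
# define the functions.
if [ "$#" -gt 0 ]; then
    reboot_and_wait "$1"
fi
```

<p>Saved as, say, reboot_and_wait.sh (a hypothetical name), it would
  slot into the pipeline like this: sinfo -h -o %n | xargs -I {}
  --max-procs=3 ./reboot_and_wait.sh {}</p>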
<p>Brian Andrus<br>
</p>
<div class="moz-cite-prefix">On 8/3/2022 11:47 AM, Benjamin Arntzen
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:A3EFA1B7-4B25-8349-A998-AF50A2BA2BAB@hxcore.ol">
<div style="color: rgb(33, 33, 33); background-color: rgb(255,
255, 255);" dir="auto">At risk of being a heretic, why not
        something like Ansible to handle this? Slurm "should" be able to
        do it, but it feels like a bit of a weird fit for the job.</div>
<div id="mail-editor-reference-message-container" dir="auto"><br>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" style="font-size: 11pt;"><strong>From:</strong>
          slurm-users <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users-bounces@lists.schedmd.com">&lt;slurm-users-bounces@lists.schedmd.com&gt;</a> on
          behalf of Phil Chiu <a class="moz-txt-link-rfc2396E" href="mailto:whophilchiu@gmail.com">&lt;whophilchiu@gmail.com&gt;</a><br>
<strong>Sent:</strong> Wednesday, 3 August 2022, 5:51 pm<br>
<strong>To:</strong> <a class="moz-txt-link-abbreviated" href="mailto:slurm-users@schedmd.com">slurm-users@schedmd.com</a>
          <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@schedmd.com">&lt;slurm-users@schedmd.com&gt;</a><br>
<strong>Subject:</strong> [slurm-users] Rolling reboot with at
most N machines down simultaneously?<br>
</div>
<br>
        <div dir="ltr">Occasionally I need to reboot all the compute nodes in
my system. However, I have a parallel file system which is <i>converged</i>,
i.e., each compute node contributes a disk to the file system.
The file system can tolerate having N nodes down
simultaneously.
<div><br>
</div>
<div>Therefore my problem is this - "Reboot all nodes,
permitting N nodes to be rebooting simultaneously."</div>
<div><br>
</div>
          <div>I have thought about the following options:</div>
<div>
<ul>
<li>A mass scontrol reboot - It doesn't seem like there is
the ability to control how many nodes are being rebooted
at once.</li>
<li>A job array - Job arrays can be easily configured to
allow at most N jobs to be running simultaneously.
However, I would need each array task to execute on a
specific node, which does not appear to be possible.</li>
<li>Individual slurm jobs which reboot nodes - With a for
loop, I could submit a reboot job for each node. But I'm
not sure how to limit this so at most N jobs are running
simultaneously. Perhaps a special partition is needed
for this?</li>
</ul>
<div>Open to hearing any other ideas.</div>
<div><br>
</div>
<div>Thanks!</div>
</div>
<div>
<div dir="ltr" class="gmail_signature">Phil</div>
</div>
</div>
<br>
</div>
</blockquote>
</body>
</html>