<html><body><div style="color: rgb(33, 33, 33); background-color: rgb(255, 255, 255);" dir="auto">At the risk of being a heretic, why not use something like Ansible to handle this? Slurm "should" be able to do it, but it feels like a bit of a weird fit for the job.</div><div id="mail-editor-reference-message-container" dir="auto"><br><hr style="display:inline-block;width:98%" tabindex="-1"><div id="divRplyFwdMsg" style="font-size: 11pt;"><strong>From:</strong> slurm-users &lt;slurm-users-bounces@lists.schedmd.com&gt; on behalf of Phil Chiu &lt;whophilchiu@gmail.com&gt;<br><strong>Sent:</strong> Wednesday, 3 August 2022, 5:51 pm<br><strong>To:</strong> slurm-users@schedmd.com &lt;slurm-users@schedmd.com&gt;<br><strong>Subject:</strong> [slurm-users] Rolling reboot with at most N machines down simultaneously?<br></div><br><div dir="ltr">Occasionally I need to reboot all the compute nodes in my system. However, I have a parallel file system which is <i>converged</i>, i.e., each compute node contributes a disk to the file system. The file system can tolerate at most N nodes being down simultaneously.<div><br></div><div>Therefore my problem is this - "Reboot all nodes, permitting at most N nodes to be rebooting simultaneously."</div><div><br></div><div>I have thought about the following options:</div><div><ul><li>A mass scontrol reboot - It doesn't seem like there is the ability to control how many nodes are being rebooted at once.</li><li>A job array - Job arrays can be easily configured to allow at most N jobs to be running simultaneously. However, I would need each array task to execute on a specific node, which does not appear to be possible.</li><li>Individual Slurm jobs which reboot nodes - With a for loop, I could submit a reboot job for each node. But I'm not sure how to limit this so at most N jobs are running simultaneously. 
Perhaps a special partition is needed for this?</li></ul><div>Open to hearing any other ideas.</div><div><br></div><div>Thanks!</div><div></div></div><div><div dir="ltr" class="gmail_signature">Phil</div></div></div>
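<div dir="auto"><br>For what it's worth, the requirement in the message above (reboot every node, with at most N rebooting at once) can also be sketched outside Slurm as a bounded worker pool. This is only a rough sketch, not a tested procedure: <code>rolling_reboot</code>, <code>max_down</code> and <code>reboot_and_wait</code> are hypothetical names, and <code>reboot_and_wait</code> stands for a site-specific callable that is assumed to block until the node has rebooted and rejoined the file system. If it returns earlier, the limit does not hold.</div>
<pre>
```python
import threading
from concurrent.futures import ThreadPoolExecutor


def rolling_reboot(nodes, max_down, reboot_and_wait):
    """Run reboot_and_wait(node) for each node, at most max_down at a time.

    reboot_and_wait is a hypothetical, site-specific callable (an assumption
    of this sketch). It must block until the node is healthy again.
    Returns (done, failed) lists of node names.
    """
    if 1 > max_down:
        raise ValueError("max_down must be at least 1")

    stop = threading.Event()   # set as soon as any reboot fails
    lock = threading.Lock()    # protects the two result lists
    done, failed = [], []

    def worker(node):
        if stop.is_set():
            return  # do not take further nodes down after a failure
        try:
            reboot_and_wait(node)
        except Exception:
            stop.set()
            with lock:
                failed.append(node)
        else:
            with lock:
                done.append(node)

    # The pool size is what caps the number of nodes rebooting at once.
    with ThreadPoolExecutor(max_workers=max_down) as pool:
        list(pool.map(worker, nodes))
    return done, failed
```
</pre>
<div dir="auto">The pool keeps N reboots in flight rather than waiting for a whole batch to finish, and it stops handing out new nodes once one reboot fails, since a node that never comes back already uses up part of the N the file system can tolerate. What goes inside <code>reboot_and_wait</code> (for instance a wrapper around <code>scontrol reboot</code> plus a file system health check) is left open here because it depends on the site.</div>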
<br></div></body></html>