<div dir="ltr">Occasionally I need to reboot all the compute nodes in my system. However, I have a parallel file system which is <i>converged</i>, i.e., each compute node contributes a disk to the file system. The file system can tolerate having N nodes down simultaneously.<div><br></div><div>Therefore, my problem is this: "Reboot all nodes, permitting at most N nodes to be rebooting simultaneously."</div><div><br></div><div>I have thought about the following options:</div><div><ul><li>A mass scontrol reboot - There doesn't seem to be a way to control how many nodes are rebooted at once.</li><li>A job array - Job arrays can easily be configured to allow at most N tasks to run simultaneously. However, I would need each array task to execute on a specific node, which does not appear to be possible.</li><li>Individual Slurm jobs which reboot nodes - With a for loop, I could submit a reboot job for each node. But I'm not sure how to limit this so that at most N jobs are running simultaneously. Perhaps a special partition is needed for this?</li></ul><div>I'm open to hearing any other ideas.</div><div><br></div><div>Thanks!</div><div></div></div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">Phil</div></div></div>