[slurm-users] Rolling reboot with at most N machines down simultaneously?

Brian Andrus toomuchit at gmail.com
Wed Aug 3 21:20:40 UTC 2022

So an example of using slurm to reboot all nodes 3 at a time:

     sinfo -h -o %n | xargs -I{} --max-procs=3 scontrol reboot {}

If you want to get fancy, make a script that does the reboot and waits 
for the node to be back up before exiting and use that instead of the 
'scontrol reboot' part.
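
A minimal sketch of such a wrapper, not a tested implementation. The script 
name reboot_and_wait.sh, the helper names, and the 30-second default poll 
interval are all hypothetical. It assumes that sinfo's extended state (%T) 
carries a flag suffix such as '@' or '^' while a reboot is pending or in 
progress, so only a plain idle/mixed/allocated state is treated as "back":

```shell
#!/bin/sh
# Sketch: reboot one node through Slurm and block until it is usable again.

# Print the node's current state as reported by sinfo (e.g. "idle", "idle@", "down*").
node_state() {
    sinfo -h -n "$1" -o %T | head -n 1
}

# Succeed only for a plain healthy state. Assumption: a trailing flag such as
# '@' (reboot pending), '^' (reboot issued) or '*' (not responding) means the
# node is not back yet, so only exact matches count.
is_back() {
    case "$1" in
        idle|mixed|allocated) return 0 ;;
        *) return 1 ;;
    esac
}

# reboot_and_wait NODE [POLL_SECONDS]
reboot_and_wait() {
    node=$1
    interval=${2:-30}
    # nextstate=RESUME asks slurmctld to return the node to service after boot.
    scontrol reboot nextstate=RESUME "$node" || return 1
    until is_back "$(node_state "$node")"; do
        sleep "$interval"
    done
}

# Run only when a node name is given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    reboot_and_wait "$@"
fi
```

Saved as an executable file, it would slot into the pipeline like this:

     sinfo -h -o %n | xargs -I{} --max-procs=3 ./reboot_and_wait.sh {}

Note that the sketch has no timeout, so a node that never comes back would 
hold one of the three slots indefinitely.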

Brian Andrus

On 8/3/2022 11:47 AM, Benjamin Arntzen wrote:
> At risk of being a heretic, why not something like Ansible to handle 
> this? Slurm "should" be able to do it but feels like a bit of a weird 
> fit for the job.
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf 
> of Phil Chiu <whophilchiu at gmail.com>
> *Sent:* Wednesday, 3 August 2022, 5:51 pm
> *To:* slurm-users at schedmd.com <slurm-users at schedmd.com>
> *Subject:* [slurm-users] Rolling reboot with at most N machines down 
> simultaneously?
> Occasionally I need to reboot all the compute nodes in my system. However, I 
> have a parallel file system which is /converged/, i.e., each compute 
> node contributes a disk to the file system. The file system can 
> tolerate having N nodes down simultaneously.
> Therefore my problem is this - "Reboot all nodes, permitting N nodes 
> to be rebooting simultaneously."
> I have thought about the following options
>   * A mass scontrol reboot - It doesn't seem like there is the ability
>     to control how many nodes are being rebooted at once.
>   * A job array - Job arrays can be easily configured to allow at most
>     N jobs to be running simultaneously. However, I would need each
>     array task to execute on a specific node, which does not appear to
>     be possible.
>   * Individual slurm jobs which reboot nodes - With a for loop, I
>     could submit a reboot job for each node. But I'm not sure how to
>     limit this so at most N jobs are running simultaneously. Perhaps a
>     special partition is needed for this?
> Open to hearing any other ideas.
> Thanks!
> Phil