[slurm-users] Rolling reboot with at most N machines down simultaneously?
Phil Chiu
whophilchiu at gmail.com
Wed Aug 3 15:37:19 UTC 2022
Occasionally I need to reboot all the compute nodes in my system. However, I have
a parallel file system which is *converged*, i.e., each compute node
contributes a disk to the file system. The file system can tolerate having
N nodes down simultaneously.
Therefore my problem is this - "Reboot all nodes, permitting N nodes to be
rebooting simultaneously."
I have thought about the following options:
- A mass scontrol reboot - It doesn't seem like there is the ability to
control how many nodes are being rebooted at once.
- A job array - Job arrays can be easily configured to allow at most N
jobs to be running simultaneously. However, I would need each array task to
execute on a specific node, which does not appear to be possible.
- Individual slurm jobs which reboot nodes - With a for loop, I could
submit a reboot job for each node. But I'm not sure how to limit this so at
most N jobs are running simultaneously. Perhaps a special partition is
needed for this?
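
A further possibility is to drive the reboots from a small script outside the
scheduler, working through the node list in groups of at most N. The sketch
below is only an illustration under several assumptions: the function names
(batches, scontrol_reboot, rolling_reboot) are made up for this example,
RebootProgram is assumed to be configured in slurm.conf so that
"scontrol reboot" works, and is_back is a site-specific check that you would
have to supply yourself (ideally one that confirms the node's disk has
rejoined the file system, not just that slurmd is responding).

```python
import subprocess
import time
from typing import Callable, List, Sequence


def batches(nodes: Sequence[str], n: int) -> List[List[str]]:
    """Split the node list into consecutive groups of at most n nodes."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [list(nodes[i:i + n]) for i in range(0, len(nodes), n)]


def scontrol_reboot(group: Sequence[str]) -> None:
    """Ask Slurm to reboot one group of nodes.

    Assumption: RebootProgram is set in slurm.conf. ASAP drains each node so
    it reboots once its running jobs finish; nextstate=RESUME returns it to
    service afterwards.
    """
    subprocess.run(
        ["scontrol", "reboot", "ASAP", "nextstate=RESUME", ",".join(group)],
        check=True,
    )


def rolling_reboot(
    nodes: Sequence[str],
    n: int,
    is_back: Callable[[str], bool],
    reboot: Callable[[Sequence[str]], None] = scontrol_reboot,
    poll_seconds: float = 30.0,
) -> None:
    """Reboot all nodes, with at most n of them rebooting at any one time.

    is_back(node) is a hypothetical, site-specific health check. It must
    return True only after the node has actually rebooted and is healthy
    again, not merely because the reboot has not started yet.
    """
    for group in batches(nodes, n):
        reboot(group)
        pending = set(group)
        # Do not touch the next group until every node in this one is back.
        while pending:
            time.sleep(poll_seconds)
            pending = {node for node in pending if not is_back(node)}
```

Something like rolling_reboot(node_names, N, is_back=my_health_check) would
then walk the whole cluster. Waiting for a full group before starting the
next one is the simplest way to respect the limit; it is slower than a
sliding window that starts a new node as soon as any one returns, but it is
easier to reason about.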
Open to hearing any other ideas.
Thanks!
Phil