[slurm-users] Rolling reboot with at most N machines down simultaneously?
chris at csamuel.org
Thu Aug 4 05:08:16 UTC 2022
On 8/3/22 8:37 am, Phil Chiu wrote:
> Therefore my problem is this - "Reboot all nodes, permitting N nodes to
> be rebooting simultaneously."
I think currently the only way to do that would be to have a script that:
* issues `scontrol reboot ASAP nextstate=resume [...]` for the first N nodes (3, say)
* waits for 1 to come back online
* issues an `scontrol reboot` for another node
* waits for 1 more to come back
* lather, rinse, repeat until every node has been rebooted.
This does assume you've got your nodes configured to come back cleanly
on a reboot with slurmd up and no manual intervention required (which is
what we do).
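The loop described above could be sketched roughly as follows. This is a minimal illustration, not a tested production tool. The function names (`rolling_reboot`, `boot_time`, `scontrol_reboot`, `scontrol_is_back`) are made up for this example. Treating a changed `BootTime` in `scontrol show node` as "the node is back" is an assumption. Sites may prefer to check the node state as well.

```python
import subprocess
import time
from collections import deque


def rolling_reboot(nodes, max_down, reboot, is_back,
                   poll_seconds=30, sleep=time.sleep):
    """Reboot every node, keeping at most max_down reboots in flight."""
    if max_down < 1:
        raise ValueError("max_down must be at least 1")
    pending = deque(nodes)
    in_flight = set()
    while pending or in_flight:
        # Issue reboots until the limit is reached or nothing is left.
        while pending and len(in_flight) < max_down:
            node = pending.popleft()
            reboot(node)
            in_flight.add(node)
        # Free a slot for every node that has come back.
        finished = {n for n in in_flight if is_back(n)}
        in_flight -= finished
        if not finished:
            sleep(poll_seconds)


def boot_time(node):
    """Return the BootTime field of `scontrol show node`, or None."""
    out = subprocess.run(["scontrol", "show", "node", node],
                         capture_output=True, text=True, check=True).stdout
    for field in out.split():
        if field.startswith("BootTime="):
            return field.split("=", 1)[1]
    return None


# Boot time seen just before each reboot request (hypothetical helper state).
_boot_before = {}


def scontrol_reboot(node):
    """Remember the current boot time, then request the reboot."""
    _boot_before[node] = boot_time(node)
    subprocess.run(["scontrol", "reboot", "ASAP", "nextstate=resume", node],
                   check=True)


def scontrol_is_back(node):
    """Assumption: a changed BootTime means the node has rebooted."""
    now = boot_time(node)
    return now is not None and now != _boot_before[node]
```

With placeholder node names, it would be called as `rolling_reboot(["node001", "node002", "node003", "node004"], max_down=3, reboot=scontrol_reboot, is_back=scontrol_is_back)`. Passing `reboot` and `is_back` in as arguments keeps the throttling logic separate from the Slurm calls, so either can be swapped for whatever check suits the site.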
All the best,
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA