[slurm-users] Rolling reboot with at most N machines down simultaneously?

Christopher Samuel chris at csamuel.org
Thu Aug 4 05:08:16 UTC 2022


On 8/3/22 8:37 am, Phil Chiu wrote:

> Therefore my problem is this - "Reboot all nodes, permitting N nodes to 
> be rebooting simultaneously."

I think currently the only way to do that would be to have a script that 
does:

* issue `scontrol reboot ASAP nextstate=resume [...]` for the first N nodes (say 3)
* wait for 1 of them to come back online
* issue the same `scontrol reboot` for another node
* wait for 1 more to come back
* lather, rinse, repeat until every node has been rebooted.

This does assume you've got your nodes configured to come back cleanly 
on a reboot with slurmd up and no manual intervention required (which is 
what we do).
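
For illustration only, the loop described in the list above might be sketched in Python roughly as follows. This is a minimal sketch, not a tested production script. The function names (`rolling_reboot`, `issue_reboot`, `is_back_online`, `state_means_back`) are hypothetical, and the check for "back online" is an assumption: it parses the `State=` field of `scontrol show node -o` and treats any REBOOT, DOWN, DRAIN or not-responding flag as "still out". The exact flag names vary between Slurm versions, so verify them on your own cluster first.

```python
import re
import subprocess
import time
from collections import deque

# Assumption: any of these markers in the State= field means the node is
# not back in service yet. Flag names differ between Slurm versions.
NOT_BACK_MARKERS = ("REBOOT", "DOWN", "DRAIN", "NOT_RESPONDING", "*")


def issue_reboot(node):
    # Same command as in the list above, applied to one node at a time.
    subprocess.run(
        ["scontrol", "reboot", "ASAP", "nextstate=resume", node], check=True
    )


def state_means_back(show_node_output):
    # Pull the State= field out of one-line "scontrol show node -o" output.
    match = re.search(r"(?:^|\s)State=(\S+)", show_node_output)
    if match is None:
        return False
    state = match.group(1).upper()
    return not any(marker in state for marker in NOT_BACK_MARKERS)


def is_back_online(node):
    result = subprocess.run(
        ["scontrol", "show", "node", "-o", node],
        check=True, capture_output=True, text=True,
    )
    return state_means_back(result.stdout)


def rolling_reboot(nodes, max_down, reboot=issue_reboot,
                   is_back=is_back_online, poll_seconds=30, sleep=time.sleep):
    """Reboot every node, keeping at most max_down reboots in flight."""
    if max_down < 1:
        raise ValueError("max_down must be at least 1")
    pending = deque(nodes)
    in_flight = set()
    done = []
    while pending or in_flight:
        # Top up: request reboots until max_down nodes are in flight.
        while pending and len(in_flight) < max_down:
            node = pending.popleft()
            reboot(node)
            in_flight.add(node)
        # Collect the nodes that have returned to service.
        returned = {n for n in in_flight if is_back(n)}
        if returned:
            in_flight -= returned
            done.extend(sorted(returned))
        else:
            sleep(poll_seconds)
    return done
```

Calling something like `rolling_reboot(["node01", "node02", "node03", "node04"], max_down=3)` (node names made up) would then walk the whole list. Note that with `ASAP` a node counts as "in flight" from the moment the reboot is requested, including the time it spends draining, so the number of nodes actually powered off at any instant may be lower than `max_down`.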

All the best,
Chris
-- 
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



