[slurm-users] Rolling reboot with at most N machines down simultaneously?

Chris Samuel chris at csamuel.org
Fri Aug 5 06:03:49 UTC 2022

On 3/8/22 10:20 pm, Gerhard Strangar wrote:

> With a fake license called reboot?

It's a neat idea, but I think there is a catch:

* 3 jobs start, each taking 1 license
* Other reboot jobs are all blocked
* Running reboot jobs trigger node reboot
* Running reboot jobs end when either the script exits and slurmd cleans 
it up before the reboot kills it, or it gets killed as NODE_FAIL when 
the node has been unresponsive for too long and is marked as down
* Licenses for those jobs are released
* 3 more reboot jobs start whilst the original 3 are rebooting
* 6 nodes are now rebooting
* Filesystem fall down go boom
* Also your rebooted nodes are now drained as "Node unexpectedly rebooted"
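
To put rough numbers on that race, here is a minimal sketch in Python. It is a toy model, not anything Slurm provides: the function name and the assumptions that every job holds its license for a fixed `hold_time` and every node is down for a fixed `reboot_time` from the moment its job starts are purely illustrative.

```python
def peak_nodes_down(num_nodes, licenses, hold_time, reboot_time):
    """Toy model: peak number of nodes that are down at the same moment.

    Assumptions (hypothetical, for illustration only):
      * one reboot job per node, each taking 1 license
      * a job holds its license for hold_time, then the license is freed
      * the node is unavailable for reboot_time, counted from job start
      * a waiting job starts the instant a license is free
    """
    # Jobs start in batches of `licenses`, one batch every hold_time.
    starts = [(k // licenses) * hold_time for k in range(num_nodes)]
    # The overlap can only increase at a start time, so check those.
    return max(
        sum(1 for s in starts if s <= t < s + reboot_time)
        for t in starts
    )


# License freed after 2 minutes, node needs 4 minutes to come back:
print(peak_nodes_down(12, 3, 2, 4))  # 6

# License held for the whole reboot:
print(peak_nodes_down(12, 3, 4, 4))  # 3
```

In this model the license only caps the number of down nodes when it is held for at least as long as the node takes to come back.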

I guess you could change your Slurm config to not mark nodes as down if 
they stop responding, and make sure the job that's launched keeps holding 
its license until the node is back up, but that feels wrong to me.

All the best,
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

