[slurm-users] Rolling reboot with at most N machines down simultaneously?
chris at csamuel.org
Fri Aug 5 06:03:49 UTC 2022
On 3/8/22 10:20 pm, Gerhard Strangar wrote:
> With a fake license called reboot?
It's a neat idea, but I think there is a catch:
* 3 jobs start, each taking 1 license
* Other reboot jobs are all blocked
* Running reboot jobs trigger node reboot
* Running reboot jobs end when either the script exits and slurmd cleans
it up before the reboot kills it, or it gets killed as NODE_FAIL when
the node has been unresponsive for too long and is marked as down
* Licenses for those jobs are released
* 3 more reboot jobs start whilst the original 3 are rebooting
* 6 nodes are now rebooting
* Filesystem fall down go boom
* Also your rebooted nodes are now drained as "Node unexpectedly rebooted"
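To make the race in the steps above concrete, here is a toy model in Python. It is only a sketch and does not talk to Slurm at all: the function name `simulate` and its parameters are made up for illustration, and it assumes time moves in whole ticks, that a license is freed `job_ticks` after its job starts, and that the node stays down for `reboot_ticks` after the job starts.

```python
def simulate(n_nodes, licenses, job_ticks, reboot_ticks):
    """Return the peak number of nodes down at once (toy model, not Slurm)."""
    pending = n_nodes   # reboot jobs still waiting for a license
    job_ends = []       # tick at which each running job frees its license
    node_up = []        # tick at which each rebooting node is back
    peak = 0
    tick = 0
    while pending or job_ends or node_up:
        # Jobs that have ended give their license back, even if the node
        # they rebooted has not come back yet.
        job_ends = [t for t in job_ends if t > tick]
        node_up = [t for t in node_up if t > tick]
        # Start as many waiting reboot jobs as there are free licenses.
        while pending and len(job_ends) < licenses:
            pending -= 1
            job_ends.append(tick + job_ticks)
            node_up.append(tick + reboot_ticks)
        peak = max(peak, len(node_up))
        tick += 1
    return peak


# License freed after 1 tick, node down for 5 ticks: the cap of 3 is exceeded.
print(simulate(n_nodes=6, licenses=3, job_ticks=1, reboot_ticks=5))  # 6
# License held until the node is back: the cap of 3 holds.
print(simulate(n_nodes=6, licenses=3, job_ticks=5, reboot_ticks=5))  # 3
```

In this model the license only limits how many reboot jobs are running, not how many nodes are down, so it only works as a cap if the license is held for the whole reboot.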
I guess you could change your Slurm config to not mark nodes as down if
they stop responding, and make sure the job that's launched does not exit
before the reboot kills it, but that feels wrong to me.
All the best,
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA