[slurm-users] Rolling reboot with at most N machines down simultaneously?

Thu Aug 4 05:11:28 UTC 2022

On 8/3/22 11:47 am, Benjamin Arntzen wrote:

> At risk of being a heretic, why not something like Ansible to handle this?

Nothing heretical about that, but for us the reason is that `scontrol
reboot ASAP` is integrated nicely into the scheduling of jobs. We have
health checks and node epilogs that can recognise certain conditions
that require a node reboot (too many fragmented huge pages, for
instance) and can trigger that reboot automatically without it
disrupting the scheduling of large jobs.
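
As a rough illustration of the kind of call such a health check or
epilog might end up making, here is a minimal sketch. The function
names, the node name and the reason string are all hypothetical, and how
the reboot condition is detected is left out entirely; only the
`scontrol reboot ASAP nextstate=... reason=... <node>` form is standard
Slurm.

```python
import shlex
import subprocess


def build_reboot_command(node, reason, next_state="RESUME"):
    """Build the argv for an ASAP reboot request on a single node.

    `node` and `reason` are supplied by the caller (hypothetical values
    in this sketch); `next_state` is the state the node should return to
    once it has rebooted.
    """
    return [
        "scontrol", "reboot", "ASAP",
        f"nextstate={next_state}",
        f"reason={reason}",
        node,
    ]


def request_reboot(node, reason, dry_run=True):
    """Print the command (dry run) or hand it to scontrol."""
    cmd = build_reboot_command(node, reason)
    if dry_run:
        # Show what would be run without touching the cluster.
        print(shlex.join(cmd))
        return cmd
    # Raises CalledProcessError if scontrol exits non-zero.
    subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical node name and reason, for illustration only.
    request_reboot("node001", "hugepage_fragmentation")
```

The dry-run default is just a convenience for trying the sketch on a
machine with no Slurm installed; a real hook would presumably call
scontrol directly.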

What used to happen was that when a node was rebooted, Slurm would
consider it indefinitely unavailable, so it would think it couldn't
schedule a large job and would pack in smaller jobs instead, pushing
back the start time of the large job.

All the best,
Chris
-- 
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA