[slurm-users] Rolling reboot with at most N machines down simultaneously?
Christopher Samuel
chris at csamuel.org
Thu Aug 4 05:11:28 UTC 2022
On 8/3/22 11:47 am, Benjamin Arntzen wrote:
> At risk of being a heretic, why not something like Ansible to handle this?
Nothing heretical about that, but for us the reason is that `scontrol
reboot ASAP` is integrated nicely into the scheduling of jobs. We have
health checks and node epilogs that recognise certain conditions
requiring a node reboot (too many fragmented huge pages, for instance)
and can trigger one automatically without disrupting the scheduling of
large jobs.
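
A minimal sketch of what such a health check might look like, as an
illustration only and not the actual script described above. It assumes
fragmentation is judged from /proc/buddyinfo by counting free blocks at
or above the huge page order; the names HUGEPAGE_ORDER,
MIN_FREE_HUGE_BLOCKS, free_blocks_at_or_above, reboot_command and
check_node are hypothetical, as is the threshold value. The `scontrol
reboot ASAP nextstate=RESUME reason=...` form follows the scontrol man
page as I understand it; verify it against your Slurm version.

```python
import subprocess

# Assumption: 2 MiB huge pages on 4 KiB base pages (x86_64), i.e. order 9.
HUGEPAGE_ORDER = 9
# Hypothetical threshold; a real site would tune this.
MIN_FREE_HUGE_BLOCKS = 64


def free_blocks_at_or_above(buddyinfo_text, order=HUGEPAGE_ORDER):
    """Sum free blocks of size >= 2**order pages over all zones."""
    total = 0
    for line in buddyinfo_text.splitlines():
        # Lines look like: "Node 0, zone   Normal   c0 c1 ... c10"
        fields = line.split()
        if len(fields) < 5 or fields[0] != "Node":
            continue
        counts = [int(c) for c in fields[4:]]
        total += sum(counts[order:])
    return total


def reboot_command(node, reason):
    """Build the scontrol invocation as an argument list (no shell)."""
    return ["scontrol", "reboot", "ASAP", "nextstate=RESUME",
            f"reason={reason}", node]


def check_node(buddyinfo_text, node, dry_run=True):
    """Return the reboot command if the node looks fragmented, else None.

    With dry_run=False the command is also executed.
    """
    if free_blocks_at_or_above(buddyinfo_text) >= MIN_FREE_HUGE_BLOCKS:
        return None
    cmd = reboot_command(node, "fragmented huge pages")
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

In a node epilog this could be called with the contents of
/proc/buddyinfo and the node's own name. Because the request goes
through Slurm rather than an external tool, the scheduler knows the
node is only draining for a reboot and will come back.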
What used to happen was that when a node was rebooted, Slurm would
consider it indefinitely unavailable, so it would think it couldn't
schedule a large job and instead pack in smaller jobs, pushing back the
start time of the large job.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA