[slurm-users] Reservation vs. Draining for Maintenance?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Thu Aug 6 18:03:59 UTC 2020
On 06-08-2020 19:13, Jason Simms wrote:
> Later this month, I will have to bring down, patch, and reboot all nodes
> in our cluster for maintenance. The two options available to set nodes
> into a maintenance mode seem to be either: 1) creating a system-wide
> reservation, or 2) setting all nodes into a DRAIN state.
>
> I'm not sure it really matters either way, but is there any preference
> one way or the other? Any gotchas I should be aware of?
I'd recommend using a reservation because you can define a specific
maintenance period well ahead of time. Create the reservation early: at
least as far ahead of the maintenance start as the largest MaxTime of
any partition in slurm.conf, so that no running jobs remain when the
reservation begins. Jobs can then continue to run until the very last
minute!
I have some notes on reservations in
https://wiki.fysik.dtu.dk/niflheim/SLURM#resource-reservation
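As a rough sketch of the lead-time arithmetic above (not taken from the
wiki notes): the Python snippet below parses Slurm-style MaxTime strings,
finds the longest one, and subtracts it from the planned maintenance
start to get the latest moment at which the reservation should be
created. The function names, the example MaxTime values, the dates and
the reservation name are all hypothetical. The scontrol command is only
assembled as a string and never executed; its options (ReservationName,
StartTime, Duration, Nodes, Users, Flags=MAINT,IGNORE_JOBS) should be
checked against the scontrol man page for your Slurm version.

```python
from datetime import datetime, timedelta


def parse_slurm_time(text):
    """Convert a Slurm time-limit string to a timedelta.

    Accepted forms: minutes, minutes:seconds, hours:minutes:seconds,
    days-hours, days-hours:minutes, days-hours:minutes:seconds.
    """
    if text.upper() in ("UNLIMITED", "INFINITE"):
        # No finite lead time can guarantee an empty cluster in this case.
        raise ValueError("partition has no finite MaxTime")
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in text.split(":")]
        # After a day count the fields mean hours[:minutes[:seconds]].
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        fields = [int(f) for f in text.split(":")]
        if len(fields) == 1:
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            hours, minutes, seconds = 0, fields[0], fields[1]
        else:
            hours, minutes, seconds = fields
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def latest_creation_time(maintenance_start, max_times):
    """Latest time to create the reservation so no job can overlap it."""
    longest = max(parse_slurm_time(t) for t in max_times)
    return maintenance_start - longest


def reservation_command(name, start, duration):
    """Build (but do not run) an scontrol command for a maintenance window."""
    return (
        "scontrol create reservation "
        f"ReservationName={name} StartTime={start:%Y-%m-%dT%H:%M:%S} "
        f"Duration={duration} Nodes=ALL Users=root Flags=MAINT,IGNORE_JOBS"
    )


if __name__ == "__main__":
    # Hypothetical values: copy the real MaxTime settings from slurm.conf.
    start = datetime(2020, 8, 25, 8, 0, 0)
    partition_max_times = ["2-00:00:00", "12:00:00", "7-0"]
    print("Create the reservation no later than:",
          latest_creation_time(start, partition_max_times))
    print(reservation_command("maintenance", start, "08:00:00"))
```

Creating the reservation later than the computed time still works, but
jobs that are already running at that point may extend into the
maintenance window.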
Draining nodes is a bad idea, IMHO, because you'll have a lot of drained
nodes sitting idle between now and your maintenance period, which wastes
resources.
The way I prefer to do upgrades is actually neither 1) nor 2). I make
rolling (minor) upgrades of the compute node OS and firmware while the
cluster is in full production in order to avoid lost resources. I will
post my upgrade script to this list in a separate message.
/Ole