On 06-08-2020 19:13, Jason Simms wrote:
> Later this month, I will have to bring down, patch, and reboot all nodes 
> in our cluster for maintenance. The two options available to set nodes 
> into a maintenance mode seem to be either: 1) creating a system-wide 
> reservation, or 2) setting all nodes into a DRAIN state.
> I'm not sure it really matters either way, but is there any preference 
> one way or the other? Any gotchas I should be aware of?

I'd recommend using a reservation because you can define a specific 
maintenance period well ahead of time.  Create the reservation at least 
as far in advance as the greatest MaxTime of any partition in 
slurm.conf, so that no jobs will still be running when the reservation 
starts.  Jobs can then continue to run until the very last minute!
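
As a rough sketch of what such a reservation could look like (the 
reservation name, start time and duration below are made-up placeholders, 
not values from this thread, so adjust them to your site), something 
along these lines should work.  It only prints the command as a dry run; 
run the printed command as root or a Slurm administrator to actually 
create the reservation:

```shell
#!/bin/sh
# Hypothetical values for illustration only; pick your own window.
RES_NAME="maintenance"
START="2020-08-24T08:00:00"   # assumed start of the maintenance window
DURATION="08:00:00"           # assumed length (hours:minutes:seconds)

# Build a system-wide maintenance reservation covering all nodes.
# Flags=MAINT marks it as a maintenance reservation; IGNORE_JOBS lets it
# be created even if currently running jobs overlap the window.
RES_CMD="scontrol create reservation ReservationName=$RES_NAME StartTime=$START Duration=$DURATION Users=root Flags=MAINT,IGNORE_JOBS Nodes=ALL"

# Dry run: print the command instead of executing it.
echo "$RES_CMD"
```

Afterwards "scontrol show reservation" lists the reservation, and 
"scontrol show partition" shows each partition's MaxTime so you can 
check that you created it early enough.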

I have some notes on reservations in

Draining nodes is a bad idea, IMHO, because you'll have a lot of drained 
nodes sitting idle from now until your maintenance period, causing lost 
resources.

The way I prefer to do upgrades is actually neither 1) nor 2).  I make 
rolling (minor) upgrades of the compute node OS and firmware while the 
cluster is in full production in order to avoid lost resources.  I will 
post my upgrade script to this list in a separate message.


