[slurm-users] Reservation vs. Draining for Maintenance?

Paul Edmon pedmon at cfa.harvard.edu
Thu Aug 6 17:51:54 UTC 2020


Because we want to maximize usage, we have opted to simply cancel all 
running jobs on the day of the outage.  We send a notification to all 
users that this will happen.  We haven't really seen any complaints, and 
we've been doing this for years.  At the start of the outage we set all 
partitions to down, then cancel all the running jobs.  Pending jobs are 
left in place, and users are allowed to submit work during the outage.  
When we reopen, everything gets going again.

So there is a third option, though you have to accept that jobs will be 
cancelled to pull it off.
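
For anyone who wants to try this, the steps above could be sketched 
roughly as below.  This is only an illustration, not our actual script: 
the helper names (run, start_outage, end_outage), the DRY_RUN switch, 
and the partition names in the usage line are all made up for the 
example.  It assumes the standard scontrol and scancel commands, run as 
root or the Slurm admin user.

```shell
#!/usr/bin/env bash
# Sketch of a "cancel on the day" maintenance window.
# All function names and the DRY_RUN switch are hypothetical.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to run them.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

start_outage() {
    # Arguments: the partition names to close.
    # A DOWN partition still accepts submissions, but nothing new starts.
    for part in "$@"; do
        run scontrol update PartitionName="$part" State=DOWN
    done
    # Cancel everything currently running; pending jobs are left queued.
    run scancel --state=RUNNING
}

end_outage() {
    # Reopen the partitions so queued work can start again.
    for part in "$@"; do
        run scontrol update PartitionName="$part" State=UP
    done
}

# Example (partition names are placeholders):
#   start_outage general gpu
#   ... patch and reboot ...
#   end_outage general gpu
```

The partition list is passed in explicitly so the dry run can be checked 
before anything is touched.  On a real cluster something like 
`sinfo -h -o "%R"` should give the partition names, but verify that 
against your own configuration first.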

-Paul Edmon-

On 8/6/2020 1:13 PM, Jason Simms wrote:
> Hello all,
>
> Later this month, I will have to bring down, patch, and reboot all 
> nodes in our cluster for maintenance. The two options available to set 
> nodes into a maintenance mode seem to be either: 1) creating a 
> system-wide reservation, or 2) setting all nodes into a DRAIN state.
>
> I'm not sure it really matters either way, but is there any preference 
> one way or the other? Any gotchas I should be aware of?
>
> Warmest regards,
> Jason
>
> -- 
> *Jason L. Simms, Ph.D., M.P.H.*
> Manager of Research and High-Performance Computing
> XSEDE Campus Champion
> Lafayette College
> Information Technology Services
> 710 Sullivan Rd | Easton, PA 18042
> Office: 112 Skillman Library
> p: (610) 330-5632