[slurm-users] Reservation vs. Draining for Maintenance?

Thomas M. Payerle payerle at umd.edu
Thu Aug 6 18:07:10 UTC 2020

We usually set up a reservation for maintenance.  This prevents jobs
from starting if they are not expected to end before the reservation
(maintenance) starts.
As Paul indicated, this causes nodes to become idle (and pending job queue
to grow) as maintenance time approaches, but avoids requiring users to
resubmit partially completed jobs, especially since many of our users do
not adequately checkpoint.
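
To make the reservation behaviour concrete, here is a minimal sketch of the rule as I described it above: a job is allowed to start only if its time limit would have it finish before the reservation begins.  This is an illustration, not Slurm's actual scheduler code; the function name can_start, the example dates, and the choice of "<=" rather than "<" at the boundary are all my assumptions.

```python
from datetime import datetime, timedelta


def can_start(now, time_limit, reservation_start):
    """Hypothetical helper: True if a job started at `now` with the given
    time limit is expected to end no later than the reservation start.

    The "<=" at the boundary is an assumption made for this sketch.
    """
    return now + time_limit <= reservation_start


# Assumed example: maintenance reservation begins 2020-08-20 08:00.
maintenance = datetime(2020, 8, 20, 8, 0)
one_day_before = datetime(2020, 8, 19, 8, 0)

# A 12-hour job fits in the remaining 24 hours; a 2-day job does not.
short_job_ok = can_start(one_day_before, timedelta(hours=12), maintenance)
long_job_ok = can_start(one_day_before, timedelta(days=2), maintenance)
```

The long job is not lost, it just stays pending until after the maintenance, which is why the pending queue grows as the reservation approaches.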

Draining all of the nodes has the disadvantage of potentially increasing
cluster idle time even more: if your maximum walltime is 3 days and you
start draining at T-3d, and all jobs on the nodes have a walltime of at most
1d, then the cluster is completely idle at T-2d.  Which is fine if you can
effect the maintenance then and end 2d early, but problematic if you can't,
as no jobs can run for those 2 days.  With a reservation, short jobs continue
to run until the reservation starts.
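
The arithmetic behind that comparison can be sketched as follows.  This is a deliberately simplified model, not a measurement: both function names are hypothetical, times are in days, and the reservation case assumes the queue always holds identical jobs that run back to back starting at the same lead time.

```python
def idle_days_when_draining(drain_lead_days, longest_running_job_days):
    """Idle time before maintenance when nodes are drained.

    No new job starts once draining begins, so a node sits idle from the
    end of its longest running job until the maintenance time T.
    """
    return max(0.0, drain_lead_days - longest_running_job_days)


def idle_days_with_reservation(lead_days, job_days):
    """Idle time before maintenance when a reservation is used instead.

    Assumes (for illustration only) identical jobs of length job_days run
    back to back from T - lead_days; a job starts only if it fits before T.
    """
    if job_days <= 0:
        raise ValueError("job_days must be positive")
    jobs_that_fit = int(lead_days // job_days)
    return lead_days - jobs_that_fit * job_days


# The case from the paragraph above: drain at T-3d, jobs of at most 1d.
drain_idle = idle_days_when_draining(3.0, 1.0)           # 2.0 days idle
reservation_idle = idle_days_with_reservation(3.0, 1.0)  # 0.0 days idle
```

With longer jobs the gap narrows: for 2d jobs and the same 3d lead, both approaches leave a node idle for the last day in this model.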

But draining nodes is useful when you can effect the maintenance early if
nodes become available, and particularly in cases where only a limited
number of nodes are involved.

On Thu, Aug 6, 2020 at 1:54 PM Paul Edmon <pedmon at cfa.harvard.edu> wrote:

> Because we want to maximize usage we actually have opted to just cancel
> all running jobs the day of.  We send out notification to all the users
> that this will happen.  We haven't really seen any complaints and we've
> been doing this for years.  At the start of the outage we set all
> partitions to down, then run a cancel over all the running jobs.  Pending
> jobs are left in place, and users are allowed to submit work during the
> outage and when we reopen everything gets going again.
> So there is a third option, though you have to accept that jobs will be
> cancelled to pull it off.
> -Paul Edmon-
> On 8/6/2020 1:13 PM, Jason Simms wrote:
> Hello all,
> Later this month, I will have to bring down, patch, and reboot all nodes
> in our cluster for maintenance. The two options available to set nodes into
> a maintenance mode seem to be either: 1) creating a system-wide
> reservation, or 2) setting all nodes into a DRAIN state.
> I'm not sure it really matters either way, but is there any preference one
> way or the other? Any gotchas I should be aware of?
> Warmest regards,
> Jason
> --
> *Jason L. Simms, Ph.D., M.P.H.*
> Manager of Research and High-Performance Computing
> XSEDE Campus Champion
> Lafayette College
> Information Technology Services
> 710 Sullivan Rd | Easton, PA 18042
> Office: 112 Skillman Library
> p: (610) 330-5632

Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads        payerle at umd.edu
5825 University Research Park               (301) 405-6135
University of Maryland
College Park, MD 20740-3831