<div dir="ltr"><div>We usually set up a reservation for maintenance.  This prevents jobs from starting if they are not expected to end before the reservation (maintenance) starts.</div><div>As Paul indicated, this causes nodes to become idle (and the pending job queue to grow) as the maintenance time approaches, but it avoids requiring users to resubmit partially completed jobs, which matters especially because many of our users do not adequately checkpoint.  <br></div><div><br></div><div>Draining all of the nodes has the disadvantage of potentially increasing cluster idle time even more. If your maximum walltime is 3 days and you start draining at T-3d, and all jobs on the nodes have a walltime of at most 1d, then the cluster is completely idle at T-2d.  That is fine if you can perform the maintenance then and end 2d early, but problematic if you can't, as no jobs can run for those 2 days.  With a reservation, short jobs continue to run until the reservation starts.</div><div><br></div><div>But draining nodes is useful when you can perform the maintenance early if nodes become available, and particularly in cases where only a limited number of nodes are involved.</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Aug 6, 2020 at 1:54 PM Paul Edmon &lt;<a href="mailto:pedmon@cfa.harvard.edu">pedmon@cfa.harvard.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  
  <div>

    <p>Because we want to maximize usage we actually have opted to just

      cancel all running jobs the day of.  We send out a notification to

      all the users that this will happen.  We haven't really seen any

      complaints and we've been doing this for years.  At the start of

      the outage we set all partitions to down, then run a cancel over

      all the running jobs.  Pending jobs are left in place, and users

      are allowed to submit work during the outage and when we reopen

      everything gets going again.</p>

    <p>So there is a third option, though you have to accept that jobs

      will be cancelled to pull it off.</p>

    <p>-Paul Edmon-<br>

    </p>

    <div>On 8/6/2020 1:13 PM, Jason Simms wrote:<br>

    </div>

    <blockquote type="cite">

      
      <div dir="ltr">Hello all,

        <div><br>

        </div>

        <div>Later this month, I will have to bring down, patch, and

          reboot all nodes in our cluster for maintenance. The two

          options available to set nodes into a maintenance mode seem to

          be either: 1) creating a system-wide reservation, or 2)

          setting all nodes into a DRAIN state.</div>

        <div><br>

        </div>

        <div>I'm not sure it really matters either way, but is there any

          preference one way or the other? Any gotchas I should be aware

          of?</div>

        <div><br>

        </div>

        <div>Warmest regards,</div>

        <div>Jason<br clear="all">

          <div><br>

          </div>

          -- <br>

          <div dir="ltr">

            <div dir="ltr">

              <div>

                <div dir="ltr">

                  <div>

                    <div dir="ltr">

                      <div>

                        <div dir="ltr">

                          <div style="color:rgb(0,0,0);font-family:Helvetica;font-size:14px;margin:0px"><span style="color:rgb(130,36,51)"><font face="Century Gothic"><b>Jason L. Simms,

                                  Ph.D., M.P.H.</b></font></span></div>

                          <div style="color:rgb(0,0,0);font-family:Helvetica;font-size:14px;margin:0px"><font face="Century Gothic"><span>Manager of

                                Research and High-Performance Computing</span></font></div>

                          <div style="color:rgb(0,0,0);font-family:Helvetica;font-size:14px;margin:0px"><font face="Century Gothic"><span>XSEDE Campus

                                Champion<br>

                              </span><span style="color:gray">Lafayette

                                College<br>

                                Information Technology Services<br>

                                710 Sullivan Rd | Easton, PA 18042<br>

                                Office: 112 Skillman Library<br>

                                p: (610) 330-5632</span></font></div>

                        </div>

                      </div>

                    </div>

                  </div>

                </div>

              </div>

            </div>

          </div>

        </div>

      </div>

    </blockquote>

  </div>


</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">Tom Payerle <br>DIT-ACIGS/Mid-Atlantic Crossroads        <a href="mailto:payerle@umd.edu" target="_blank">payerle@umd.edu</a><br></div><div>5825 University Research Park               (301) 405-6135<br></div><div dir="ltr">University of Maryland<br>College Park, MD 20740-3831<br></div></div></div></div></div></div>
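<div>The idle-time comparison between draining and a reservation, as described in the reply at the top of this message, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: times are measured in days before the maintenance start T, draining blocks every new job start, and a reservation lets a job start only if its requested walltime fits before T. The function names are hypothetical and are not part of Slurm.</div>

```python
def idle_days_when_draining(drain_lead_days, longest_running_job_days):
    """Days the nodes sit empty before maintenance when they are drained.

    Draining blocks all new job starts, so the nodes go idle as soon as
    the longest job that was already running has finished.
    """
    return max(0, drain_lead_days - longest_running_job_days)


def can_start_before_reservation(days_until_reservation, job_walltime_days):
    """True if a job's requested walltime ends before the reservation starts."""
    return job_walltime_days <= days_until_reservation


if __name__ == "__main__":
    # Example from the thread: drain at T-3d, running jobs need at most 1d.
    print("idle days with drain:", idle_days_when_draining(3, 1))
    # With a reservation, a 1d job can still start at T-2d but not at T-0.5d.
    print("1d job at T-2d:", can_start_before_reservation(2, 1))
    print("1d job at T-0.5d:", can_start_before_reservation(0.5, 1))
```

<div>With the numbers from the thread (drain at T-3d, jobs of at most 1d), the drained nodes are empty for the last 2 days, while under a reservation 1d jobs can keep starting until T-1d.</div>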