<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Rafael, <br>
</p>
<p>Most HPC centers have scheduled downtime on a regular basis.
      Typically it's one day a month, but I know that at Argonne
      National Lab, which is a DOE Leadership Computing Facility that
      houses some of the largest supercomputers in the world for use by
      a large number of scientists, they take their systems off-line
      every Monday for maintenance. <br>
</p>
    <p>Having regularly scheduled maintenance outages is pretty much
      necessary for any large environment. Otherwise, the users will
      never let you take the clusters offline for maintenance. Once the
      system is offline for a few hours, a task like upgrading Slurm is
      pretty easy. <br>
</p>
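    <p>For what it's worth, here is a minimal sketch of the usual
      upgrade order once the system is down. It assumes systemd-managed
      daemons and a MySQL/MariaDB accounting database with the default
      name slurm_acct_db; package names, paths, and how you roll the
      change out to the compute nodes will vary from site to site. <br>
    </p>
    <pre>
# 1. Back up the accounting database before touching anything.
mysqldump slurm_acct_db > slurm_acct_db.backup.sql

# 2. Upgrade the daemons in dependency order: slurmdbd first, then
#    slurmctld, then slurmd on the compute nodes.
systemctl stop slurmdbd
# ...install the new slurmdbd package here...
systemctl start slurmdbd

systemctl stop slurmctld
# ...install the new slurmctld package here...
systemctl start slurmctld

# On each compute node (via pdsh, ansible, or your tool of choice):
#   stop slurmd, install the new package, start slurmd.

# 3. Sanity checks before handing the system back to the users.
sinfo
scontrol show config | head
</pre>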
    <p>When I worked in a smaller environment, I didn't have regularly
      scheduled outages, but because the environment was small, it was
      easy for me to ask (or tell) the users that I needed to take the
      cluster off-line with a few days' notice, without any complaints.
      In larger environments, you'll always get pushback, which is why
      establishing a policy of regularly scheduled maintenance outages
      is necessary. <br>
</p>
<p>Prentice<br>
</p>
<div class="moz-cite-prefix">On 3/22/19 7:07 AM, Frava wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAFYhPDv1AagcOH+QgZ_XMOPTRHQzB5AJydbGZxOcJtsJBC_jSQ@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div>Hi all,</div>
<div><br>
</div>
          <div>I think it's not that easy to keep SLURM up to date on a
            cluster of more than 3k nodes with a lot of users. I mean,
            that cluster is only a little more than two years old, and
            my submission today got JOBID 12711473; the queue has 9769
            jobs (squeue | wc -l). In two years there have been only
            two maintenance outages that impacted the users, and each
            one was announced a few months in advance. They told me
            that they do plan to update SLURM, but not until late 2019,
            because they have other things to do before then. Also, I'm
            the only one asking for heterogeneous jobs...<br>
</div>
<div><br>
</div>
<div>Rafael.<br>
</div>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">Le jeu. 21 mars 2019 à 22:19,
Prentice Bisbal <<a href="mailto:pbisbal@pppl.gov"
target="_blank" moz-do-not-send="true">pbisbal@pppl.gov</a>>
a écrit :<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On
3/21/19 4:40 PM, Reuti wrote:<br>
<br>
      >> On 21.03.2019 at 16:26, Prentice Bisbal <<a
      href="mailto:pbisbal@pppl.gov" target="_blank"
      moz-do-not-send="true">pbisbal@pppl.gov</a>> wrote:<br>
>><br>
>><br>
>> On 3/20/19 1:58 PM, Christopher Samuel wrote:<br>
>>> On 3/20/19 4:20 AM, Frava wrote:<br>
>>><br>
>>>> Hi Chris, thank you for the reply.<br>
>>>> The team that manages that cluster is not
very fond of upgrading SLURM, which I understand.<br>
>> As a system admin who manages clusters myself, I
don't understand this. Our job is to provide and maintain
resources for our users. Part of that maintenance is to
provide updates for security, performance, and functionality
(new features) reasons. HPC has always been a leading-edge
      kind of field, so I feel this is even more important for HPC
admins.<br>
>><br>
      >> Yes, there can be issues caused by updates, but those
      can be mitigated with proper planning: Have a plan to do the actual
upgrade, have a plan to test for issues, and have a plan to
revert to an earlier version if issues are discovered. This is
work, but it's really not all that much work, and this is
exactly the work we are being paid to do as cluster admins.<br>
      > Besides the work on the admins' side, the users are also
      involved: exchanging libraries also means running the test
      suites of their applications again.<br>
><br>
> -- Reuti<br>
<br>
That implies the users actually wrote test suites. ;-)<br>
<br>
<br>
<br>
</blockquote>
</div>
</blockquote>
</body>
</html>