<div dir="ltr">Our strategy is a bit simpler. We're migrating compute nodes to a new cluster running 20.x. This isn't an upgrade. We'll keep the old slurmdbd running at least long enough to pull the remaining accounting data into XDMoD.<div><br></div><div>The old cluster will keep running jobs until there are no more to run. We'll drain nodes and move them to the new cluster as more and more of them go idle in the old cluster. This avoids MPI ugliness, and we move directly to 20.x.<br><div><br></div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Nov 2, 2020 at 9:28 AM Paul Edmon &lt;<a href="mailto:pedmon@cfa.harvard.edu">pedmon@cfa.harvard.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  <div>

    <p>In general, I would follow this:</p>

    <p><a href="https://slurm.schedmd.com/quickstart_admin.html#upgrade" target="_blank">https://slurm.schedmd.com/quickstart_admin.html#upgrade</a></p>

    <p>Namely:</p>

    <p>Almost every new major release of Slurm (e.g. 19.05.x to 20.02.x)

      involves changes to the state files with new data structures, new

      options, etc.

      Slurm permits upgrades to a new major release from the past two

      major releases,

      which happen every nine months (e.g. 18.08.x or 19.05.x to

      20.02.x) without loss of

      jobs or other state information.

      State information from older versions will not be recognized and

      will be

      discarded, resulting in loss of all running and pending jobs.

      State files are <b>not</b> recognized when downgrading (e.g. from

      19.05.x to 18.08.x)

      and will be discarded, resulting in loss of all running and

      pending jobs.

      For this reason, creating backup copies of state files (as

      described below)

      can be of value.

      Therefore when upgrading Slurm (more precisely, the slurmctld

      daemon),

      saving the <i>StateSaveLocation</i> (as defined in <i>slurm.conf</i>)

      directory contents with all state information is recommended.

      If you need to downgrade, restoring that directory's contents will

      let you

      recover the jobs.

      Jobs submitted under the new version will not be in those state

      files,

      but it can let you recover most jobs.

      An exception to this is that jobs may be lost when installing new

      pre-release

      versions (e.g. 20.02.0-pre1 to 20.02.0-pre2).

      Developers will try to note these cases in the NEWS file.

      Contents of major releases are also described in the RELEASE_NOTES

      file.</p>
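    <p>As a rough illustration of the backup step described above, here is a
      minimal Python sketch that reads <i>StateSaveLocation</i> from
      <i>slurm.conf</i> and copies the directory to a timestamped folder. The
      function names and the backup destination are hypothetical, and the
      sketch assumes it is run with read access to the state directory,
      ideally while slurmctld is stopped so the files are not changing.</p>

```python
import shutil
import time
from pathlib import Path


def find_state_save_location(slurm_conf):
    """Return the StateSaveLocation path from a slurm.conf file, or None."""
    for raw in Path(slurm_conf).read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        key, sep, value = line.partition("=")
        # slurm.conf parameter names are case-insensitive
        if sep and key.strip().lower() == "statesavelocation":
            return Path(value.strip())
    return None


def backup_state(slurm_conf, backup_root):
    """Copy the state directory into a timestamped folder under backup_root.

    backup_root is a hypothetical destination chosen by the admin.
    """
    src = find_state_save_location(slurm_conf)
    if src is None:
        raise ValueError("StateSaveLocation not set in " + str(slurm_conf))
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / ("statesave-" + stamp)
    shutil.copytree(src, dest)  # creates dest and any missing parents
    return dest
```

    <p>Restoring would be the reverse copy back into the
      <i>StateSaveLocation</i> directory before starting the older
      slurmctld.</p>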

    <p>So I wouldn't go directly to 20.x; instead, I would go from 17.x
      to 19.x and then to 20.x.</p>
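    <p>The "past two major releases" rule can be sketched as a small path
      calculation. This is only an illustration: the release list is
      hard-coded from the versions named in this thread, and
      <code>upgrade_path</code> is a hypothetical helper, not part of
      Slurm.</p>

```python
# Major releases named in this thread, oldest first (hard-coded assumption).
RELEASES = ["17.11", "18.08", "19.05", "20.02"]

# "Slurm permits upgrades to a new major release from the past two
# major releases", so one hop can skip at most one release.
MAX_JUMP = 2


def upgrade_path(current, target):
    """Return a fewest-hop list of releases from current to target."""
    start = RELEASES.index(current)
    end = RELEASES.index(target)
    if start > end:
        raise ValueError("downgrades are not supported")
    path = [current]
    pos = start
    while pos != end:
        pos = min(pos + MAX_JUMP, end)  # take the largest allowed hop
        path.append(RELEASES[pos])
    return path


print(upgrade_path("17.11", "20.02"))
```

    <p>For 17.11 to 20.02 this yields the two-step route via 19.05. Going
      via 18.08 instead would also satisfy the rule with the same number of
      hops.</p>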

    <p>-Paul Edmon-<br>

    </p>

    <div>On 11/2/2020 8:55 AM, Fulcomer, Samuel

      wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">We're doing something similar. We're continuing to

        run production on 17.x and have set up a new server/cluster 

        running 20.x for testing and MPI app rebuilds.

        <div><br>

        </div>

        <div>Our plan had been to add recently purchased nodes to the
          new cluster, and at some point turn off submission on the old
          cluster and switch everyone to submission on the new cluster
          (new login/submission hosts). That way previously submitted
          MPI apps would continue to run properly. As the old cluster
          partitions started to clear out, we'd mark ranges of nodes to
          drain and move them to the new cluster.</div>
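        <div>Marking a range of nodes to drain could be scripted along these
          lines. This is a sketch under assumptions: the node range and
          reason string are made-up examples, the helper names are
          hypothetical, and it defaults to a dry run that only prints the
          <code>scontrol update</code> command instead of executing it.</div>

```python
import subprocess


def drain_command(nodes, reason):
    """Build the scontrol argument list that drains the given node range."""
    return ["scontrol", "update", "NodeName=" + nodes,
            "State=DRAIN", "Reason=" + reason]


def drain(nodes, reason, dry_run=True):
    """Print the drain command, or run it on the old cluster if dry_run is False."""
    cmd = drain_command(nodes, reason)
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)  # needs scontrol and admin rights
    return cmd


# Hypothetical node range; substitute the real names from slurm.conf.
drain("node[001-016]", "moving to new cluster")
```

        <div>Draining lets running jobs finish while blocking new ones, so
          a node is safe to move once it shows no allocated jobs.</div>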

        <div><br>

        </div>

        <div>We've since decided to wait until January, when we've

          scheduled some downtime. The process will remain the same wrt

          moving nodes from the old cluster to the new, _except_ that

          everything will be drained, so we can move big blocks of nodes

          and avoid slurm.conf Partition line ugliness.</div>

        <div><br>

        </div>

        <div>We're starting with a fresh database to get rid of the
          bug-induced corruption that prevents GPUs from being fenced with
          cgroups.</div>

        <div><br>

        </div>

        <div>regards,</div>

        <div>s</div>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Mon, Nov 2, 2020 at 8:28 AM
          navin srivastava &lt;<a href="mailto:navin.altair@gmail.com" target="_blank">navin.altair@gmail.com</a>&gt; wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

          <div dir="ltr">Dear All,<br>

            <div><br>

            </div>

            <div>We are currently running Slurm version 17.11.x and
              want to move to 20.x.</div>

            <div><br>

            </div>

            <div>We are building the new server with Slurm 20.02
              and plan to upgrade the client nodes from 17.x to
              20.x.</div>

            <div><br>

            </div>

            <div>We wanted to check whether we can upgrade the clients
              from 17.x to 20.x directly, or whether we need to go from
              17.x to 18.x, then 19.x, and then 20.x.</div>

            <div><br>

            </div>

            <div>Regards</div>

            <div>Navin.</div>


          </div>

        </blockquote>

      </div>

    </blockquote>

  </div>

</blockquote></div>