<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>We haven't really had MPI ugliness with the latest versions. Plus,
we've been building our own PMIx and compiling against it, which
seems to have solved most of the cross-compatibility issues.</p>
<p>-Paul Edmon-<br>
</p>
<div class="moz-cite-prefix">On 11/2/2020 10:38 AM, Fulcomer, Samuel
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAOORAuGAM4zpGqkasM8fmP1bVrnqP9+jEGsX83-8A9tb6gJGEw@mail.gmail.com">
<div dir="ltr">Our strategy is a bit simpler. We're migrating
compute nodes to a new cluster running 20.x. This isn't an
upgrade. We'll keep the old slurmdbd running for at least enough
time to suck the remaining accounting data into XDMoD.
<div><br>
</div>
<div>The old cluster will keep running jobs until there are no
more to run. We'll drain and move nodes to the new cluster as
we start seeing more and more idle nodes in the old cluster.
This avoids MPI ugliness and we move directly to 20.x.<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Nov 2, 2020 at 9:28 AM
Paul Edmon &lt;<a href="mailto:pedmon@cfa.harvard.edu"
moz-do-not-send="true">pedmon@cfa.harvard.edu</a>&gt; wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>In general I would follow this:</p>
<p><a
href="https://slurm.schedmd.com/quickstart_admin.html#upgrade"
target="_blank" moz-do-not-send="true">https://slurm.schedmd.com/quickstart_admin.html#upgrade</a></p>
<p>Namely:</p>
<p>Almost every new major release of Slurm (e.g. 19.05.x to
20.02.x) involves changes to the state files with new data
structures, new options, etc. Slurm permits upgrades to a
new major release from the past two major releases, which
happen every nine months (e.g. 18.08.x or 19.05.x to
20.02.x) without loss of jobs or other state information.
State information from older versions will not be
recognized and will be discarded, resulting in loss of all
running and pending jobs. State files are <b>not</b>
recognized when downgrading (e.g. from 19.05.x to 18.08.x)
and will be discarded, resulting in loss of all running
and pending jobs. For this reason, creating backup copies
of state files (as described below) can be of value.
Therefore when upgrading Slurm (more precisely, the
slurmctld daemon), saving the <i>StateSaveLocation</i>
(as defined in <i>slurm.conf</i>) directory contents with
all state information is recommended. If you need to
downgrade, restoring that directory's contents will let
you recover the jobs. Jobs submitted under the new version
will not be in those state files, but it can let you
recover most jobs. An exception to this is that jobs may
be lost when installing new pre-release versions (e.g.
20.02.0-pre1 to 20.02.0-pre2). Developers will try to note
these cases in the NEWS file. Contents of major releases
are also described in the RELEASE_NOTES file.</p>
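<p>The backup step recommended above could be sketched roughly as
follows. This is only an illustration: the function name
<i>backup_state_dir</i> and both paths in the usage note are hypothetical,
the real directory is whatever <i>StateSaveLocation</i> is set to in
<i>slurm.conf</i>, and the sketch assumes slurmctld is not writing to
the directory while the copy runs.</p>
<pre>
```python
import shutil
import time
from pathlib import Path


def backup_state_dir(state_dir, backup_root):
    """Copy the state directory to a timestamped folder and return its path.

    Both arguments are supplied by the caller; nothing here reads slurm.conf.
    """
    src = Path(state_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-{stamp}"
    # copytree creates dest (and missing parents) and fails if dest exists,
    # so an earlier backup is never overwritten.
    shutil.copytree(src, dest)
    return dest
```
</pre>
<p>For example, <i>backup_state_dir("/var/spool/slurmctld",
"/root/slurm-backups")</i> would copy the directory before the upgrade,
assuming those (made-up) paths match your site.</p>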
<p>So I wouldn't go directly to 20.x. Instead, I would go
from 17.x to 19.x and then to 20.x, so that each step stays
within the two-major-release window.</p>
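<p>To make the two-release rule concrete, here is a minimal sketch
that plans the hops. The <i>RELEASES</i> list and the
<i>plan_upgrade</i> helper are illustrative assumptions, not part of
Slurm; the list only covers the major releases relevant to this
thread.</p>
<pre>
```python
# Slurm major releases in order; an illustrative list, extend as needed.
RELEASES = ["17.02", "17.11", "18.08", "19.05", "20.02"]


def plan_upgrade(current, target, max_jump=2):
    """Return the major releases to install, in order.

    Each step moves forward by at most max_jump major releases,
    mirroring the "past two major releases" rule quoted above.
    """
    i = RELEASES.index(current)
    j = RELEASES.index(target)
    if i > j:
        raise ValueError("downgrade: state files would be discarded")
    path = []
    while i != j:
        i = min(i + max_jump, j)
        path.append(RELEASES[i])
    return path


print(plan_upgrade("17.11", "20.02"))
```
</pre>
<p>Under these assumptions the plan for 17.11 is 19.05 first and then
20.02, which is the same route suggested above.</p>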
<p>-Paul Edmon-<br>
</p>
<div>On 11/2/2020 8:55 AM, Fulcomer, Samuel wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">We're doing something similar. We're
continuing to run production on 17.x and have set up a
new server/cluster running 20.x for testing and MPI app
rebuilds.
<div><br>
</div>
<div>Our plan had been to add recently purchased nodes
to the new cluster, and at some point turn off
submission on the old cluster and switch everyone to
submission on the new cluster (new login/submission
hosts). That way previously submitted MPI apps would
continue to run properly. As the old cluster
partitions started to clear out we'd mark ranges of
nodes to drain and move them to the new cluster.</div>
<div><br>
</div>
<div>We've since decided to wait until January, when
we've scheduled some downtime. The process will remain
the same wrt moving nodes from the old cluster to the
new, _except_ that everything will be drained, so we
can move big blocks of nodes and avoid slurm.conf
Partition line ugliness.</div>
<div><br>
</div>
<div>We're starting with a fresh database to get rid of
the bug-induced corruption that prevents GPUs from
being fenced with cgroups.</div>
<div><br>
</div>
<div>regards,</div>
<div>s</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Nov 2, 2020 at
8:28 AM navin srivastava &lt;<a
href="mailto:navin.altair@gmail.com" target="_blank"
moz-do-not-send="true">navin.altair@gmail.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px
0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr">Dear All,<br>
<div><br>
</div>
<div>We are currently running Slurm version 17.11.x
and want to move to 20.x.</div>
<div><br>
</div>
<div>We are building a new server with Slurm 20.02
and plan to upgrade the client nodes
from 17.x to 20.x.</div>
<div><br>
</div>
<div>We wanted to check whether we can upgrade the clients
from 17.x to 20.x directly, or whether we need to go
from 17.x through 18.x and 19.x to 20.x.</div>
<div><br>
</div>
<div>Regards</div>
<div>Navin.</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</body>
</html>