<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>We don't follow the recommended procedure here but rather build
RPMs and upgrade using those. We haven't had any issues. Here is
our procedure:<br>
</p>
<p>1. Build rpms from source using a version of the slurm.spec file
that we maintain. It's the version SchedMD provides, modified for
our environment and to disable automatic daemon restarts on
upgrade, which can cause problems, especially when upgrading the
slurm database.</p>
<p>2. We test the upgrade on our test cluster using the following
sequence.</p>
<p>a. Pause all jobs and stop all scheduling.</p>
<p>b. Stop slurmctld and slurmdbd.</p>
<p>c. Backup spool and the database.<br>
</p>
<p>d. Upgrade the slurm rpms (make sure the upgrade will not
automatically restart slurmdbd or slurmctld, or you may end up
in a world of hurt).</p>
<p>e. Run slurmdbd -Dvvvvv to do the database upgrade. Depending on
the upgrade this can take a while because of database schema
changes.</p>
<p>f. Once the upgrade completes, restart slurmdbd as a service.</p>
<p>g. Upgrade the slurm rpms across the cluster using salt.</p>
<p>h. Global restart of slurmd and slurmctld.</p>
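<p>Steps a through h boil down to something like the following. This is
a rough sketch, not a drop-in script: the partition name, backup paths,
database name, and the use of yum/salt are all examples to adapt to
your site:</p>

```shell
# a. Stop scheduling and pause running jobs.
scontrol update PartitionName=batch State=DOWN   # repeat per partition
squeue -h -t RUNNING -o '%i' | xargs -r -n1 scontrol suspend

# b. Stop the controller and database daemons.
systemctl stop slurmctld slurmdbd

# c. Back up the state spool and the accounting database
#    (paths and DB name are examples).
tar -czf /root/slurmctld-state-$(date +%F).tar.gz /var/spool/slurmctld
mysqldump slurm_acct_db > /root/slurm_acct_db-$(date +%F).sql

# d. Upgrade the rpms on the ctld/dbd host. The spec must have been
#    modified so %post does NOT restart the daemons.
yum -y upgrade 'slurm*'

# e. Run the schema migration in the foreground; watch the output and
#    Ctrl-C once it settles. Large databases can take a long time.
slurmdbd -Dvvvvv

# f. Restart slurmdbd as a service.
systemctl start slurmdbd

# g. Push the rpms cluster-wide (salt shown here as an example).
salt '*' cmd.run "yum -y upgrade 'slurm*'"

# h. Global restart, then resume scheduling.
salt '*' service.restart slurmd
systemctl start slurmctld
scontrol update PartitionName=batch State=UP    # repeat per partition
```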
<p>3. If that all looks good, we rinse and repeat on our production
cluster.</p>
<p>The rpms have worked fine for us. The main hitch is the
automatic restart on upgrade, which I do not recommend. You
should neuter that portion of the provided spec file, especially
for the slurmdbd upgrades.</p>
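<p>A quick way to sanity-check the packages you built is to inspect
their install scriptlets before deploying. The package names below
match what the stock SchedMD spec produces; adjust if your spec
renames them:</p>

```shell
# Dump the %post/%postun scriptlets of the built packages and look for
# anything that restarts the daemons; no output is what you want.
rpm -qp --scripts ~/rpmbuild/RPMS/x86_64/slurm-slurmdbd-*.rpm | grep -i restart
rpm -qp --scripts ~/rpmbuild/RPMS/x86_64/slurm-slurmctld-*.rpm | grep -i restart
```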
<p>We generally prefer the RPM method as it is the normal method for
interaction with the OS and works well with Puppet.<br>
</p>
<p>-Paul Edmon-<br>
</p>
<div class="moz-cite-prefix">On 11/2/2020 10:13 AM, Jason Simms
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAP7JYwdqOOSH15Jbr8s9rLhPFcS2JmrFO3k+W=dATyM_npiG6A@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">Hello all,<br>
<div><br>
</div>
<div>I am going to reveal the degree of my inexperience here,
but am I perhaps the only one who thinks that Slurm's upgrade
procedure is too complex? Or, at least maybe not explained in
enough detail?</div>
<div><br>
</div>
<div>I'm running a CentOS 8 cluster, and to me, I should be able
simply to update the Slurm package and any of its
dependencies, and that's it. When I looked at the notes from
the recent Slurm Users' Group meeting, however, I see that
while that mode is technically supported, it is not
recommended, and instead one should always rebuild from
source. Really?</div>
<div><br>
</div>
<div>So, ok, regardless of whether that's the case, the upgrade
notes linked to in the prior post don't, in my opinion, go
into enough detail. They tell you broadly what to do, but not
necessarily how to do it. I'd welcome example commands for
each step (understanding that changes might be needed to
account for local configurations). There are no examples in
that section, for example, addressing recompiling from source.</div>
<div><br>
</div>
<div>Now, I suspect a chorus of "if you don't understand it well
enough, you shouldn't be managing it." OK. Perhaps that's fair
enough. But I came into this role via a non-traditional route
and am constantly trying to improve my admin skills, and I may
not have the complete mastery of all aspects quite yet. But I
would also say that documentation should be clear and
complete, and not written solely for experts. To be honest,
I've had to go to lots of documentation external to SchedMD to
see good examples of actually working with Slurm, or even ask
the helpful people on this group. And I firmly believe that if
there is a packaged version of your software - as there is for
Slurm - that should be the default, fully-working way to
upgrade.</div>
<div><br>
</div>
<div>Warmest regards,</div>
<div>Jason</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Nov 2, 2020 at 9:28 AM
Paul Edmon <<a href="mailto:pedmon@cfa.harvard.edu"
moz-do-not-send="true">pedmon@cfa.harvard.edu</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>In general I would follow this:</p>
<p><a
href="https://slurm.schedmd.com/quickstart_admin.html#upgrade"
target="_blank" moz-do-not-send="true">https://slurm.schedmd.com/quickstart_admin.html#upgrade</a></p>
<p>Namely:</p>
<p>Almost every new major release of Slurm (e.g. 19.05.x to
20.02.x) involves changes to the state files with new data
structures, new options, etc. Slurm permits upgrades to a
new major release from the past two major releases, which
happen every nine months (e.g. 18.08.x or 19.05.x to
20.02.x) without loss of jobs or other state information.
State information from older versions will not be
recognized and will be discarded, resulting in loss of all
running and pending jobs. State files are <b>not</b>
recognized when downgrading (e.g. from 19.05.x to 18.08.x)
and will be discarded, resulting in loss of all running
and pending jobs. For this reason, creating backup copies
of state files (as described below) can be of value.
Therefore when upgrading Slurm (more precisely, the
slurmctld daemon), saving the <i>StateSaveLocation</i>
(as defined in <i>slurm.conf</i>) directory contents with
all state information is recommended. If you need to
downgrade, restoring that directory's contents will let
you recover the jobs. Jobs submitted under the new version
will not be in those state files, but it can let you
recover most jobs. An exception to this is that jobs may
be lost when installing new pre-release versions (e.g.
20.02.0-pre1 to 20.02.0-pre2). Developers will try to note
these cases in the NEWS file. Contents of major releases
are also described in the RELEASE_NOTES file.</p>
<p>So I wouldn't go directly to 20.x; instead, I would go
from 17.x to 19.x and then to 20.x.</p>
<p>-Paul Edmon-<br>
</p>
<div>On 11/2/2020 8:55 AM, Fulcomer, Samuel wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">We're doing something similar. We're
continuing to run production on 17.x and have set up a
new server/cluster running 20.x for testing and MPI app
rebuilds.
<div><br>
</div>
<div>Our plan had been to add recently purchased nodes
to the new cluster, and at some point turn off
submission on the old cluster and switch everyone to
submission on the new cluster (new login/submission
hosts). That way previously submitted MPI apps would
continue to run properly. As the old cluster
partitions started to clear out we'd mark ranges of
nodes to drain and move them to the new cluster.</div>
<div><br>
</div>
<div>We've since decided to wait until January, when
we've scheduled some downtime. The process will remain
the same wrt moving nodes from the old cluster to the
new, _except_ that everything will be drained, so we
can move big blocks of nodes and avoid slurm.conf
Partition line ugliness.</div>
<div><br>
</div>
<div>We're starting with a fresh database to get rid of
the bug-induced corruption that prevents GPUs from
being fenced with cgroups.</div>
<div><br>
</div>
<div>regards,</div>
<div>s</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Nov 2, 2020 at
8:28 AM navin srivastava <<a
href="mailto:navin.altair@gmail.com" target="_blank"
moz-do-not-send="true">navin.altair@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px
0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr">Dear All,<br>
<div><br>
</div>
<div>Currently we are running slurm version 17.11.x
and wanted to move to 20.x.</div>
<div><br>
</div>
<div>We are building the New server with Slurm 20.2
version and planning to upgrade the client nodes
from 17.x to 20.x.</div>
<div><br>
</div>
<div>wanted to check if we can upgrade the Client
from 17.x to 20.x directly or we need to go
through 17.x to 18.x and 19.x then 20.x</div>
<div><br>
</div>
<div>Regards</div>
<div>Navin.</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr" class="gmail_signature">
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div
style="color:rgb(0,0,0);font-family:Helvetica;font-size:14px;margin:0px"><span
style="color:rgb(130,36,51)"><font
face="Century Gothic"><b>Jason L. Simms,
Ph.D., M.P.H.</b></font></span></div>
<div
style="color:rgb(0,0,0);font-family:Helvetica;font-size:14px;margin:0px"><font
face="Century Gothic"><span>Manager of
Research and High-Performance Computing</span></font></div>
<div
style="color:rgb(0,0,0);font-family:Helvetica;font-size:14px;margin:0px"><font
face="Century Gothic"><span>XSEDE Campus
Champion<br>
</span><span style="color:gray">Lafayette
College<br>
Information Technology Services<br>
710 Sullivan Rd | Easton, PA 18042<br>
Office: 112 Skillman Library<br>
p: (610) 330-5632</span></font></div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</body>
</html>