[slurm-users] Slurm Upgrade

Paul Edmon pedmon at cfa.harvard.edu
Mon Nov 2 15:54:02 UTC 2020


We haven't really had MPI ugliness with the latest versions. We've also
been rolling our own PMIx and building against it, which seems to have
solved most of the cross-compatibility issues.

-Paul Edmon-

On 11/2/2020 10:38 AM, Fulcomer, Samuel wrote:
> Our strategy is a bit simpler. We're migrating compute nodes to a new 
> cluster running 20.x. This isn't an upgrade. We'll keep the old 
> slurmdbd running for at least enough time to suck the remaining 
> accounting data into XDMoD.
>
> The old cluster will keep running jobs until there are no more to run. 
> We'll drain and move nodes to the new cluster as we start seeing more 
> and more idle nodes in the old cluster. This avoids MPI ugliness and 
> we move directly to 20.x.
>
>
>
> On Mon, Nov 2, 2020 at 9:28 AM Paul Edmon <pedmon at cfa.harvard.edu> wrote:
>
>     In general, I would follow this:
>
>     https://slurm.schedmd.com/quickstart_admin.html#upgrade
>
>     Namely:
>
>     Almost every new major release of Slurm (e.g. 19.05.x to 20.02.x)
>     involves changes to the state files with new data structures, new
>     options, etc. Slurm permits upgrades to a new major release from
>     the past two major releases, which happen every nine months (e.g.
>     18.08.x or 19.05.x to 20.02.x) without loss of jobs or other state
>     information. State information from older versions will not be
>     recognized and will be discarded, resulting in loss of all running
>     and pending jobs. State files are *not* recognized when
>     downgrading (e.g. from 19.05.x to 18.08.x) and will be discarded,
>     resulting in loss of all running and pending jobs. For this
>     reason, creating backup copies of state files (as described below)
>     can be of value. Therefore when upgrading Slurm (more precisely,
>     the slurmctld daemon), saving the /StateSaveLocation/ (as defined
>     in /slurm.conf/) directory contents with all state information is
>     recommended. If you need to downgrade, restoring that directory's
>     contents will let you recover the jobs. Jobs submitted under the
>     new version will not be in those state files, but it can let you
>     recover most jobs. An exception to this is that jobs may be lost
>     when installing new pre-release versions (e.g. 20.02.0-pre1 to
>     20.02.0-pre2). Developers will try to note these cases in the NEWS
>     file. Contents of major releases are also described in the
>     RELEASE_NOTES file.
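
A minimal sketch of the backup step described above, assuming a plain slurm.conf with no Include directives. The function names, the archive naming scheme, and the example paths are hypothetical and not part of Slurm. It should presumably be run with slurmctld stopped so the state files are consistent.

```python
import re
import tarfile
import time
from pathlib import Path
from typing import Optional


def state_save_location(conf_text: str) -> Optional[str]:
    """Return the StateSaveLocation value found in slurm.conf text, or None."""
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r"(?i)StateSaveLocation\s*=\s*(\S+)", line)
        if match:
            return match.group(1)
    return None


def backup_state_dir(state_dir: str, backup_root: str) -> Path:
    """Archive state_dir into a timestamped .tar.gz under backup_root."""
    src = Path(state_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"not a directory: {src}")
    dest_dir = Path(backup_root)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"slurm-state-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the directory name so a restore is simple.
        tar.add(src, arcname=src.name)
    return archive
```

Usage would be something like `backup_state_dir(state_save_location(Path("/etc/slurm/slurm.conf").read_text()), "/root/slurm-backups")`, where both paths are only examples. Restoring for a downgrade means unpacking the archive back into the StateSaveLocation directory.
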
>
>     So I wouldn't go directly to 20.x. Instead, I would go from 17.x to
>     19.x and then to 20.x.
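
The "at most two major releases per step" rule can be sketched as a small path calculation. This is only an illustration: the release list is hard-coded to cover the versions named in this thread, and the function names are hypothetical, not anything shipped with Slurm.

```python
# Slurm major releases in order. Assumption: only the releases relevant to
# this thread are listed; extend the list for other versions.
RELEASES = ["17.02", "17.11", "18.08", "19.05", "20.02"]

# State files are only recognized from the past two major releases.
MAX_JUMP = 2


def major(version: str) -> str:
    """Reduce a version such as '17.11.x' or '19.05.3' to '17.11' / '19.05'."""
    return ".".join(version.split(".")[:2])


def upgrade_path(current, target, releases=RELEASES, max_jump=MAX_JUMP):
    """Return the major releases to step through, taking the largest
    supported jump each time."""
    start = releases.index(major(current))
    end = releases.index(major(target))
    if end < start:
        raise ValueError("downgrading discards state files")
    path = [releases[start]]
    i = start
    while i < end:
        i = min(i + max_jump, end)
        path.append(releases[i])
    return path


print(upgrade_path("17.11.x", "20.02.x"))  # ['17.11', '19.05', '20.02']
```

For 17.11 this gives one intermediate stop at 19.05, which is the same route as suggested above.
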
>
>     -Paul Edmon-
>
>     On 11/2/2020 8:55 AM, Fulcomer, Samuel wrote:
>>     We're doing something similar. We're continuing to run production
>>     on 17.x and have set up a new server/cluster running 20.x for
>>     testing and MPI app rebuilds.
>>
>>     Our plan had been to add recently purchased nodes to the new
>>     cluster, and at some point turn off submission on the old cluster
>>     and switch everyone to submission on the new cluster (new
>>     login/submission hosts). That way previously submitted MPI apps
>>     would continue to run properly. As the old cluster partitions
>>     started to clear out we'd mark ranges of nodes to drain and move
>>     them to the new cluster.
>>
>>     We've since decided to wait until January, when we've scheduled
>>     some downtime. The process will remain the same wrt moving nodes
>>     from the old cluster to the new, _except_ that everything will be
>>     drained, so we can move big blocks of nodes and avoid slurm.conf
>>     Partition line ugliness.
>>
>>     We're starting with a fresh database to get rid of the bug-induced
>>     corruption that prevents GPUs from being fenced with cgroups.
>>
>>     regards,
>>     s
>>
>>     On Mon, Nov 2, 2020 at 8:28 AM navin srivastava
>>     <navin.altair at gmail.com> wrote:
>>
>>         Dear All,
>>
>>         We are currently running Slurm version 17.11.x and want to
>>         move to 20.x.
>>
>>         We are building the new server with Slurm 20.02 and
>>         planning to upgrade the client nodes from 17.x to 20.x.
>>
>>         We wanted to check whether we can upgrade the clients from
>>         17.x to 20.x directly, or whether we need to go from 17.x
>>         through 18.x and 19.x to 20.x.
>>
>>         Regards
>>         Navin.
>>
>>
>>