[slurm-users] Rolling upgrade of compute nodes

byron lbgpublic at gmail.com
Mon May 30 10:01:00 UTC 2022


Thanks for the feedback.

I've done the database dry run on a clone of our database / slurmdbd and
that all went fine.

We have a reboot program defined.

The one thing I'm unsure about is as much a Linux / NFS question as a
slurm one.  When I change the "default" soft link to point to the new
20.11 slurm install while the compute nodes are still running the old
19.05 version (because they haven't been rebooted yet), will that cause
any problems?  Or will they simply keep seeing the 19.05 version they are
already running until they are restarted?
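
For reference, here is roughly what I was planning to check on a test node
first (the /opt/slurm paths are just placeholders, not our real layout):

  # a running slurmd keeps the binary and libraries it has already opened,
  # so it should keep using 19.05 until it is restarted
  ls -l /proc/$(pidof slurmd)/exe

  # repoint the link; only newly started processes will resolve it to 20.11
  ln -sfn /opt/slurm/slurm-20.11 /opt/slurm/default

  # confirm what a fresh invocation now picks up
  /opt/slurm/default/sbin/slurmd -V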

thanks

On Mon, May 30, 2022 at 8:18 AM Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
wrote:

> Hi Byron,
>
> Adding to Stephan's note, it's strongly recommended to make a database
> dry-run upgrade test before upgrading the production slurmdbd.  Many
> details are in
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
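> 
> In case it helps, the dry run boils down to something like this (an
> untested sketch; slurm_acct_db is the default database name, adjust as
> needed, and point the test slurmdbd.conf at the test database):
> 
>    # on the production database host
>    mysqldump --single-transaction slurm_acct_db > slurm_acct_db.sql
>    # on a throw-away host with the new slurmdbd installed
>    mysql -e 'CREATE DATABASE slurm_acct_db'
>    mysql slurm_acct_db < slurm_acct_db.sql
>    # run the new slurmdbd in the foreground and watch the schema conversion
>    slurmdbd -D -vvv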
>
> If you have separate slurmdbd and slurmctld machines (recommended), the
> next step is to upgrade the slurmctld.
>
> Finally you can upgrade the slurmd's while the cluster is running in
> production mode.  Since you have Slurm on NFS, following Chris'
> recommendation of rebooting the nodes may be the safest approach.
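> 
> Per node that can look roughly like this (a sketch only; the node name is
> made up and the available reboot options depend on your Slurm version):
> 
>    # ask Slurm to reboot the node once it is idle, then resume it
>    scontrol reboot ASAP nextstate=resume reason="slurm 20.11" node001
>    # or drain it first and reboot by hand once the jobs have finished
>    scontrol update nodename=node001 state=drain reason="slurm 20.11"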
>
> After upgrading everything to 20.11, you should next upgrade to 21.08.
> Upgrade to the latest 22.05 should probably wait for a few minor releases.
>
> /Ole
>
> On 5/30/22 08:54, Stephan Roth wrote:
> > If you have the means to set up a test environment to try the upgrade
> > first, I recommend doing it.
> >
> > The upgrade from 19.05 to 20.11 worked for two clusters I maintain with a
> > similar NFS based setup, except we keep the Slurm configuration separated
> > from the Slurm software accessible through NFS.
> >
> > For upgrades spanning no more than two major releases this should work
> > well by restarting the Slurm daemons in the recommended order (see
> > https://slurm.schedmd.com/SLUG19/Field_Notes_3.pdf) after switching the
> > soft link to 20.11 (rough commands are sketched further down):
> >
> > 1. slurmdbd
> > 2. slurmctld
> > 3. individual slurmd on your nodes
> >
> > To be able to revert to 19.05 you should dump the database between
> > stopping and starting slurmdbd, as well as backing up StateSaveLocation
> > between stopping and restarting slurmctld.
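> >
> > Roughly, on the slurmdbd and slurmctld hosts that comes down to something
> > like this (a sketch; the database name, paths and service names are the
> > usual defaults and may differ on your systems):
> >
> >    systemctl stop slurmdbd
> >    mysqldump --single-transaction slurm_acct_db > slurm_acct_db.backup.sql
> >    # switch the "default" soft link to 20.11 at this point
> >    systemctl start slurmdbd
> >
> >    systemctl stop slurmctld
> >    tar czf statesave-19.05.tar.gz -C /var/spool/slurmctld .
> >    systemctl start slurmctld
> >
> > and then restart the slurmd's node by node.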
> >
> > slurmstepd's of running jobs will continue to run on 19.05 after
> > restarting the slurmd's.
> >
> > Check individual slurmd.log files for problems.
> >
> > Cheers,
> > Stephan
> >
> > On 30.05.22 00:09, byron wrote:
> >> Hi
> >>
> >> I'm currently doing an upgrade from 19.05 to 20.11.
> >>
> >> All of our compute nodes have the same install of slurm NFS mounted.  The
> >> system has been set up so that all the start scripts and configuration
> >> files point to the default installation, which is a soft link to the most
> >> recent installation of slurm.
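> >>
> >> For illustration, the layout is along these lines (the paths are
> >> examples, not the real ones):
> >>
> >>    /opt/slurm/slurm-19.05/
> >>    /opt/slurm/slurm-20.11/
> >>    /opt/slurm/default -> slurm-19.05   (the soft link everything points at)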
> >>
> >>   This is the first time I've done an upgrade of slurm and I had been
> >> hoping to do a rolling upgrade, as opposed to waiting for all the jobs to
> >> finish on all the compute nodes and then switching across, but I don't
> >> see how I can do it with this setup.  Does anyone have any experience of
> >> this?
>
>