[slurm-users] Running mix versions of slurm while upgrading
Andy Georges
andy.georges at ugent.be
Mon Oct 21 06:35:35 UTC 2019
Hi Tony,
On Mon, Oct 21, 2019 at 01:52:21AM +0000, Tony Racho wrote:
> Hi:
>
> We are planning to upgrade our slurm cluster; however, we plan on NOT doing it in one go.
>
> We are on 18.08.7 at the moment (db, controller, clients)
>
> We'd like to do it in a phased approach.
>
> Stop communication between controller and slurmdbd while updating slurmdbd to 19.05.X.
>
> Concurrently, we will update our primary controller to 19.05.X while the backup controller takes over the primary's chores (and then the backup controller will also be upgraded to 19.05.X).
>
> Once the primary controller has been updated to 19.05.X, it will obviously take the cluster back, but the clients will still be on 18.08.7 -- will there be any issues with this set-up? If this works, we will then choose a subset of clients and upgrade them to 19.05.X while the others stay on 18.08.7, until all the clients have been upgraded to 19.05.X.
>
> My question is: will the process/set-up above work? Will the clients still be able to communicate with the controller without any unintended effects or issues? Has anyone gone through this process?
>
> Once all the controllers and clients are upgraded to 19.05.X, resume communication between the controllers and the slurmdbd.
We do not have backup controllers at the moment, so I cannot comment on
the exact situation you are in. However, we upgraded from 17.11 to 19.05
as follows.
- Update slurmdbd (we first took a dump of the VM, after shutting down
  slurmdbd and mariadb, and started a new VM from this dump to verify that
  the update would work without any problems -- this turned out to be the
  case, so we could proceed on the production machine)
- For each cluster (a rough command sketch of these steps follows after this list):
  - Increase the timeouts for slurmctld and slurmd to an hour
- Set the partitions to DOWN so no new jobs would start
- Backup the slurm spool dir (state save location), just in case
- Bring down slurmctld and update the masters
- Update slurmd on all the worker nodes; they picked up the running
jobs and started chatting to the updated masters
  - Lower the timeouts again to their original values
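Roughly, as shell commands, that procedure could look like the sketch
below. Take it with a grain of salt: the partition name "batch", the
database name "slurm_acct_db", the state save path and the systemd unit
names are assumptions on my part, so check them against your own
slurm.conf and packaging (credentials for mysqldump are omitted too).

  # 0. Back up the accounting database before touching slurmdbd
  #    (we cloned the VM; something like a plain mysqldump works as well)
  systemctl stop slurmdbd
  mysqldump --single-transaction slurm_acct_db > slurm_acct_db.sql
  # ... upgrade the slurmdbd package ...
  systemctl start slurmdbd

  # 1. Raise SlurmctldTimeout/SlurmdTimeout in slurm.conf (e.g. to 3600)
  #    and push the new config
  scontrol reconfigure

  # 2. Keep new jobs from starting ("batch" is a placeholder partition)
  scontrol update PartitionName=batch State=DOWN

  # 3. Back up the state save location (path = your StateSaveLocation)
  tar czf slurmctld-state.tar.gz /var/spool/slurm/ctld

  # 4. Stop slurmctld, upgrade the controller packages, start it again
  systemctl stop slurmctld
  # ... upgrade packages ...
  systemctl start slurmctld

  # 5. On each worker node: upgrade the slurmd package and restart it;
  #    running jobs are picked up again
  systemctl restart slurmd

  # 6. Reopen the partition and lower the timeouts back again
  scontrol update PartitionName=batch State=UP
  scontrol reconfigure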
Now, if you have a backup controller as well, I suppose you can bring it
down at the same time as you update the primary controller?
We were told that in principle we could also just update each worker node
as it became idle, and thus have two versions of slurmd running at the same
time -- and this is more or less what happened, only the duration of that
period was rather short, as the nodes were updated at roughly the same time
(with some random delays so as not to overload the repo server). A sketch of
that rolling approach follows below.
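If you go that rolling route, a minimal sketch would be to drain a node,
wait for it to become idle, upgrade it and resume it. The node name
"node001" and the systemd unit are just examples:

  # Stop new jobs from landing on the node
  scontrol update NodeName=node001 State=DRAIN Reason="slurm upgrade"

  # Wait until it shows up as idle/drained
  sinfo -n node001 -o "%N %T"

  # ... upgrade the slurmd package on node001 ...
  systemctl restart slurmd      # on node001
  scontrol update NodeName=node001 State=RESUME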
>
> While doing the upgrade the following scenario will take place.
>
> slurmdbd - 19.05.X (but not communicating with the controllers)
I don't think there is any need to keep it from talking to the controllers.
Once the slurmdbd update is complete, you can let the controllers talk to it
again, since slurmdbd knows how to handle traffic from daemons up to two
major versions behind it. When slurmdbd is down, nothing can talk to it :)
When it is back up, everything can proceed as normal. A quick way to
sanity-check the versions during that period is sketched below.
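Just a sketch of that sanity check -- the exact output differs a bit
between releases:

  slurmdbd -V          # should now report 19.05.x
  scontrol version     # controllers/clients may still report 18.08.x
  sacctmgr show cluster format=Cluster,ControlHost,RPC   # dbd's view of the registered slurmctld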
Hope this helps a bit,
-- Andy