[slurm-users] Rolling upgrade of compute nodes
Chris Samuel
chris at csamuel.org
Mon May 30 17:06:52 UTC 2022
On 30/5/22 3:01 am, byron wrote:
> The one thing I'm unsure about is as much a Linux / NFS issue as a
> Slurm one. When I change the soft link for "default" to point to the
> new 20.11 Slurm install while all the compute nodes are still running
> the old 19.05 version because they haven't been restarted yet, will
> that not cause any problems? Or will they still just see the same old
> 19.05 version of Slurm that they are running until they are restarted?
That may cause issues. Whilst the ASAP flag to scontrol reboot
guarantees no new jobs will start on the selected nodes until after
they've rebooted, that doesn't (and shouldn't) stop srun from starting
new job steps on them for jobs that are already running.
If you switch that symlink, those jobs will pick up the 20.11 srun binary
while the node's slurmd is still 19.05, and that's where you may come
unstuck.
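As a rough illustration of that mismatch, here is a minimal Python sketch that compares the release of the srun on the path with that of the running slurmd. It assumes both `srun --version` and `slurmd -V` print a line of the form `slurm 20.11.8`; the helper names are hypothetical and not part of Slurm.

```python
import re

def major_release(version_output):
    """Return (major, minor) parsed from text such as 'slurm 20.11.8'."""
    m = re.search(r"(\d+)\.(\d+)", version_output)
    if m is None:
        raise ValueError(f"no version found in {version_output!r}")
    return int(m.group(1)), int(m.group(2))

def client_newer_than_daemon(srun_output, slurmd_output):
    """True when srun comes from a newer release than the running slurmd."""
    return major_release(srun_output) > major_release(slurmd_output)
```

Feeding it the output captured on a node that has not yet rebooted (for example "slurm 20.11.8" for srun and "slurm 19.05.5" for slurmd) would flag exactly the situation described above.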
This is one of the reasons why we do everything with Slurm installed via
RPM inside an image: you have a pretty straightforward A -> B transition.
If your symlink was node-local in some way (say, created at boot time by
some config management system before slurmd starts), then that could work
around the problem, as the nodes would still see the Slurm binaries that
match the running slurmd.
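One way such a node-local link could be set at boot is sketched below in Python. This is only an illustration under assumptions: the function name is hypothetical, the link is assumed to live on node-local storage, and the target is whichever versioned install directory the node should use.

```python
import os

def point_default_at(link_path, target_dir):
    """Atomically make link_path a symlink to target_dir."""
    tmp = f"{link_path}.tmp.{os.getpid()}"
    if os.path.lexists(tmp):
        os.remove(tmp)          # clear a stale temporary link from an earlier run
    os.symlink(target_dir, tmp)
    # Renaming over the old link replaces the link itself, not its target,
    # so readers never see a missing "default".
    os.replace(tmp, link_path)
```

Run from a boot-time hook before slurmd starts, each node would only move to the new install when it reboots, rather than when a shared link is changed.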
Best of luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA