[slurm-users] Rolling upgrade of compute nodes

Chris Samuel chris at csamuel.org
Mon May 30 17:06:52 UTC 2022


On 30/5/22 3:01 am, byron wrote:

> The one thing I'm unsure about is as much a Linux / NFS issue as a 
> Slurm one.  When I change the soft link for "default" to point to the 
> new 20.11 Slurm install while all the compute nodes are still running 
> the old 19.05 version because they haven't been restarted yet, will 
> that not cause any problems?  Or will they just keep seeing the same 
> old 19.05 version of Slurm that they are running until they are restarted?

That may cause issues. Whilst the ASAP flag to scontrol reboot 
guarantees no new jobs will start on the selected nodes until after 
they've rebooted, that doesn't (and shouldn't) stop new job steps 
launched with srun by already-running jobs from starting on them.

If you switch that symlink, those jobs will pick up the 20.11 srun binary 
while the slurmd on their nodes is still 19.05, and that's where you may 
come unstuck.

This is one of the reasons why we do everything with Slurm installed via 
RPM inside an image: you have a pretty straightforward A -> B transition.

If your symlink were node-local in some way (say, created at boot time by 
some config management system before slurmd starts), then that could work 
around the problem, as the nodes would still see the appropriate Slurm 
binaries for the running slurmd.
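
As a rough illustration of the node-local idea, here is a minimal Python 
sketch of a helper that a boot-time script or config management hook 
could call before slurmd starts. The function name repoint_symlink and 
any paths you pass to it are hypothetical and not taken from any real 
site setup; it only shows one way to swap a symlink atomically on a 
local filesystem.

```python
import os


def repoint_symlink(link_path: str, target: str) -> None:
    """Atomically make link_path a symlink pointing at target.

    The new link is created under a temporary name in the same
    directory and then renamed over the old one. rename() replaces
    the old symlink itself in a single step, so a process resolving
    link_path sees either the old target or the new one.
    """
    # Temporary name alongside the final link (same filesystem).
    tmp_path = f"{link_path}.tmp.{os.getpid()}"
    os.symlink(target, tmp_path)
    try:
        os.replace(tmp_path, link_path)
    except OSError:
        # Clean up the temporary link if the rename failed.
        os.unlink(tmp_path)
        raise
```

For example, a boot script on each node might call 
repoint_symlink("/run/slurm/default", "/opt/slurm/20.11"), where both 
paths are made up for illustration. The link itself has to live on 
node-local storage, not on the shared NFS export, so that nodes which 
have not rebooted yet keep resolving it to their 19.05 install.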

Best of luck!
Chris
-- 
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


