[slurm-users] Upgrading slurm - can I do it while jobs running?

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Wed May 26 18:48:49 UTC 2021


On 26-05-2021 20:23, Will Dennis wrote:
> About to embark on my first Slurm upgrade (building from source now, 
> into a versioned path /opt/slurm/<vernum>/ which is then symlinked to 
> /opt/slurm/current/ for the “in-use” one…) This is a new cluster, 
> running 20.11.5 (which we now know has a CVE that was fixed in 20.11.7) 
> but I have researchers running jobs on it currently. As I’m still 
> building out the cluster, I found today that all Slurm source tarballs 
> before 20.11.7 were withdrawn by SchedMD. So, need to upgrade at least 
> the -ctld and -dbd nodes before I can roll any new nodes out on 20.11.7…
> 
> As I have at least one researcher that is running some long multi-day 
> jobs, can I down the -dbd and -ctld nodes and upgrade them, then put 
> them back online running the new (latest) release, without munging the 
> jobs on the running worker nodes?

I strongly recommend reading the SchedMD presentations on the 
https://slurm.schedmd.com/publications.html page, especially the "Field 
Notes" documents.  The latest one is "Field Notes 4: From The Frontlines 
of Slurm Support" by Jason Booth, SchedMD.

We routinely upgrade Slurm while the nodes are in production. There is 
a required order of upgrading: first slurmdbd, then slurmctld, then the 
slurmd nodes, and finally the login nodes; see
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
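In outline, assuming systemd unit names and a MySQL/MariaDB accounting
database with the default database name (a sketch; the wiki page above
has the full, tested procedure, including raising timeouts first):

  # 1) Database node: back up the accounting DB, then upgrade slurmdbd
  mysqldump slurm_acct_db > slurm_acct_db.sql
  systemctl stop slurmdbd
  # ... install the new version ...
  systemctl start slurmdbd    # converts the DB schema if needed

  # 2) Controller node
  systemctl stop slurmctld
  # ... install the new version ...
  systemctl start slurmctld

  # 3) Compute nodes, a few at a time; running jobs are unaffected
  systemctl restart slurmd

  # 4) Finally, update the client commands on the login nodes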

The detailed upgrading commands for CentOS are in
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-on-centos-7
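In short, the RPM route there is (a rough sketch; see the page for the
complete commands and build prerequisites):

  # Build RPM packages from the release tarball
  rpmbuild -ta slurm-20.11.7.tar.bz2

  # Then upgrade the resulting packages on each node type, in the
  # slurmdbd -> slurmctld -> slurmd order described above
  yum upgrade ~/rpmbuild/RPMS/x86_64/slurm-*20.11.7*.rpm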

We haven't had any problems with jobs running across upgrades, but 
perhaps others can share their experiences?

/Ole


