[slurm-users] Upgrading slurm - can I do it while jobs running?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Wed May 26 18:48:49 UTC 2021
On 26-05-2021 20:23, Will Dennis wrote:
> About to embark on my first Slurm upgrade (building from source now,
> into a versioned path /opt/slurm/<vernum>/ which is then symlinked to
> /opt/slurm/current/ for the “in-use” one…) This is a new cluster,
> running 20.11.5 (which we now know has a CVE that was fixed in 20.11.7)
> but I have researchers running jobs on it currently. As I’m still
> building out the cluster, I found today that all Slurm source tarballs
> before 20.11.7 were withdrawn by SchedMD. So, need to upgrade at least
> the -ctld and -dbd nodes before I can roll any new nodes out on 20.11.7…
>
> As I have at least one researcher that is running some long multi-day
> jobs, can I down the -dbd and -ctld nodes and upgrade them, then put
> them back online running the new (latest) release, without munging the
> jobs on the running worker nodes?
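A versioned install with a "current" symlink, as you describe, works
well for this; a minimal sketch of the switch-over (paths follow your
example, the version number is the one you are building):

    # Install the new release into its own prefix:
    ./configure --prefix=/opt/slurm/20.11.7 && make && make install
    # Repoint the "current" symlink at the new tree:
    ln -sfn /opt/slurm/20.11.7 /opt/slurm/current

The daemons still have to be restarted in the right order to pick up
the new binaries; that restart order is the real question here.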
I strongly recommend reading the SchedMD presentations on the
https://slurm.schedmd.com/publications.html page, especially the "Field
Notes" documents. The latest one is "Field Notes 4: From The Frontlines
of Slurm Support" by Jason Booth, SchedMD.
We upgrade Slurm continuously while the nodes are in production. There
is a required upgrade order: first slurmdbd, then slurmctld, then the
slurmd nodes, and finally the login nodes; see
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
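To make the order concrete, here is a minimal sketch of one upgrade
pass, assuming the usual systemd unit names (slurmdbd, slurmctld,
slurmd) and that the new version has already been installed on a node
before its daemon is restarted:

    # Database node: stop the daemon and back up the accounting
    # database (slurmdbd converts the schema on first start of a
    # new version):
    systemctl stop slurmdbd
    mysqldump slurm_acct_db > /root/slurm_acct_db.sql
    # ... install the new version here ...
    systemctl start slurmdbd

    # Controller node, once slurmdbd is confirmed healthy:
    systemctl stop slurmctld
    # ... install the new version here ...
    systemctl start slurmctld

    # Each compute node, rolling; restarting slurmd does not kill
    # running job steps, which are owned by slurmstepd:
    systemctl stop slurmd
    # ... install the new version here ...
    systemctl start slurmd

Raising SlurmdTimeout/SlurmctldTimeout before a slow upgrade is also
commonly advised, so that nodes are not marked down while their
daemons are stopped.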
The detailed upgrade commands for CentOS are in
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-on-centos-7
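On CentOS that boils down to building RPMs from the release tarball
and upgrading them in place, roughly (a sketch; the exact package set
differs per node type):

    # Build RPMs from the new tarball (it ships its own slurm.spec):
    rpmbuild -ta slurm-20.11.7.tar.bz2
    # Then upgrade the relevant packages on each node, e.g.:
    yum upgrade ~/rpmbuild/RPMS/x86_64/slurm*-20.11.7-*.rpm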
We haven't had any problems with jobs that keep running across
upgrades, but perhaps others can share their experiences?
/Ole