[slurm-users] Upgrading slurm - can I do it while jobs running?
pedmon at cfa.harvard.edu
Wed May 26 18:58:26 UTC 2021
We generally pause scheduling during upgrades out of paranoia more than
anything. What that means is that we set all our partitions to DOWN and
suspend all the jobs. Then we do the upgrade. That said, I know of
people who do it live without much trouble.
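
For anyone who wants to script the pause described above, here is a minimal dry-run sketch in Python. It only builds and prints the scontrol commands and does not execute anything. The partition names and job IDs are made up for illustration, and the order used when resuming (jobs first, then partitions) is my assumption, not necessarily anyone's actual site procedure. In practice the lists could come from something like "sinfo -h -o %R" and "squeue -h -t RUNNING -o %i".

```python
import shlex


def pause_commands(partitions, running_job_ids):
    """Return the scontrol commands (as argv lists) that would pause scheduling."""
    # Mark every partition DOWN so that no new jobs are started.
    commands = [
        ["scontrol", "update", f"PartitionName={name}", "State=DOWN"]
        for name in partitions
    ]
    # Suspend the running jobs in one call using a comma-separated job list.
    if running_job_ids:
        job_list = ",".join(str(job_id) for job_id in running_job_ids)
        commands.append(["scontrol", "suspend", job_list])
    return commands


def resume_commands(partitions, suspended_job_ids):
    """Return the commands that would undo pause_commands() after the upgrade."""
    commands = []
    # Resume the suspended jobs first (an assumed ordering, see the note above).
    if suspended_job_ids:
        job_list = ",".join(str(job_id) for job_id in suspended_job_ids)
        commands.append(["scontrol", "resume", job_list])
    # Then reopen the partitions for scheduling.
    commands += [
        ["scontrol", "update", f"PartitionName={name}", "State=UP"]
        for name in partitions
    ]
    return commands


if __name__ == "__main__":
    # Hypothetical partition names and job IDs, for illustration only.
    for argv in pause_commands(["general", "gpu"], [1001, 1002]):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before running anything by hand.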
The risk is more substantial for major version upgrades than for minor
ones. So if you are doing a minor version upgrade, it's likely fine to do
live. For a major version upgrade I would recommend at least pausing all
the jobs.
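
To make the major/minor distinction concrete, here is a small sketch that classifies an upgrade from two version strings. It assumes the usual Slurm numbering in which the first two fields (for example 20.11) identify the major release and the third field is the maintenance release. The function names are hypothetical.

```python
def release_of(version):
    """Return the leading two fields of a version string such as '20.11.7'."""
    fields = version.split(".")
    return int(fields[0]), int(fields[1])


def upgrade_kind(old, new):
    """Classify an upgrade as 'minor' (same major release) or 'major'."""
    # Assumption: the first two fields name the major release.
    return "minor" if release_of(old) == release_of(new) else "major"


if __name__ == "__main__":
    # The upgrade asked about in this thread stays within the 20.11 release.
    print(upgrade_kind("20.11.5", "20.11.7"))
```

Under that assumption, going from 20.11.5 to 20.11.7 counts as a minor upgrade.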
On 5/26/2021 2:48 PM, Ole Holm Nielsen wrote:
> On 26-05-2021 20:23, Will Dennis wrote:
>> About to embark on my first Slurm upgrade (building from source now,
>> into a versioned path /opt/slurm/<vernum>/ which is then symlinked to
>> /opt/slurm/current/ for the “in-use” one…) This is a new cluster,
>> running 20.11.5 (which we now know has a CVE that was fixed in
>> 20.11.7) but I have researchers running jobs on it currently. As I’m
>> still building out the cluster, I found today that all Slurm source
>> tarballs before 20.11.7 were withdrawn by SchedMD. So, need to
>> upgrade at least the -ctld and -dbd nodes before I can roll any new
>> nodes out on 20.11.7…
>> As I have at least one researcher that is running some long multi-day
>> jobs, can I down the -dbd and -ctld nodes and upgrade them, then put
>> them back online running the new (latest) release, without munging
>> the jobs on the running worker nodes?
> I strongly recommend reading the SchedMD presentations in the
> https://slurm.schedmd.com/publications.html page, especially the
> "Field notes" documents. The latest one is "Field Notes 4: From The
> Frontlines of Slurm Support", Jason Booth, SchedMD.
> We upgrade Slurm continuously while the nodes are in production mode.
> There's a required order of upgrading: first slurmdbd, then slurmctld,
> then slurmd nodes, and finally login nodes, see
> The detailed upgrading commands for CentOS are in
> We don't have any problems with running jobs across upgrades, but
> perhaps others can share their experiences?
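
The upgrade order quoted above (slurmdbd, then slurmctld, then slurmd nodes, then login nodes) can be sketched as a simple sort over a host-to-role mapping. This is only an illustration of the ordering. The host names and the mapping are hypothetical, and the sketch does not run any upgrade commands.

```python
# Order taken from the quoted message: slurmdbd, slurmctld, slurmd, login nodes.
UPGRADE_ORDER = ("slurmdbd", "slurmctld", "slurmd", "login")


def upgrade_plan(host_roles):
    """Return host names sorted into the order in which they should be upgraded."""
    rank = {role: position for position, role in enumerate(UPGRADE_ORDER)}
    # Refuse to guess where an unrecognized role belongs.
    unknown = sorted(set(host_roles.values()) - set(rank))
    if unknown:
        raise ValueError(f"unknown roles: {unknown}")
    # Sort by role position first, then by host name for a stable listing.
    return sorted(host_roles, key=lambda host: (rank[host_roles[host]], host))


if __name__ == "__main__":
    # Hypothetical cluster layout, for illustration only.
    cluster = {
        "node02": "slurmd",
        "login01": "login",
        "ctl01": "slurmctld",
        "db01": "slurmdbd",
        "node01": "slurmd",
    }
    for host in upgrade_plan(cluster):
        print(host, cluster[host])
```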