[slurm-users] Rolling upgrade of compute nodes

Stephan Roth stephan.roth at ee.ethz.ch
Mon May 30 19:39:36 UTC 2022


I can confirm that what Ümit did worked for my setup as well.

But as I mentioned before, if there's any doubt, try the upgrade in a 
test environment first.
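
For reference, this is roughly the sequence we used, assuming RPM-based
nodes and systemd (package and unit names may differ on your distribution,
and back up the accounting database before you start):

  # 1. Upgrade the accounting daemon first
  systemctl stop slurmdbd
  yum update slurm-slurmdbd
  systemctl start slurmdbd

  # 2. Then the controller
  systemctl stop slurmctld
  yum update slurm slurm-slurmctld
  systemctl start slurmctld

  # 3. Finally the compute nodes, without draining: running jobs keep
  #    their old slurmstepd, new jobs start under the new slurmd
  yum update slurm slurm-slurmd
  systemctl restart slurmd

This is only a sketch of what worked here, not an official procedure, but
the order (slurmdbd, then slurmctld, then slurmd) matches what the Slurm
upgrade notes recommend.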

Cheers,
Stephan

On 30.05.22 21:06, Ümit Seren wrote:
> We did a couple of major and minor SLURM upgrades without draining the 
> compute nodes.
> 
> Once slurmdbd and slurmctld were updated to the new major version, we 
> did a package update on the compute nodes and restarted slurmd on them.
> 
> The existing running jobs continued to run fine, and new jobs on the same 
> compute nodes were started by the updated slurmd daemon and also worked fine.
> 
> 
> So, for us this worked smoothly.
> 
> Best
> 
> Ümit
> 
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of 
> Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
> *Date: *Monday, 30. May 2022 at 20:58
> *To: *slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject: *Re: [slurm-users] Rolling upgrade of compute nodes
> 
> On 30-05-2022 19:34, Chris Samuel wrote:
>> On 30/5/22 10:06 am, Chris Samuel wrote:
>> 
>>> If you switch that symlink those jobs will pick up the 20.11 srun 
>>> binary and that's where you may come unstuck.
>> 
>> Just to quickly fix that: srun talks to slurmctld (which would also be 
>> 20.11 for you), slurmctld will talk to the slurmd's running the job 
>> (which would be 19.05, so OK), but then the slurmd would try to launch a 
>> 20.11 slurmstepd, and that is where I suspect things could come undone.
> 
> How about restarting all slurmd's at version 20.11 in one shot?  No
> reboot will be required.  There will be running 19.05 slurmstepd's for
> the running job steps, even though slurmd is at 20.11.  You could
> perhaps restart the 20.11 slurmd one partition at a time, starting with a
> small partition, to verify that it works correctly.
> 
> I think we have done this successfully when installing new RPMs on *all*
> compute nodes in one shot, and I'm not aware of any job crashes.  Your
> mileage may vary depending on job types!
> 
> Question: Does anyone have bad experiences with upgrading slurmd while
> the cluster is running production?
> 
> /Ole
> 
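
Regarding Ole's suggestion of going one partition at a time: a rough sketch
of how that could look, assuming pdsh is available and slurmd runs under
systemd ("testpart" is just a placeholder partition name):

  # list the nodes of one small partition to try first
  sinfo -h -p testpart -o '%N'

  # upgrade the packages and restart slurmd on exactly those nodes
  pdsh -w "$(sinfo -h -p testpart -o '%N')" \
    'yum -y update slurm slurm-slurmd && systemctl restart slurmd'

  # confirm a node reports the new slurmd version before moving on
  scontrol show node <nodename> | grep -i version

Again only a sketch; we would watch a few running jobs on the restarted
nodes before touching the next partition.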


--
ETH Zurich
Stephan Roth
Systems Administrator
IT Support Group (ISG)
D-ITET
ETF D 104
Sternwartstrasse 7
8092 Zurich

Phone +41 44 632 30 59
stephan.roth at ee.ethz.ch
www.isg.ee.ethz.ch

Working days: Mon,Tue,Thu,Fri