[slurm-users] Rolling upgrade of compute nodes
Stephan Roth
stephan.roth at ee.ethz.ch
Mon May 30 19:39:36 UTC 2022
I can confirm that what Ümit did worked for my setup as well.
But as I mentioned before, if there's any doubt, try the upgrade in a
test environment first.
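For reference, the order we follow matches the recommended one: slurmdbd
first, then slurmctld, then slurmd on the compute nodes. A rough sketch
(exact package and service names depend on your distribution and on how
Slurm was packaged at your site):

  # 1. back up the accounting database, then upgrade slurmdbd
  systemctl stop slurmdbd && <upgrade slurmdbd package> && systemctl start slurmdbd
  # 2. upgrade the controller
  systemctl stop slurmctld && <upgrade slurmctld package> && systemctl start slurmctld
  # 3. on each compute node: upgrade the packages, then restart slurmd;
  #    running jobs keep their old slurmstepd and are not disturbed
  <upgrade slurm/slurmd packages> && systemctl restart slurmd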
Cheers,
Stephan
On 30.05.22 21:06, Ümit Seren wrote:
> We did a couple of major and minor SLURM upgrades without draining the
> compute nodes.
>
> Once slurmdbd and slurmctld were updated to the new major version, we
> did a package update on the compute nodes and restarted slurmd on them.
>
> The existing running jobs continued to run fine, and new jobs on the same
> compute nodes, started by the updated slurmd daemon, also worked fine.
>
>
> So, for us this worked smoothly.
>
> Best
>
> Ümit
>
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
> *Date: *Monday, 30. May 2022 at 20:58
> *To: *slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject: *Re: [slurm-users] Rolling upgrade of compute nodes
>
> On 30-05-2022 19:34, Chris Samuel wrote:
>> On 30/5/22 10:06 am, Chris Samuel wrote:
>>
>>> If you switch that symlink those jobs will pick up the 20.11 srun
>>> binary and that's where you may come unstuck.
>>
>> Just to quickly fix that, srun talks to slurmctld (which would also be
>> 20.11 for you), slurmctld will talk to the slurmd's running the job
>> (which would be 19.05, so OK) but then the slurmd would try and launch a
>> 20.11 slurmstepd and that is where I suspect things could come undone.
>
> How about restarting all slurmd's at version 20.11 in one shot? No
> reboot will be required. There will be running 19.05 slurmstepd's for
> the running job steps, even though slurmd is at 20.11. You could
> perhaps restart 20.11 slurmd one partition at a time in order to see if
> it works correctly on a small partition of the cluster.
>
> I think we have done this successfully when installing new RPMs on *all*
> compute nodes in one shot, and I'm not aware of any job crashes. Your
> mileage may vary depending on job types!
>
> Question: Does anyone have bad experiences with upgrading slurmd while
> the cluster is running production?
>
> /Ole
>
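If you want to try Ole's partition-at-a-time suggestion, something along
these lines should do (untested sketch; "test" is a placeholder partition
name, and clush can be replaced by pdsh or whatever you use for parallel
ssh):

  # nodes of a small test partition, as a compact hostlist
  sinfo -h -p test -o '%N'
  # restart the upgraded slurmd on exactly those nodes
  clush -w "$(sinfo -h -p test -o '%N')" 'systemctl restart slurmd'

Running jobs keep their old slurmstepd processes, so they should not notice
the restart.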
--
ETH Zurich
Stephan Roth
Systems Administrator
IT Support Group (ISG)
D-ITET
ETF D 104
Sternwartstrasse 7
8092 Zurich
Phone +41 44 632 30 59
stephan.roth at ee.ethz.ch
www.isg.ee.ethz.ch
Working days: Mon,Tue,Thu,Fri