[slurm-users] srun: error: io_init_msg_unpack: unpack error

Chris Samuel chris at csamuel.org
Sat Aug 6 19:13:17 UTC 2022


On 6/8/22 10:43 am, David Magda wrote:

> It seems that the new srun(1) cannot talk to the old slurmd(8).
> 
> Is this 'on purpose'? Does the backwards compatibility of the protocol not extend to srun(1)?

That's expected; what you're hoping for here is forward compatibility 
(an older slurmd understanding a newer srun).

Newer daemons know how to talk to older utilities, but it doesn't work 
the other way around.
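
As a rough illustration of that rule, here is a small shell sketch that 
flags the failing direction (client newer than daemon). The helper names 
to_num and client_too_new are hypothetical, the version strings are just 
examples, and comparing only the first two fields of the version is an 
assumption on my part, not something Slurm itself defines this way.

```shell
# to_num: turn "22.05.2" into 2205 (first two fields only; an assumption).
to_num() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    minor=${minor#0}    # drop one leading zero so "08" is not parsed as octal
    echo $(( major * 100 + ${minor:-0} ))
}

# client_too_new CLIENT DAEMON: succeeds when the client is newer than
# the daemon, which is the direction that does not work.
client_too_new() {
    [ "$(to_num "$1")" -gt "$(to_num "$2")" ]
}

# Example versions, made up for illustration.
if client_too_new 22.05.2 21.08.8; then
    echo "srun is newer than slurmd: expect unpack errors"
fi
```

In practice you would feed it the versions reported by "srun --version" 
on the login node and "slurmd -V" on the compute node.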

What we do in this situation is upgrade slurmdbd, then slurmctld, and 
then switch our compute node images to ones that have the new Slurm 
version. Before we bring partitions back up we issue an "scontrol reboot 
ASAP nextstate=resume" for all the compute nodes.

This means existing jobs will keep going, but no new jobs will start on 
compute nodes with older versions of Slurm from that point on. As jobs 
on nodes finish, the nodes get rebooted into the new images and will 
accept jobs again. The "ASAP" flag drains the node; once the node has 
successfully started its slurmd as the final thing on boot, it undrains. 
slurmctld is also smart about planning its scheduling for this 
situation.
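
A minimal sketch of that reboot step, written as a dry run that only 
prints the command. The function name reboot_cmd and the node list 
"node[001-004]" are placeholders of mine; substitute your own hostlist 
expression and drop the printf wrapper once you are happy with it.

```shell
#!/bin/sh
# Dry run: print the reboot command rather than executing it.
# NODES is a hypothetical node list, not one from the original post.
NODES="${NODES:-node[001-004]}"

reboot_cmd() {
    # ASAP drains each node so no new jobs start on it; nextstate=resume
    # returns the node to service once slurmd is back after the reboot.
    printf 'scontrol reboot ASAP nextstate=resume %s\n' "$1"
}

reboot_cmd "$NODES"
```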

It's also safe to restart slurmd's with running jobs, though you may 
want to drain the nodes before that so slurmctld won't try to send them 
a job in the middle.
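
The drain-then-restart sequence could look roughly like this. It is a 
dry-run sketch: restart_slurmd_steps is a hypothetical helper that only 
prints the steps, and it assumes slurmd runs as a systemd unit named 
"slurmd" that you can reach over ssh, which may not match your site.

```shell
# Hypothetical helper: print the steps for one node instead of running them.
restart_slurmd_steps() {
    node="$1"
    # Drain first so slurmctld stops placing new jobs; running jobs continue.
    printf 'scontrol update nodename=%s state=drain reason="slurmd restart"\n' "$node"
    # Assumes a systemd unit called slurmd and ssh access to the node.
    printf 'ssh %s systemctl restart slurmd\n' "$node"
    # Put the node back into service afterwards.
    printf 'scontrol update nodename=%s state=resume\n' "$node"
}

restart_slurmd_steps node001
```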

The one issue where backwards compatibility in the Slurm protocol can't 
help is if incompatible config file changes are needed. Then you need to 
bite the bullet and upgrade the slurmd's and commands at the same time 
everywhere the new config file goes (and for those of us running in 
configless mode that means everywhere).

Hope this helps! All the best,
Chris
-- 
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



