[slurm-users] srun: error: io_init_msg_unpack: unpack error
dmagda+slurmu at ee.torontomu.ca
Mon Aug 8 23:11:04 UTC 2022
On Aug 6, 2022, at 15:13, Chris Samuel <chris at csamuel.org> wrote:
> On 6/8/22 10:43 am, David Magda wrote:
>> It seems that the new srun(1) cannot talk to the old slurmd(8).
>> Is this 'on purpose'? Does the backwards compatibility of the protocol not extend to srun(1)?
> That's expected, what you're hoping for here is forward compatibility.
> Newer daemons know how to talk to older utilities, but it doesn't work the other way around.
> What we do in this situation is upgrade slurmdbd, then slurmctld, change our compute-node images to ones that have the new Slurm version, and then, before we bring partitions back up, issue an "scontrol reboot ASAP nextstate=resume" for all the compute nodes.
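For anyone following along, the order described above can be sketched as a command sequence; the node range and systemd unit names here are illustrative and will differ per site:

```shell
# 1. Upgrade and restart the accounting daemon first
systemctl restart slurmdbd
# 2. Then the controller
systemctl restart slurmctld
# 3. Stage the new Slurm version into the compute-node images, then
#    schedule reboots that fire as soon as each node drains of jobs,
#    returning the node to service afterwards:
scontrol reboot ASAP nextstate=resume node[001-100]
```

This works because newer daemons speak the older RPC protocol, so the upgraded slurmctld can keep managing not-yet-rebooted slurmd's during the transition.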
Cool. So the CLI stuff will be the last thing to ‘update’ (for us, by changing the place the link /opt/slurm points to).
> It's also safe to restart slurmd's with running jobs, though you may want to drain them before that so slurmctld won't try and send them a job in the middle.
My testing has shown that this is not the case: any running jobs are killed with signal 15 if I do a 'systemctl restart slurmd' or 'service slurmd restart'. Is there some flag in slurm.conf that allows jobs to continue uninterrupted?
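One likely culprit here is systemd rather than Slurm itself: by default systemd kills every process in a service's control group on restart, which includes the job steps slurmd spawned. The slurmd.service unit that Slurm ships sets KillMode=process to avoid exactly this; a unit file lacking it (e.g. a locally written one) would produce the SIGTERM behaviour described. A minimal sketch of the relevant fragment, assuming a typical unit layout:

```ini
# Excerpt of a slurmd.service unit (paths illustrative)
[Service]
Type=forking
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
# Without this, "systemctl restart slurmd" sends SIGTERM (signal 15)
# to the entire cgroup, killing running job steps along with slurmd.
# KillMode=process restricts the kill to the main slurmd process only.
KillMode=process
```

Worth checking with `systemctl cat slurmd` whether the unit in use actually carries this setting before looking for a slurm.conf knob.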