[slurm-users] srun: error: io_init_msg_unpack: unpack error
Chris Samuel
chris at csamuel.org
Sat Aug 6 19:13:17 UTC 2022
On 6/8/22 10:43 am, David Magda wrote:
> It seems that the new srun(1) cannot talk to the old slurmd(8).
>
> Is this 'on purpose'? Does the backwards compatibility of the protocol not extend to srun(1)?
That's expected; what you're hoping for here is forward compatibility.
Newer daemons know how to talk to older utilities, but it doesn't work
the other way around: an older slurmd can't understand a newer srun.
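To spot this kind of mismatch before users hit it, you can compare the client's release series with what each slurmd reports. This is a rough sketch only: the helper name client_newer_than_daemon is made up for illustration, it compares just the leading YY.MM of each version string, and the commented usage assumes "srun --version" prints something like "slurm 22.05.2" and that sinfo's %v field reports the running slurmd version.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when the client's release
# series (first two dotted fields, e.g. 22.05) is newer than the daemon's.
client_newer_than_daemon() {
    client=$(printf '%s' "$1" | cut -d. -f1,2 | tr -d .)
    daemon=$(printf '%s' "$2" | cut -d. -f1,2 | tr -d .)
    [ "$client" -gt "$daemon" ]
}

# Example use on a real cluster (not executed here):
#   client=$(srun --version | awk '{print $2}')
#   sinfo -h -N -o '%N %v' | sort -u | while read -r node daemon; do
#       client_newer_than_daemon "$client" "$daemon" &&
#           echo "$node: slurmd $daemon is older than srun $client"
#   done
```

Any node this flags is one where a newer srun would be talking to an older slurmd, which is the unsupported direction described above.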
What we do in this situation is upgrade slurmdbd, then slurmctld, then
change our images for compute nodes to ones that have the new Slurm
version. Before we bring partitions back up we issue an "scontrol
reboot ASAP nextstate=resume" for all the compute nodes.
This means existing jobs will keep going, but from that point on no new
jobs will start on compute nodes with older versions of Slurm. As jobs
on a node finish, the node gets rebooted into the new image and will
accept jobs again. The "ASAP" flag drains the node; once the node has
successfully started its slurmd as the final thing on boot, it is
undrained. slurmctld is also smart about planning its scheduling for
this situation.
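The reboot step above can be sketched as a small wrapper. This is only an illustration: the run and reboot_into_new_image names and the DRY_RUN switch are invented here, and the default node list "ALL" assumes you really do want every node; pass an explicit node list otherwise. By default the wrapper only prints the command.

```shell
#!/bin/sh
# Sketch only: prints commands by default; set DRY_RUN=0 to really run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Ask slurmctld to reboot nodes as they become idle and return them to
# service afterwards. $1 is a node list; "ALL" is used when it is omitted.
reboot_into_new_image() {
    run scontrol reboot ASAP nextstate=resume "${1:-ALL}"
}

# Example (dry run): reboot_into_new_image 'node[01-04]'
```

Running it dry first makes it easy to check the node list before anything is drained.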
It's also safe to restart slurmd's with running jobs, though you may
want to drain the nodes before that so slurmctld won't try to send them
a job in the middle of the restart.
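A drain, restart, resume cycle for a single node might look like the following. Treat it as a sketch under assumptions: the function names and the DRY_RUN switch are invented for illustration, the reason string is arbitrary, and the restart step assumes slurmd is managed by systemd and is run on the compute node itself.

```shell
#!/bin/sh
# Sketch only: prints commands by default; set DRY_RUN=0 to really run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop slurmctld from placing new jobs on the node; running jobs continue.
# $1: node name as known to Slurm.
drain_node() {
    run scontrol update NodeName="$1" State=DRAIN Reason=slurmd-restart
}

# Run on the compute node itself. Assumption: a systemd unit named slurmd.
restart_slurmd() {
    run systemctl restart slurmd
}

# Return the node to service once slurmd is back up.
resume_node() {
    run scontrol update NodeName="$1" State=RESUME
}

# Example (dry run): drain_node node01; restart_slurmd; resume_node node01
```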
The one case where backwards compatibility in the Slurm protocol can't
help is when incompatible config file changes are needed. Then you need
to bite the bullet and upgrade the slurmd's and commands at the same
time everywhere the new config file goes (and for those of us running
in configless mode that means everywhere).
Hope this helps! All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA