[slurm-users] srun: error: io_init_msg_unpack: unpack error
David Magda
dmagda+slurmu at ee.torontomu.ca
Sat Aug 6 17:43:13 UTC 2022
Hello,
We are testing the upgrade process of going from 20.11.9 to 22.05.2. The master server is running the 22.05.2 slurmctld/slurmdbd, and the compute nodes are (currently) running the 20.11.9 slurmd. We are running this 'mixed environment' because our production cluster has a reasonable number of nodes (~200), so it will take a while to get through them all.
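(For reference, the version mix is easy to confirm from the controller with something like the following; the node name is just an example from our test cluster:

$ scontrol show node wsgpu11 | grep Version
   Version=20.11.9
)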
Back to our smaller (test) cluster: things are generally working fine, in that jobs are scheduled, launched, and finish cleanly.
The main issue we're experiencing is with srun(1). If you execute the "new" binary, the following output is generated:
$ /opt/slurm-22.05.2/bin/srun --mem=8GB --gres=gpu:1 -p wsgpu --pty bash
srun: job 1939765 queued and waiting for resources
srun: job 1939765 has been allocated resources
srun: error: io_init_msg_unpack: unpack error
srun: error: io_init_msg_read_from_fd: io_init_msg_unpack failed: rc=-1
srun: error: failed reading io init message
If I SSH into the host manually (as root), I do see a shell session for my user running bash. Running the "old" binary:
$ /opt/slurm-20.11.9b/bin/srun --mem=8GB --gres=gpu:1 -p wsgpu --pty bash
srun: job 1939768 queued and waiting for resources
srun: job 1939768 has been allocated resources
dmagda at wsgpu11:~$
It seems that the new srun(1) cannot talk to the old slurmd(8).
Is this 'on purpose'? Does the backwards compatibility of the protocol not extend to srun(1)?
Is there any way around this, or should we simply upgrade slurmd(8) on the compute nodes, but leave the paths to the older user CLI utilities alone until all the compute nodes have been upgraded?
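(If we go that route, the idea would just be to keep a symlink pointing at the old tree and flip it once the rollout is done; the '/opt/slurm' link name below is illustrative, but the versioned prefixes are the ones mentioned above:

# users' PATH contains /opt/slurm/bin, where /opt/slurm is a symlink
ln -sfn /opt/slurm-20.11.9b /opt/slurm
# once every compute node is running the 22.05.2 slurmd, flip it
ln -sfn /opt/slurm-22.05.2 /opt/slurm
)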
Thanks for any info.
Regards,
David