[slurm-users] srun: error: io_init_msg_unpack: unpack error

Sat Aug 6 17:43:13 UTC 2022

Hello,

We are testing the upgrade from 20.11.9 to 22.05.2. The master server is running the 22.05.2 slurmctld/slurmdbd, and the compute nodes are (currently) still running the 20.11.9 slurmd. We are running this 'mixed environment' because our production cluster has a reasonable number of nodes (~200), so it will take a while to get through them all.

Back to our smaller (test) cluster: things are generally working fine in that jobs are scheduled, launched, and finish cleanly.

The main issue we're experiencing is with srun(1). If you execute the "new" binary the following output is generated:

	$ /opt/slurm-22.05.2/bin/srun --mem=8GB --gres=gpu:1 -p wsgpu --pty bash
	srun: job 1939765 queued and waiting for resources
	srun: job 1939765 has been allocated resources
	srun: error: io_init_msg_unpack: unpack error
	srun: error: io_init_msg_read_from_fd: io_init_msg_unpack failed: rc=-1
	srun: error: failed reading io init message

If I SSH into the host manually (as root), I do see a bash session running for my user, so the job step itself is launched. Running the "old" binary works as expected:

	$ /opt/slurm-20.11.9b/bin/srun --mem=8GB --gres=gpu:1 -p wsgpu --pty bash
	srun: job 1939768 queued and waiting for resources
	srun: job 1939768 has been allocated resources
	dmagda@wsgpu11:~$ 

It seems that the new srun(1) cannot talk to the old slurmd(8).

Is this 'on purpose'? Does the backwards compatibility of the protocol not extend to srun(1)?

Is there any way around this, or should we simply upgrade slurmd(8) on the compute nodes, but leave the paths to the older user CLI utilities alone until all the compute nodes have been upgraded?
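
To make that second option concrete, here is a rough sketch of a wrapper that chooses which srun to run based on the oldest slurmd release still present. It is only an illustration, not something we have deployed: release_of and pick_srun are hypothetical helper names, the two install prefixes are the ones from the transcripts above, and the commented-out part assumes that sinfo's %v field reports each node's slurmd version and that sort(1) supports -V.

```shell
#!/bin/sh
# Hypothetical helper: reduce a full version such as "20.11.9"
# to its release, "20.11".
release_of() {
    echo "$1" | cut -d. -f1,2
}

# Hypothetical helper: given the oldest slurmd release on the cluster,
# print the srun binary to use. The prefixes are the two installs
# shown in the transcripts above.
pick_srun() {
    case "$1" in
        20.11) echo /opt/slurm-20.11.9b/bin/srun ;;
        *)     echo /opt/slurm-22.05.2/bin/srun ;;
    esac
}

# On a live cluster the wrapper would end roughly like this
# (assumption: "sinfo -h -o %v" prints one slurmd version per line):
#   oldest=$(sinfo -h -o '%v' | sort -V | head -n 1)
#   exec "$(pick_srun "$(release_of "$oldest")")" "$@"
```

Once every node reports 22.05, the wrapper would fall through to the new binary without anyone having to change their PATH.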

Thanks for any info.

Regards,
David