I found the problem. This node was not trying to reach another machine; it was the other way around. Another machine, running a controller for a different Slurm cluster, had this node in its config, so that controller kept trying to reach this node. I removed the node from that config and all is fine now.
On Wed, Jun 5, 2024 at 1:12 PM Arnuld <arnuld@aganitha.ai> wrote:
I have built Slurm 23.11.7 on two machines, both running Ubuntu 22.04. Slurm runs fine on the first machine but not on the second. The first machine is both a controller and a node, while the second is just a node. On both machines, I built the Slurm Debian packages following the instructions in the Slurm docs. The slurmd logs on the second machine show this:
error: unpack_header: protocol_version 9472 not supported
error: unpacking header
error: destroy_forward: no init
error: slurm_receive_msg_and_forward: [[host-4.attlocal.net]:38960] failed: Message receive failure
error: service_connection: slurm_receive_msg: Message receive failure
debug: _service_connection: incomplete message
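
For anyone who hits the same error: the log already names both the unsupported protocol version and the host that sent the message, which is usually enough to track down a foreign controller. Below is a minimal Python sketch that pulls those two details out of slurmd log text. The regular expressions are written against the lines quoted above only. The release table is an assumption recalled from Slurm's protocol version header (major release number in the high byte) and should be checked against your own source tree. If that recollection is right, 9472 would point to a 21.08 sender, but nothing in this thread confirms which version the other controller ran.

```python
import re

# ASSUMPTION: mapping of the high byte of the protocol version to a Slurm
# release, recalled from Slurm's protocol header. Verify before relying on it.
ASSUMED_RELEASES = {37: "21.08", 38: "22.05", 39: "23.02", 40: "23.11"}

# Patterns written against the log lines quoted in this thread.
VERSION_RE = re.compile(r"protocol_version (\d+) not supported")
PEER_RE = re.compile(r"\[\[(?P<host>[^\]]+)\]:(?P<port>\d+)\]")


def summarize(log_text):
    """Return (unsupported protocol versions, sending peers) found in log_text."""
    versions = sorted({int(v) for v in VERSION_RE.findall(log_text)})
    peers = sorted(
        {(m.group("host"), int(m.group("port"))) for m in PEER_RE.finditer(log_text)}
    )
    return versions, peers


def guess_release(protocol_version):
    """Map a protocol version to a release name using the assumed table."""
    return ASSUMED_RELEASES.get(protocol_version >> 8, "unknown")


sample = """error: unpack_header: protocol_version 9472 not supported
error: slurm_receive_msg_and_forward: [[host-4.attlocal.net]:38960] failed: Message receive failure"""

versions, peers = summarize(sample)
for v in versions:
    print(f"protocol_version {v}: high byte {v >> 8}, assumed release {guess_release(v)}")
for host, port in peers:
    print(f"message came from {host} (source port {port})")
```

The host printed at the end is the machine to inspect: in my case it was a controller from another cluster that still listed this node in its config.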