slurmctld: slurm_bufs_sendto(msg_type=SRUN_STEP_SIGNAL) failed: Connection reset by peer - slurm-users

25 Jan 2024


Hi,
I am getting the following error in the slurmctld logs whenever I run a few srun steps in a batch job:
   Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: debug:  _send_timeout: Socket POLLERR: Connection reset by peer
   Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: error: slurm_send_node_msg: [socket:[921897]] slurm_bufs_sendto(msg_type=SRUN_STEP_SIGNAL) failed: Connection reset by peer
   Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: debug:  laying out the 1 tasks on 1 hosts compute2 dist 1
The Slurm version is 23.11.3, and an example sbatch file is:
   #!/bin/bash
   #SBATCH --job-name=slurm_test
   #SBATCH --mem=1gb
   #SBATCH --time=00:05:00
   #SBATCH --output=slurm_test_%j.log
   pwd; hostname; date
   srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" &
   srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" &
   wait
The more sruns I have in the script (backgrounded with & or not), the more often the error shows up. Is there anything I can do to fix it?
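To vary the number of sruns when reproducing this, a small generator along these lines could be used. This is only a sketch: gen_repro is a hypothetical helper name (not a Slurm tool), and the #SBATCH header and srun line are simply copied from the example file above.

```bash
#!/bin/bash
# Hypothetical helper (not part of Slurm): prints an sbatch script with
# N backgrounded srun steps, copying the header from the example file.
# The body runs in a subshell "( ... )" so n and i do not leak out.
gen_repro() (
    n="${1:-2}"   # number of srun steps to emit (default: 2)
    cat <<'EOF'
#!/bin/bash
#SBATCH --job-name=slurm_test
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --output=slurm_test_%j.log
pwd; hostname; date
EOF
    i=0
    while [ "$i" -lt "$n" ]; do
        # Same srun line as in the original script, one per step.
        echo 'srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" &'
        i=$((i + 1))
    done
    echo 'wait'
)
```

For example, `gen_repro 8 > slurm_test_8.sbatch` followed by `sbatch slurm_test_8.sbatch` would submit a job with eight steps, which makes it easier to check whether the number of errors in the log tracks the number of sruns.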
Best
/rike