[slurm-users] slurmctld: slurm_bufs_sendto(msg_type=SRUN_STEP_SIGNAL) failed: Connection reset by peer

Rike-Benjamin Schuppner rike.schuppner at bccn-berlin.de
Thu Jan 25 11:16:39 UTC 2024


Hi,

I am getting the following error in the slurmctld logs whenever I run a few srun steps inside a batch job.

   Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: debug:  _send_timeout: Socket POLLERR: Connection reset by peer
   Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: error: slurm_send_node_msg: [socket:[921897]] slurm_bufs_sendto(msg_type=SRUN_STEP_SIGNAL) failed: Connection reset by peer
   Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: debug:  laying out the 1 tasks on 1 hosts compute2 dist 1

The Slurm version is 23.11.3, and an example sbatch file is:

   #!/bin/bash
   #SBATCH --job-name=slurm_test
   #SBATCH --mem=1gb
   #SBATCH --time=00:05:00
   #SBATCH --output=slurm_test_%j.log
   pwd; hostname; date

   srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" &
   srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" &

   wait

The more srun calls I have in the script (backgrounded with & or not), the more often the error shows up. Is there anything I can do to fix it?
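
To make the dependence on the number of srun calls easier to reproduce, the script above could be parameterized roughly as follows. This is only a sketch: NSTEPS and SRUN_CMD are hypothetical variables that are not in the script I actually ran, and I am assuming the loop behaves the same as the hand-written srun lines.

```shell
#!/bin/bash
#SBATCH --job-name=slurm_test
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --output=slurm_test_%j.log

# NSTEPS and SRUN_CMD are hypothetical knobs, not part of the original script.
NSTEPS="${NSTEPS:-2}"         # number of job steps to launch
SRUN_CMD="${SRUN_CMD:-srun}"  # can be overridden (e.g. with echo) for a dry run

# Launch n identical single-task steps in the background,
# then wait for all of them to finish.
launch_steps() {
    local n="$1" i
    for ((i = 1; i <= n; i++)); do
        "$SRUN_CMD" --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" &
    done
    wait
}

launch_steps "$NSTEPS"
```

The step count could then be varied at submission time, for example with `NSTEPS=8 sbatch slurm_test.sh` (the file name is made up), since sbatch passes the submitting environment to the job by default.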

Best
/rike



