[slurm-users] slurmctld: slurm_bufs_sendto(msg_type=SRUN_STEP_SIGNAL) failed: Connection reset by peer
Rike-Benjamin Schuppner
rike.schuppner at bccn-berlin.de
Thu Jan 25 11:16:39 UTC 2024
Hi,
I get the following error in the slurmctld logs whenever I run several srun steps inside a batch job:
Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: debug: _send_timeout: Socket POLLERR: Connection reset by peer
Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: error: slurm_send_node_msg: [socket:[921897]] slurm_bufs_sendto(msg_type=SRUN_STEP_SIGNAL) failed: Connection reset by peer
Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: debug: laying out the 1 tasks on 1 hosts compute2 dist 1
The Slurm version is 23.11.3, and an example sbatch file is:
#!/bin/bash
#SBATCH --job-name=slurm_test
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --output=slurm_test_%j.log
pwd; hostname; date
srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" &
srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" &
wait
The more sruns I have in the script (backgrounded with & or not), the more often the error shows up. Is there anything I can do to fix it?
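To check how the error count scales with the number of sruns, a small generator along these lines could be used. This is only a sketch: the function name make_repro_script and its step-count argument are made up for illustration and are not part of Slurm. The #SBATCH header and the srun line are copied unchanged from the example above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): print an sbatch script that
# contains N backgrounded srun steps. N defaults to 2, as in the example.
make_repro_script() {
    local n="${1:-2}"
    local i

    # Header copied verbatim from the example sbatch file.
    cat <<'EOF'
#!/bin/bash
#SBATCH --job-name=slurm_test
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --output=slurm_test_%j.log
pwd; hostname; date
EOF

    # Emit one backgrounded srun step per iteration.
    for ((i = 0; i < n; i++)); do
        echo 'srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" &'
    done

    # Wait for all backgrounded steps before the batch script exits.
    echo 'wait'
}
```

For example, `make_repro_script 8 > slurm_test_8.sh` followed by `sbatch slurm_test_8.sh` would submit a job with eight steps, so the number of logged errors can be compared across step counts.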
Best
/rike