Hi,
I am getting the following error in the logs whenever I run a few srun jobs in a batch.
Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: debug: _send_timeout: Socket POLLERR: Connection reset by peer
Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: error: slurm_send_node_msg: [socket:[921897]] slurm_bufs_sendto(msg_type=SRUN_STEP_SIGNAL) failed: Connection reset by peer
Jan 25 11:24:03 slurmctl.XYZ slurmctld[272961]: slurmctld: debug: laying out the 1 tasks on 1 hosts compute2 dist 1
The Slurm version is 23.11.3, and an example sbatch file is:
#!/bin/bash
#SBATCH --job-name=slurm_test
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --output=slurm_test_%j.log
pwd; hostname; date
srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" &
srun --exclusive -c1 -N1 -n1 bash -c "hostname ; sleep 10" &
wait
The more sruns I have in the script (&-backgrounded or not), the more the error shows up. Is there anything I could do to fix this error?
Best /rike
Dear Rike
I'm seeing the same error on my own cluster, also running Slurm version 23.11.3. I also notice that my tasks are running in sequence, not in parallel. I'm using the example from the srun manual:
#!/bin/bash
srun -n1 sleep 30 &
srun -n1 sleep 45 &
srun -n1 sleep 20 &
srun -n1 sleep 25 &
wait
$ sbatch -n4 test.sh
I expected this to finish in about 45 seconds, since my server has 64 cores. But no, each task runs sequentially.
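To make that expectation concrete, here is a small sketch in plain bash arithmetic (no Slurm involved; the variable names are only for illustration). It compares the wall time if the four steps overlap, which is the longest sleep, with the wall time if they run one after another, which is the sum of the sleeps:

```shell
#!/bin/bash
# Sleep durations (seconds) of the four srun steps in test.sh
durations=(30 45 20 25)

parallel=0     # wall time if all steps overlap: the longest step
sequential=0   # wall time if steps run back to back: the sum

for d in "${durations[@]}"; do
    if (( d > parallel )); then
        parallel=$d
    fi
    sequential=$(( sequential + d ))
done

echo "parallel: ${parallel}s, sequential: ${sequential}s"
# prints "parallel: 45s, sequential: 120s"
```

So a job that takes roughly two minutes instead of roughly 45 seconds would be consistent with the steps being serialized.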
I hope someone can help us.
Regards