[slurm-users] send/recv timeout and zero bytes transmitted errors

Andrei Huang ahuang at whoi.edu
Wed Jun 12 05:56:20 UTC 2019


Hi all,

We are having issues submitting MPI jobs. There are also intermittent 
periods where no Slurm commands can be issued at all. The main concern is 
that jobs cannot be submitted to a particular set of 16 nodes in their own 
partition (a single MPI job, for example, not hundreds at once).

The slurmctld log contains numerous entries like the following.

  * [2019-06-11T23:30:10.314] error: slurm_receive_msg [10.141.0.131:44773]: Zero Bytes were transmitted or received
  * [2019-06-11T23:31:15.663] error: _shutdown_backup_controller:send/recv: Connection refused
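
In case it helps narrow things down, the checks we can run against the 
controller from one of the affected nodes look roughly like this (just a 
sketch; "controller" and 6817 stand in for whatever ControlMachine and 
SlurmctldPort are set to in our slurm.conf):

  # From an affected compute node (e.g. pn066): does slurmctld answer at all?
  scontrol ping
  # Confirm which controller/port/timeout the node thinks it should use.
  scontrol show config | grep -Ei 'ControlMachine|SlurmctldPort|MessageTimeout'
  # Plain TCP check of the controller port (placeholder host/port, see above).
  nc -zv controller 6817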

The job's stdout/stderr shows:

srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 357052.0 failed on node pn066: Socket timed 
out on send/recv operation
srun: error: Task launch for 357052.0 failed on node pn065: Socket timed 
out on send/recv operation
srun: error: Task launch for 357052.0 failed on node pn061: Socket timed 
out on send/recv operation
srun: error: Task launch for 357052.0 failed on node pn062: Socket timed 
out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv 
operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 357052.0 ON pn058 CANCELLED AT 
2019-06-11T11:21:44 ***
[mpiexec at pn058] control_cb (../../pm/pmiserv/pmiserv_cb.c:864): 
connection to proxy 14 at host pn072 failed
[mpiexec at pn058] HYDT_dmxu_poll_wait_for_event 
(../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at pn058] HYD_pmci_wait_for_completion 
(../../pm/pmiserv/pmiserv_pmci.c:520): error waiting for event
[mpiexec at pn058] main (../../ui/mpich/mpiexec.c:1149): process manager 
error waiting for completion

NTP appears to be working (according to Bright Cluster Manager, at 
least). There have been no changes to our environment. The issue 
appeared and has gotten worse over the last few days.
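
So far the only clock check has been through Bright; a more direct 
comparison across the partition would be something like the following (a 
sketch only, assuming pdsh is installed and pn[058-072] is the right node 
range for this partition):

  # Compare wall-clock time across the suspect nodes in one pass.
  pdsh -w pn[058-072] 'date +%s.%N' | sort
  # Check whether slurmd is running on each of them.
  pdsh -w pn[058-072] 'systemctl is-active slurmd'
  # Any nodes marked down/drained, and why.
  sinfo -R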

We are using SLURM 17.11.2.

Any input is appreciated. Thank you very much!


-- 
Andrei Huang

High Performance Computing - IS
Woods Hole Oceanographic Institution
ahuang at whoi.edu
