[slurm-users] send/recv timeout and zero bytes transmitted errors
Andrei Huang
ahuang at whoi.edu
Wed Jun 12 05:56:20 UTC 2019
Hi all,
We are having issues submitting MPI jobs. There are also intermittent issues where no Slurm commands can be issued at all. The major concern is that jobs cannot be submitted to a particular set of 16 nodes in their own partition (a single MPI job, for example, not hundreds at once).
There are numerous entries like the following in the slurmctld log:
* [2019-06-11T23:30:10.314] error: slurm_receive_msg [10.141.0.131:44773]: Zero Bytes were transmitted or received
* [2019-06-11T23:31:15.663] error: _shutdown_backup_controller:send/recv: Connection refused
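In case it helps anyone reading, here is a rough sketch of the controller-side checks we can look at (node names are just examples from the logs below, and the grep pattern only targets the standard slurm.conf parameter names, not values we have confirmed locally):

    scontrol ping                                          # do the primary/backup slurmctld respond?
    scontrol show config | grep -i -E 'Timeout|Backup'     # MessageTimeout, SlurmctldTimeout, BackupController
    sinfo -R                                               # any nodes down/drained, and their reasons
    scontrol show node pn066 | grep -i -E 'State|Reason'   # example node taken from the failures below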
The job's stdout/stderr shows:
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 357052.0 failed on node pn066: Socket timed out on send/recv operation
srun: error: Task launch for 357052.0 failed on node pn065: Socket timed out on send/recv operation
srun: error: Task launch for 357052.0 failed on node pn061: Socket timed out on send/recv operation
srun: error: Task launch for 357052.0 failed on node pn062: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 357052.0 ON pn058 CANCELLED AT 2019-06-11T11:21:44 ***
[mpiexec@pn058] control_cb (../../pm/pmiserv/pmiserv_cb.c:864): connection to proxy 14 at host pn072 failed
[mpiexec@pn058] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@pn058] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:520): error waiting for event
[mpiexec@pn058] main (../../ui/mpich/mpiexec.c:1149): process manager error waiting for completion
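For completeness, a sketch of node-side checks on the affected partition (assuming pdsh is available, which Bright normally provides, that pn[061-072] is the right node range, and that the slurmd log is wherever SlurmdLogFile points):

    pdsh -w pn[061-072] 'systemctl is-active slurmd'   # is slurmd running on the nodes that failed task launch?
    pdsh -w pn[061-072] 'scontrol ping'                # can each compute node reach slurmctld?
    # on one failing node, e.g. pn066:
    grep -i error /var/log/slurmd.log | tail           # look around the 11:21 timestamps above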
NTP seems to be working (according to Bright Cluster Manager, at least). There have been no changes to our environment. This issue appeared and has gotten worse over the last few days.
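A quick clock-sync spot check from the shell (again only a sketch, assuming chrony or ntpd plus pdsh, and that pn[058-072] covers the partition):

    pdsh -w pn[058-072] 'date +%s.%N' | sort                        # timestamps should agree to well under a second
    pdsh -w pn[058-072] 'chronyc tracking 2>/dev/null || ntpq -p'   # per-node sync status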
We are using SLURM 17.11.2.
Any input is appreciated. Thank you very much!
--
Andrei Huang
High Performance Computing - IS
Woods Hole Oceanographic Institution
ahuang at whoi.edu