<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hi all,</p>
<p>We are having issues submitting MPI jobs. There are also
intermittent periods when no Slurm commands can be issued. The major
concern is that jobs cannot be submitted to a particular set of 16
nodes that sit in their own partition (a single MPI job, for example,
not hundreds at once).<br>
</p>
<p>The slurmctld log contains numerous entries like the following:</p>
<ul>
<li>[2019-06-11T23:30:10.314] error: slurm_receive_msg
[10.141.0.131:44773]: Zero Bytes were transmitted or received</li>
<li>[2019-06-11T23:31:15.663] error:
_shutdown_backup_controller:send/recv: Connection refused<br>
</li>
</ul>
<p>The job's stdout/stderr contains:</p>
<p>srun: error: slurm_receive_msgs: Socket timed out on send/recv
operation<br>
srun: error: slurm_receive_msgs: Socket timed out on send/recv
operation<br>
srun: error: slurm_receive_msgs: Socket timed out on send/recv
operation<br>
srun: error: slurm_receive_msgs: Socket timed out on send/recv
operation<br>
srun: error: Task launch for 357052.0 failed on node pn066: Socket
timed out on send/recv operation<br>
srun: error: Task launch for 357052.0 failed on node pn065: Socket
timed out on send/recv operation<br>
srun: error: Task launch for 357052.0 failed on node pn061: Socket
timed out on send/recv operation<br>
srun: error: Task launch for 357052.0 failed on node pn062: Socket
timed out on send/recv operation<br>
srun: error: Application launch failed: Socket timed out on
send/recv operation<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.<br>
slurmstepd: error: *** STEP 357052.0 ON pn058 CANCELLED AT
2019-06-11T11:21:44 ***<br>
[mpiexec@pn058] control_cb (../../pm/pmiserv/pmiserv_cb.c:864):
connection to proxy 14 at host pn072 failed<br>
[mpiexec@pn058] HYDT_dmxu_poll_wait_for_event
(../../tools/demux/demux_poll.c:76): callback returned error
status<br>
[mpiexec@pn058] HYD_pmci_wait_for_completion
(../../pm/pmiserv/pmiserv_pmci.c:520): error waiting for event<br>
[mpiexec@pn058] main (../../ui/mpich/mpiexec.c:1149): process
manager error waiting for completion</p>
<p>NTP seems to be working (according to Bright Cluster Manager, at
least). There have not been any changes in our environment. This
issue appeared and has gotten worse over the last few days. <br>
</p>
<p>We are using SLURM 17.11.2. <br>
</p>
<p>Any input is appreciated. Thank you very much!<br>
</p>
<pre class="moz-signature" cols="72">--
Andrei Huang
High Performance Computing - IS
Woods Hole Oceanographic Institution
<a class="moz-txt-link-abbreviated" href="mailto:ahuang@whoi.edu">ahuang@whoi.edu</a>
</pre>
</body>
</html>