<html>

  <head>


    <meta http-equiv="content-type" content="text/html; charset=UTF-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>Hi all,</p>

    <p>We are having issues submitting MPI jobs. There are also
      intermittent periods where no Slurm commands can be issued at all.
      The major concern is that jobs cannot be submitted to a particular
      set of 16 nodes that sit in their own partition (this happens with
      a single MPI job, not hundreds at once).<br>

    </p>

    <p>There are numerous entries like the following in the slurmctld log:</p>

    <ul>

      <li>[2019-06-11T23:30:10.314] error: slurm_receive_msg

        [10.141.0.131:44773]: Zero Bytes were transmitted or received</li>

      <li>[2019-06-11T23:31:15.663] error:

        _shutdown_backup_controller:send/recv: Connection refused<br>

      </li>

    </ul>

    <p>The job's stdout/stderr contains:</p>

    <p>srun: error: slurm_receive_msgs: Socket timed out on send/recv

      operation<br>

      srun: error: slurm_receive_msgs: Socket timed out on send/recv

      operation<br>

      srun: error: slurm_receive_msgs: Socket timed out on send/recv

      operation<br>

      srun: error: slurm_receive_msgs: Socket timed out on send/recv

      operation<br>

      srun: error: Task launch for 357052.0 failed on node pn066: Socket

      timed out on send/recv operation<br>

      srun: error: Task launch for 357052.0 failed on node pn065: Socket

      timed out on send/recv operation<br>

      srun: error: Task launch for 357052.0 failed on node pn061: Socket

      timed out on send/recv operation<br>

      srun: error: Task launch for 357052.0 failed on node pn062: Socket

      timed out on send/recv operation<br>

      srun: error: Application launch failed: Socket timed out on

      send/recv operation<br>

      srun: Job step aborted: Waiting up to 32 seconds for job step to

      finish.<br>

      slurmstepd: error: *** STEP 357052.0 ON pn058 CANCELLED AT

      2019-06-11T11:21:44 ***<br>

      [mpiexec@pn058] control_cb (../../pm/pmiserv/pmiserv_cb.c:864):

      connection to proxy 14 at host pn072 failed<br>

      [mpiexec@pn058] HYDT_dmxu_poll_wait_for_event

      (../../tools/demux/demux_poll.c:76): callback returned error

      status<br>

      [mpiexec@pn058] HYD_pmci_wait_for_completion

      (../../pm/pmiserv/pmiserv_pmci.c:520): error waiting for event<br>

      [mpiexec@pn058] main (../../ui/mpich/mpiexec.c:1149): process

      manager error waiting for completion</p>

    <p>NTP seems to be working (according to Bright Cluster Manager, at
      least). There have not been any changes in our environment. This
      issue appeared and has gotten worse over the last few days.<br>

    </p>

    <p>We are using Slurm 17.11.2.<br>

    </p>

    <p>Any input is appreciated. Thank you very much!<br>

    </p>


    <pre class="moz-signature" cols="72">-- 

Andrei Huang


High Performance Computing - IS

Woods Hole Oceanographic Institution

<a class="moz-txt-link-abbreviated" href="mailto:ahuang@whoi.edu">ahuang@whoi.edu</a>

</pre>

  </body>

</html>