Did you compile Slurm with MPI support?

Your MPI libraries should be the same version Slurm was built against, and they should be available in the same locations on all nodes. Also, make sure they are accessible (PATH, LD_LIBRARY_PATH, etc. are set).
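
A quick way to check both (a sketch; the node names are from your cluster, and the ldconfig check assumes PMIx was installed system-wide rather than under a custom prefix):

    # List the MPI plugin types this Slurm build supports;
    # pmix (and e.g. pmix_v4) should appear:
    srun --mpi=list

    # Confirm the configured default MPI plugin:
    scontrol show config | grep MpiDefault

    # Verify that each node resolves the same PMIx library:
    ssh node9  'ldconfig -p | grep -i pmix'
    ssh node10 'ldconfig -p | grep -i pmix'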

Brian Andrus
    <div class="moz-cite-prefix">On 2/4/2021 1:20 PM, Andrej Prsa wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:1776eeb4768.27aa.0a1e9d4977c8c9bdd3fef0d06331a96a@gmail.com">
      <meta http-equiv="content-type" content="text/html;
        charset=windows-1252">
      <div dir="auto">
        <div dir="auto">Gentle bump on this, if anyone has suggestions
          as I weed through the scattered slurm docs. :) </div>
        <div dir="auto"><br>
        </div>
        <div dir="auto">Thanks, </div>
        <div dir="auto">Andrej</div>
        <div dir="auto"><br>
        </div>
        <div id="aqm-original" style="color: black;">
          <div dir="auto">On February 2, 2021 00:14:37 Andrej Prsa
            <a class="moz-txt-link-rfc2396E" href="mailto:aprsa09@gmail.com"><aprsa09@gmail.com></a> wrote:</div>
          <div><br>
          </div>
          <blockquote type="cite" class="gmail_quote" style="margin: 0 0
            0 0.75ex; border-left: 1px solid #808080; padding-left:
            0.75ex;">
            <div dir="auto">Dear list,</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">I'm struggling with what seems to be very
              similar to this thread:</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto"><a class="moz-txt-link-freetext" href="https://lists.schedmd.com/pipermail/slurm-users/2019-July/003746.html">https://lists.schedmd.com/pipermail/slurm-users/2019-July/003746.html</a></div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">I'm using slurm 20.11.3 patched with this
              fix to detect pmixv4:</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">    
              <a class="moz-txt-link-freetext" href="https://bugs.schedmd.com/show_bug.cgi?id=10683">https://bugs.schedmd.com/show_bug.cgi?id=10683</a></div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">and this is what I'm seeing:</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">andrej@terra:~$ salloc -N 2 -n 2</div>
            <div dir="auto">salloc: Granted job allocation 841</div>
            <div dir="auto">andrej@terra:~$ srun hostname</div>
            <div dir="auto">srun: launch/slurm: launch_p_step_launch:
              StepId=841.0 aborted before </div>
            <div dir="auto">step completely launched.</div>
            <div dir="auto">srun: Job step aborted: Waiting up to 32
              seconds for job step to finish.</div>
            <div dir="auto">srun: error: task 0 launch failed:
              Unspecified error</div>
            <div dir="auto">srun: error: task 1 launch failed:
              Unspecified error</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">In slurmctld.log I have this:</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">[2021-02-01T23:58:13.683] sched:
              _slurm_rpc_allocate_resources JobId=841 </div>
            <div dir="auto">NodeList=node[9-10] usec=572</div>
            <div dir="auto">[2021-02-01T23:58:19.817] error:
              mpi_hook_slurmstepd_prefork failure for </div>
            <div dir="auto">0x557e7480bcb0s on node9</div>
            <div dir="auto">[2021-02-01T23:58:19.829] error:
              mpi_hook_slurmstepd_prefork failure for </div>
            <div dir="auto">0x55f568e00cb0s on node10</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">and in slurmd.log I have this for node9:</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">[2021-02-01T23:58:19.788] launch task
              StepId=841.0 request from UID:1000 </div>
            <div dir="auto">GID:1000 HOST:192.168.1.1 PORT:35508</div>
            <div dir="auto">[2021-02-01T23:58:19.789] task/affinity:
              lllp_distribution: JobId=841 </div>
            <div dir="auto">implicit auto binding: cores, dist 1</div>
            <div dir="auto">[2021-02-01T23:58:19.789] task/affinity:
              _task_layout_lllp_cyclic: </div>
            <div dir="auto">_task_layout_lllp_cyclic</div>
            <div dir="auto">[2021-02-01T23:58:19.789] task/affinity:
              _lllp_generate_cpu_bind: </div>
            <div dir="auto">_lllp_generate_cpu_bind jobid [841]:
              mask_cpu, 0x000000000001000000000001</div>
            <div dir="auto">[2021-02-01T23:58:19.814] [841.0] error:
              node9 [0] pmixp_utils.c:108 </div>
            <div dir="auto">[pmixp_usock_create_srv] mpi/pmix: ERROR:
              Cannot bind() UNIX socket </div>
            <div dir="auto">/var/spool/slurmd/stepd.slurm.pmix.841.0:
              Address already in use (98)</div>
            <div dir="auto">[2021-02-01T23:58:19.814] [841.0] error:
              node9 [0] pmixp_server.c:387 </div>
            <div dir="auto">[pmixp_stepd_init] mpi/pmix: ERROR:
              pmixp_usock_create_srv</div>
            <div dir="auto">[2021-02-01T23:58:19.814] [841.0] error:
              (null) [0] mpi_pmix.c:169 </div>
            <div dir="auto">[p_mpi_hook_slurmstepd_prefork] mpi/pmix:
              ERROR: pmixp_stepd_init() failed</div>
            <div dir="auto">[2021-02-01T23:58:19.817] [841.0] error:
              Failed mpi_hook_slurmstepd_prefork</div>
            <div dir="auto">[2021-02-01T23:58:19.845] [841.0] error:
              job_manager exiting abnormally, </div>
            <div dir="auto">rc = -1</div>
            <div dir="auto">[2021-02-01T23:58:19.892] [841.0] done with
              job</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">and this for node10:</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">[2021-02-01T23:58:19.788] launch task
              StepId=841.0 request from UID:1000 </div>
            <div dir="auto">GID:1000 HOST:192.168.1.1 PORT:38918</div>
            <div dir="auto">[2021-02-01T23:58:19.789] task/affinity:
              lllp_distribution: JobId=841 </div>
            <div dir="auto">implicit auto binding: cores, dist 1</div>
            <div dir="auto">[2021-02-01T23:58:19.789] task/affinity:
              _task_layout_lllp_cyclic: </div>
            <div dir="auto">_task_layout_lllp_cyclic</div>
            <div dir="auto">[2021-02-01T23:58:19.789] task/affinity:
              _lllp_generate_cpu_bind: </div>
            <div dir="auto">_lllp_generate_cpu_bind jobid [841]:
              mask_cpu, 0x000000000001000000000001</div>
            <div dir="auto">[2021-02-01T23:58:19.825] [841.0] error:
              node10 [1] </div>
            <div dir="auto">pmixp_client_v2.c:246 [pmixp_lib_init]
              mpi/pmix: ERROR: PMIx_server_init </div>
            <div dir="auto">failed with error -2</div>
            <div dir="auto">: Success (0)</div>
            <div dir="auto">[2021-02-01T23:58:19.826] [841.0] error:
              node10 [1] pmixp_client.c:518 </div>
            <div dir="auto">[pmixp_libpmix_init] mpi/pmix: ERROR:
              PMIx_server_init failed with error -1</div>
            <div dir="auto">: Success (0)</div>
            <div dir="auto">[2021-02-01T23:58:19.826] [841.0] error:
              node10 [1] pmixp_server.c:423 </div>
            <div dir="auto">[pmixp_stepd_init] mpi/pmix: ERROR:
              pmixp_libpmix_init() failed</div>
            <div dir="auto">[2021-02-01T23:58:19.826] [841.0] error:
              (null) [1] mpi_pmix.c:169 </div>
            <div dir="auto">[p_mpi_hook_slurmstepd_prefork] mpi/pmix:
              ERROR: pmixp_stepd_init() failed</div>
            <div dir="auto">[2021-02-01T23:58:19.829] [841.0] error:
              Failed mpi_hook_slurmstepd_prefork</div>
            <div dir="auto">[2021-02-01T23:58:19.853] [841.0] error:
              job_manager exiting abnormally, </div>
            <div dir="auto">rc = -1</div>
            <div dir="auto">[2021-02-01T23:58:19.899] [841.0] done with
              job</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">It seems that the culprit is the bind()
              failure, but I can't make much </div>
            <div dir="auto">sense of it. I checked that /etc/hosts has
              everything correct and </div>
            <div dir="auto">consistent with the info in slurm.conf.</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">Other potentially relevant info: all compute
              nodes are diskless, they </div>
            <div dir="auto">are pxe-booted from a NAS image and running
              ubuntu server 20.04. Running </div>
            <div dir="auto">jobs on a single node is fine.</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">Thanks for any insight and suggestions.</div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">Cheers,</div>
            <div dir="auto">Andrej</div>
          </blockquote>
        </div>
        <div dir="auto"><br>
        </div>
      </div>
    </blockquote>
  </body>
</html>