<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;
      charset=windows-1252">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>Hi,</p>
    <p>Does a regular MPI program run on two nodes? For example
      helloworld:</p>
    <p><a class="moz-txt-link-freetext" href="https://people.sc.fsu.edu/~jburkardt/c_src/hello_mpi/hello_mpi.c">https://people.sc.fsu.edu/~jburkardt/c_src/hello_mpi/hello_mpi.c</a><br>
    </p>
    <p><a class="moz-txt-link-freetext" href="https://people.sc.fsu.edu/~jburkardt/py_src/hello_mpi/hello_mpi.py">https://people.sc.fsu.edu/~jburkardt/py_src/hello_mpi/hello_mpi.py</a><br>
    </p>
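    <p>A minimal sketch of how such a test might be launched under
      Slurm (assuming mpicc and mpi4py are available on the nodes, and
      using the pmix plugin discussed below; adjust to your site):</p>
    <pre># Build the C hello world from the first link (compiler wrapper may differ per site)
mpicc -o hello_mpi hello_mpi.c

# Launch one task per node across two nodes
srun -N2 -n2 --mpi=pmix ./hello_mpi

# Or the Python version from the second link, if mpi4py is installed on both nodes
srun -N2 -n2 --mpi=pmix python3 hello_mpi.py</pre>
    <p>If plain hostname runs work on both nodes but the MPI programs
      fail, the problem is more likely in the MPI/PMIx setup than in
      basic node communication.</p>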
    <p>Benson<br>
    </p>
    <div class="moz-cite-prefix">On 7/16/19 4:30 PM, Pär Lundö wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:8972685e424e478bbf460e64e82d3407@foi.se">
      <meta http-equiv="Content-Type" content="text/html;
        charset=windows-1252">
      <style type="text/css">html { -webkit-text-size-adjust: none; -webkit-touch-callout: none;}</style>
      <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Hi,
      </div>
      <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Thank
        you for your quick answer! </div>
      <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">I’ll
        look into that, but they share the same hosts file and the
        DHCP server sets their hostnames.
      </div>
      <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;"><br>
      </div>
      <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">However
        I came across a setting in the slurm.conf-file ”Tmpfs” and there
        were a note regarding it in the guide of mpi at the slurms
        webpage. I implemented the proposed changes but still no luck.
      </div>
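      <p>(For reference, the setting in question is a single line in
        slurm.conf; the value below is only the documented default,
        shown as an illustration, not necessarily what the guide
        proposes:)</p>
      <pre># slurm.conf: TmpFS names the node-local filesystem used for temporary storage
# (/tmp is the default; shown here only as an illustration)
TmpFS=/tmp</pre>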
      <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;"><br>
      </div>
      <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Best
        regards, </div>
      <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Palle
        <br>
        <br>
        <hr>
        <b>From:</b> "slurm-users"
        <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users-bounces@lists.schedmd.com"><slurm-users-bounces@lists.schedmd.com></a> <br>
        <b>Sent:</b> 16 July 2019 12:32 <br>
        <b>To:</b> "Slurm User Community List"
        <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com"><slurm-users@lists.schedmd.com></a> <br>
        <b>Subject:</b> Re: [slurm-users] Running pyMPI on several nodes
        <br>
        <br>
      </div>
      <div dir="ltr">
        <div>srun: error: Application launch failed: Invalid node name
          specified <br>
        </div>
        <div><br>
        </div>
        <div>Hearns' Law: all batch system problems are DNS problems.</div>
        <div><br>
        </div>
        <div>Seriously though - check your name resolution on both
          the head node and the compute nodes.
        </div>
        <div><br>
        </div>
      </div>
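      <p>(As a sketch, that check could look something like this on the
        head node and on each compute node; the node names are the ones
        from this thread:)</p>
      <pre># Does each node agree on its own short hostname?
hostname -s

# Do both node names resolve consistently everywhere?
getent hosts lxclient10 lxclient11

# Does Slurm's record of the node match the real hostname?
scontrol show node lxclient10 | grep -i NodeHostName</pre>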
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Tue, 16 Jul 2019 at 08:49,
          Pär Lundö < <a href="mailto:par.lundo@foi.se"
            moz-do-not-send="true">
            par.lundo@foi.se</a>> wrote: <br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div bgcolor="#FFFFFF">
            <p>Hi,</p>
            <p>I have now had the time to look at some of your
              suggestions.</p>
            <p>First I tried running "srun -N1 hostname" via an
              sbatch script, while having two nodes up and running.
              <br>
              "sinfo" shows that both nodes are up and idle before
              the sbatch script is submitted.<br>
              After submitting the job, I receive this error:</p>
            <p>"srun: error: Task launch for 86.0 failed on node
              lxclient11: Invalid node name specified.<br>
              srun: error: Application launch failed: Invalid node name
              specified<br>
              srun: Job step aborted: Waiting up to 32 seconds for job
              step to finish.<br>
              srun: error: Timed out waiting for job step to complete"</p>
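            <p>(For reference, the sbatch script was essentially of
              this shape; the directives below are a reconstruction,
              not the original file:)</p>
            <pre>#!/bin/bash
#SBATCH --nodes=1          # node count here is an assumption
#SBATCH --job-name=hosttest

srun -N1 hostname          # the test command mentioned above</pre>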
            <p><br>
            </p>
            <p>From the log file at the client I get a more detailed
              error:<br>
              " Launching batch job 86 for UID 1000 <br>
              [86.batch] error: Invalid host_index -1 for job 86<br>
              [86.batch] error: Host lxclient10 not in hostlist
              lxclient11<br>
              [86.batch] task_pre_launch: Using sched_affinity for tasks<br>
              rpc_launch_tasks: Invalid node list (lxclient10 not in
              lxclient11)"<br>
            </p>
            <p>My two nodes are called lxclient10 and lxclient11.<br>
              Why is my batch job launched with UID 1000? Shouldn't it
              be launched by the slurm user (which in my case has UID
              64030)?
              <br>
              What does it mean that a node is not in the node list?
              <br>
              The two nodes and the server share the same set of IP
              addresses in the "/etc/hosts" file.</p>
            <p>-> This was resolved: lxclient10 had been marked as
              down. After bringing it back up, submitting the same
              sbatch script produced no error.<br>
              However, running it on two nodes I get an error: <br>
              "srun: error: Job Step 88.0 aborted before step completely
              launched.<br>
              srun: error: Job step aborted: Waiting up to 32 seconds
              for job step to finish.<br>
              srun: error: task 1 launch failed: Unspecified error<br>
              srun: error: lxclient10: task 0: Killed"</p>
            <p>And in the slurmctld.log file on the client I get an
              error similar to the one previously stated: the pmix
              plugin cannot bind the UNIX socket
              /var/spool/slurmd/stepd.slurm.pmix.88.0: Address already
              in use (98)</p>
            <p>I ran the lsof command, but I don't really know what I
              am looking for. If I grep for the different node names, I
              can see that the two nodes have mounted the NFS partition
              and that a link is established.</p>
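            <p>(A more targeted look at that socket path could be
              something like this; /var/spool/slurmd is the directory
              from the error above:)</p>
            <pre># Is there a stale or duplicate pmix socket file left in the spool directory?
ls -l /var/spool/slurmd/ | grep pmix

# Which processes, if any, currently hold unix sockets with "pmix" in the path?
ss -xlp | grep pmix

# Everything open under the spool directory
lsof +D /var/spool/slurmd</pre>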
            <p>"As an aside, you have checked that your username exists
              on that compue server?      getent passwd par<br>
              Also that your home directory is mounted - or something
              substituting for your home directory?"<br>
              Yes, the user slurm exists on both nodes and have the same
              uid.<br>
              <br>
              "Have you tried <br>
              <br>
              <br>
                      srun -N# -n# mpirun python3 .... <br>
              <br>
              <br>
              Perhaps you have no MPI environment being set up for the
              processes?  There was no "--mpi" flag in your "srun"
              command and we don't know if you have a default value for
              that or not."
              <br>
            </p>
            <p>In my slurm.conf file I do specify "MpiDefault=pmix".
              (And it can be seen in the log file that something is
              wrong with pmix: the address is already in use.)<br>
            </p>
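            <p>(One quick sanity check, as a sketch: confirm which MPI
              plugin types the installed Slurm supports and what the
              configured default actually is:)</p>
            <pre># List the MPI plugin types this Slurm build supports (pmix should appear here)
srun --mpi=list

# Confirm the default that slurm.conf actually applies
scontrol show config | grep -i MpiDefault</pre>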
            <p>One thing that struck me now is that I run these nodes
              as a pair of diskless nodes, which boot and mount the
              same filesystem supplied by a server. They run different
              PIDs for different processes, which should not affect one
              another(?), right?</p>
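            <p>(One way to test that suspicion, as a sketch: check on
              both nodes whether the directory the pmix socket is
              created in is node-local or part of the shared root:)</p>
            <pre># Run on both nodes at once; -l prefixes each output line with the task number
srun -N2 -l findmnt -T /var/spool/slurmd
srun -N2 -l df -hT /var/spool/slurmd</pre>
            <p>(If both nodes report the same NFS export for that path,
              the two slurmd daemons would be creating their pmix
              sockets in the same physical directory, which could
              explain the "Address already in use" error.)</p>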
            <p><br>
            </p>
            <p>Best regards,</p>
            <p>Palle<br>
            </p>
            <div class="gmail-m_5257842815186023450moz-cite-prefix">On
              2019-07-12 19:34, Pär Lundö wrote:
              <br>
            </div>
            <blockquote>
              <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Hi,
              </div>
              <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black"><br>
              </div>
              <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Thank
                you so much for your quick responses! </div>
              <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">It
                is much appreciated. </div>
              <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">I
                don't have access to the cluster until next week, but
                I’ll be sure to follow up on all of your suggestions and
                get back to you next week.
              </div>
              <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black"><br>
              </div>
              <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Have
                a nice weekend! </div>
              <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Best
                regards </div>
              <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Palle
                <br>
                <br>
                <hr>
                <b>From:</b> "slurm-users" <a
                  class="gmail-m_5257842815186023450moz-txt-link-rfc2396E"
                  href="mailto:slurm-users-bounces@lists.schedmd.com"
                  target="_blank" moz-do-not-send="true">
                  <slurm-users-bounces@lists.schedmd.com></a> <br>
                <b>Sent:</b> 12 July 2019 17:37 <br>
                <b>To:</b> "Slurm User Community List" <a
                  class="gmail-m_5257842815186023450moz-txt-link-rfc2396E"
                  href="mailto:slurm-users@lists.schedmd.com"
                  target="_blank" moz-do-not-send="true">
                  <slurm-users@lists.schedmd.com></a> <br>
                <b>Subject:</b> Re: [slurm-users] Running pyMPI on
                several nodes <br>
                <br>
              </div>
              <div dir="ltr">
                <div>Par, by 'poking around' Chris means to use tools
                  such as netstat and lsof. </div>
                <div>Also I would look at ps -eaf --forest to make sure
                  there are no 'orphaned' jobs sitting on that compute
                  node.
                </div>
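                <p>(For example, something along these lines on the
                  affected compute node:)</p>
                <pre># Any orphaned processes or left-over job steps?
ps -eaf --forest

# Listening unix sockets and the programs holding them
netstat -xlp | grep -i slurm</pre>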
                <div><br>
                </div>
                <div>Having said that though, I have a dim memory of a
                  classic PBSPro error message which says something
                  about a network connection,
                </div>
                <div>but really means that you cannot open a remote
                  session on that compute server.
                </div>
                <div><br>
                </div>
                <div>As an aside, you have checked that your username
                  exists on that compute server?      getent passwd par
                    
                </div>
                <div>Also that your home directory is mounted - or
                  something substituting for your home directory?
                </div>
                <div><br>
                </div>
              </div>
              <br>
              <div class="gmail_quote">
                <div class="gmail_attr" dir="ltr">On Fri, 12 Jul 2019 at
                  15:55, Chris Samuel < <a
                    href="mailto:chris@csamuel.org" target="_blank"
                    moz-do-not-send="true">
                    chris@csamuel.org</a>> wrote: <br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px
                  0px 0.8ex;padding-left:1ex;border-left:1px solid
                  rgb(204,204,204)">
                  On 12/7/19 7:39 am, Pär Lundö wrote: <br>
                  <br>
                  > Presumably, the first 8 tasks originates from the
                  first node (in this <br>
                  > case the lxclient11), and the other node
                  (lxclient10) response as <br>
                  > predicted. <br>
                  <br>
                  That looks right, it seems the other node has two
                  processes fighting <br>
                  over the same socket and that's breaking Slurm there.
                  <br>
                  <br>
                  > Is it neccessary to have passwordless ssh
                  communication alongside the <br>
                  > munge authentication? <br>
                  <br>
                  No, srun doesn't need (or use) that at all. <br>
                  <br>
                  > In addition I checked the slurmctld-log from both
                  the server and client <br>
                  > and found something (noted in bold): <br>
                  <br>
                  This is from the slurmd log on the client from the
                  look of it. <br>
                  <br>
                  > *[2019-07-12T14:57:53.771][83.0]
                  task_p_pre_launch: Using sched affinity <br>
                  > for tasks lurm.pmix.83.0: Address already in
                  use[98]* <br>
                  > [2019-07-12T14:57:53.682][83.0] error:
                  lxclient[0] /pmix.server.c:386 <br>
                  > [pmix_stepd_init] mpi/pmix: ERROR:
                  pmixp_usock_create_srv <br>
                  > [2019-07-12T14:57:53.683][83.0] error: (null) [0]
                  /mpi_pmix:156 <br>
                  > [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR:
                  pmixp_stepd_init() failed <br>
                  <br>
                  That indicates that something else has grabbed the
                  socket it wants and <br>
                  that's why the setup of the MPI ranks on the second
                  node fails. <br>
                  <br>
                  You'll want to poke around there to see what's using
                  it. <br>
                  <br>
                  Best of luck! <br>
                  Chris <br>
                  -- <br>
                    Chris Samuel  :  <a href="http://www.csamuel.org/"
                    rel="noreferrer" target="_blank"
                    moz-do-not-send="true">
                    http://www.csamuel.org/</a>  :  Berkeley, CA, USA <br>
                  <br>
                </blockquote>
              </div>
            </blockquote>
            <pre class="gmail-m_5257842815186023450moz-signature">-- 
Hälsningar, Pär
________________________________
Pär Lundö 
Forskare
Avdelningen för Ledningssystem 

FOI
Totalförsvarets forskningsinstitut
164 90 Stockholm 

Besöksadress:
Olau Magnus väg 33, Linköping


Tel: <a dir="ltr" href="tel:+46%2013%2037%2086%2001" x-apple-data-detectors="true" x-apple-data-detectors-type="telephone" x-apple-data-detectors-result="13" moz-do-not-send="true">+46 13 37 86 01</a>
Mob: <a dir="ltr" href="tel:+46%20734%20447%20815" x-apple-data-detectors="true" x-apple-data-detectors-type="telephone" x-apple-data-detectors-result="14/0" moz-do-not-send="true">+46 734 447 815</a>
Vxl: <a dir="ltr" href="tel:+46%2013%2037%2080%2000" x-apple-data-detectors="true" x-apple-data-detectors-type="telephone" x-apple-data-detectors-result="14/1" moz-do-not-send="true">+46 13 37 80 00</a>
<a class="gmail-m_5257842815186023450moz-txt-link-abbreviated" href="mailto:par.lundo@foi.se" target="_blank" moz-do-not-send="true">par.lundo@foi.se</a>
<a class="gmail-m_5257842815186023450moz-txt-link-abbreviated" href="http://www.foi.se" target="_blank" moz-do-not-send="true">www.foi.se</a></pre>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>