<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;
      charset=windows-1252">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>Hi Pälle!<br>
    </p>
    <p>Great. It would be helpful to know how the nodes shared the etc
      directory. Was it over NFS?</p>
    <p>Benson<br>
    </p>
    <div class="moz-cite-prefix">On 8/13/19 9:25 AM, Pär Lundö wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:9c36714a-03c9-64eb-6af2-63ad2c55eae8@foi.se">
      <meta http-equiv="Content-Type" content="text/html;
        charset=windows-1252">
      <p>Hi!</p>
      <p>I have now had the chance to look into this matter more
        thoroughly, and it seems the problem was caused by the nodes
        being diskless and sharing some data (e.g. the
        "etc" directory). I removed that dependency and gave each node
        its own unique set of directories to mount, which resolved the
        issue. Presumably this can be done in other ways unknown to me,
        but it helped me and I can now run MPI jobs across multiple
        nodes.</p>
      <p>Thank you for your help!</p>
      <p>Best regards,<br>
        Pälle L<br>
      </p>
      <div class="moz-cite-prefix">On 2019-07-16 15:49, Benson Muite
        wrote:<br>
      </div>
      <blockquote type="cite"
        cite="mid:6c81032b-5a45-3565-1df8-c9fc6a9192f2@emailplus.org">
        <meta http-equiv="Content-Type" content="text/html;
          charset=windows-1252">
        <p>Hi,</p>
        <p>Does a regular MPI program run on two nodes? For example
          helloworld:</p>
        <p><a class="moz-txt-link-freetext"
            href="https://people.sc.fsu.edu/~jburkardt/c_src/hello_mpi/hello_mpi.c"
            moz-do-not-send="true">https://people.sc.fsu.edu/~jburkardt/c_src/hello_mpi/hello_mpi.c</a><br>
        </p>
        <p><a class="moz-txt-link-freetext"
href="https://people.sc.fsu.edu/~jburkardt/py_src/hello_mpi/hello_mpi.py"
            moz-do-not-send="true">https://people.sc.fsu.edu/~jburkardt/py_src/hello_mpi/hello_mpi.py</a><br>
        </p>
        <p>Benson<br>
        </p>
        <div class="moz-cite-prefix">On 7/16/19 4:30 PM, Pär Lundö
          wrote:<br>
        </div>
        <blockquote type="cite"
          cite="mid:8972685e424e478bbf460e64e82d3407@foi.se">
          <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Hi,
          </div>
          <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Thank
            you for your quick answer! </div>
          <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">I’ll
            look into that, but they share the same hosts-file and the
            DHCP-server sets their hostname. </div>
          <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;"><br>
          </div>
          <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">However,
            I came across the "TmpFS" setting in the slurm.conf file, and
            there was a note about it in the MPI guide on the Slurm
            website. I implemented the proposed changes, but still
            no luck. </div>
          <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;"><br>
          </div>
          <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Best
            regards, </div>
          <div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Palle
            <br>
            <br>
            <hr> <b>From:</b> "slurm-users" <a
              class="moz-txt-link-rfc2396E"
              href="mailto:slurm-users-bounces@lists.schedmd.com"
              moz-do-not-send="true"><slurm-users-bounces@lists.schedmd.com></a>
            <br>
            <b>Sent:</b> 16 juli 2019 12:32 <br>
            <b>To:</b> "Slurm User Community List" <a
              class="moz-txt-link-rfc2396E"
              href="mailto:slurm-users@lists.schedmd.com"
              moz-do-not-send="true"><slurm-users@lists.schedmd.com></a>
            <br>
            <b>Subject:</b> Re: [slurm-users] Running pyMPI on several
            nodes <br>
            <br>
          </div>
          <div dir="ltr">
            <div>srun: error: Application launch failed: Invalid node
              name specified <br>
            </div>
            <div><br>
            </div>
            <div>Hearns Law. All batch system problems are DNS
              problems.  </div>
            <div><br>
            </div>
            <div>Seriously though - check out your name resolution both
              on the head node and the compute nodes. </div>
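            <div>A minimal sketch of such a check, assuming Python 3 is
              available on every node. The helper name check_resolution
              and the script name in the usage note are illustrative, not
              part of Slurm. It only asks the local resolver what each
              name maps to, so it would need to be run on the head node
              and on each compute node and the answers compared by hand:</div>
            <pre>
```python
import socket
import sys


def check_resolution(names):
    """Map each hostname to the sorted IPv4 addresses the local resolver returns.

    A name that cannot be resolved maps to an empty list.
    """
    results = {}
    for name in names:
        try:
            infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
        except socket.gaierror:
            results[name] = []
            continue
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address as a string.
        results[name] = sorted({info[4][0] for info in infos})
    return results


if __name__ == "__main__":
    # Hostnames to check are passed on the command line.
    print("resolving from", socket.gethostname())
    for name, addresses in check_resolution(sys.argv[1:]).items():
        print(name, "=", ", ".join(addresses) if addresses else "NOT RESOLVED")
```
            </pre>
            <div>For the nodes in this thread it could be invoked as
              "python3 check_names.py lxclient10 lxclient11" (a
              hypothetical file name). If the output differs between
              machines, that would point at name resolution.</div>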
            <div><br>
            </div>
          </div>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Tue, 16 Jul 2019 at
              08:49, Pär Lundö < <a href="mailto:par.lundo@foi.se"
                moz-do-not-send="true"> par.lundo@foi.se</a>> wrote:
              <br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px
              0.8ex;border-left:1px solid
              rgb(204,204,204);padding-left:1ex">
              <div bgcolor="#FFFFFF">
                <p>Hi,</p>
                <p>I have now had the time to look at some of your
                  suggestions.</p>
                <p>First I tried running "srun -N1 hostname" via an
                  sbatch script, with two nodes up and running.
                  <br>
                  "sinfo" shows that two nodes are up and idle prior to
                  submitting the sbatch script.<br>
                  After submitting the job, I receive the following
                  error:</p>
                <p>"srun: error: Task launch for 86.0 failed on node
                  lxclient11: Invalid node name specified.<br>
                  srun: error: Application launch failed: Invalid node
                  name specified<br>
                  srun: Job step aborted: Waiting up to 32 seconds for
                  job step to finish.<br>
                  srun: error: Timed out waiting for job step to
                  complete"</p>
                <p><br>
                </p>
                <p>From the log file at the client I get a more detailed
                  error:<br>
                  " Launching batch job 86 for UID 1000 <br>
                  [86.batch] error: Invalid host_index -1 for job 86<br>
                  [86.batch] error: Host lxclient10 not in hostlist
                  lxclient11<br>
                  [86.batch] task_pre_launch: Using sched_affinity for
                  tasks<br>
                  rpc_launch_tasks: Invalid node list (lxclient10 not in
                  lxclient11)"<br>
                </p>
                <p>My two nodes are called lxclient10 and lxclient11.<br>
                  Why is my batch job launched with UID 1000?
                  Shouldn't it be launched as the slurm user (which in
                  my case has UID 64030)? <br>
                  What does it mean that one node is not in the
                  host list? <br>
                  The two nodes and the server share the same set of
                  IP addresses in the "/etc/hosts" file.</p>
                <p>-> This was resolved: lxclient10 was
                  marked as down. After bringing it back up, submitting
                  the same sbatch script produced no error.<br>
                  However, when running it on two nodes I get an error: <br>
                  "srun: error: Job Step 88.0 aborted before step
                  completely launched.<br>
                  srun: error: Job step aborted: Waiting up to 32
                  seconds for job step to finish.<br>
                  srun: error: task 1 launch failed: Unspecified error<br>
                  srun: error: lxclient10: task 0: Killed"</p>
                <p>And in the slurmctld.log file from the client I get
                  an error similar to the one previously stated, namely
                  that pmix cannot bind UNIX socket
                  /var/spool/slurmd/stepd.slurm.pmix.88.0: Address
                  already in use (98)</p>
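                <p>A minimal sketch of how this kind of error can arise,
                  assuming Linux and that two processes try to bind a
                  UNIX socket at the same filesystem path (for instance
                  because the directory holding it is shared). The
                  helper try_bind and the socket file name are
                  hypothetical, and this is not Slurm's own code:</p>
                <pre>
```python
import errno
import os
import socket
import tempfile


def try_bind(path):
    """Bind a UNIX socket at path; return (socket, None) or (None, errno)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError as exc:
        sock.close()
        return None, exc.errno
    return sock, None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spool:
        # Hypothetical stand-in for a per-step socket in a spool directory.
        path = os.path.join(spool, "stepd.example.sock")
        first, first_error = try_bind(path)
        second, second_error = try_bind(path)
        print("first bind error:", first_error)
        print("second bind error:", errno.errorcode.get(second_error), second_error)
        first.close()
```
                </pre>
                <p>The second bind fails because the socket file already
                  exists at that path, whichever process created it.
                  Tools such as lsof can then help find out who holds
                  the existing socket.</p>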
                <p>I ran the lsof command, but I don't really know what I
                  am looking for. When I grep for the
                  different node names, I can see that the two nodes have
                  mounted the NFS partition and that a link is established.</p>
                <p>"As an aside, you have checked that your username
                  exists on that compute server?      getent passwd par<br>
                  Also that your home directory is mounted - or
                  something substituting for your home directory?"<br>
                  Yes, the user slurm exists on both nodes and has the
                  same UID.<br>
                  <br>
                  "Have you tried <br>
                  <br>
                  <br>
                          srun -N# -n# mpirun python3 .... <br>
                  <br>
                  <br>
                  Perhaps you have no MPI environment being setup for
                  the processes?  There was no "--mpi" flag in your
                  "srun" command and we don't know if you have a default
                  value for that or not. <br>
                </p>
                "
                <p>In my slurm.conf file I do specify
                  "MpiDefault=pmix". (The log file shows
                  that something is wrong with pmix: the
                  address is already in use.)<br>
                </p>
                <p>One thing that just struck me is that I run these
                  nodes as a pair of diskless nodes, which boot and
                  mount the same filesystem supplied by a
                  server. They run different PIDs for different processes,
                  which should not affect one another, right?</p>
                <p><br>
                </p>
                <p>Best regards,</p>
                <p>Palle<br>
                </p>
                <div class="gmail-m_5257842815186023450moz-cite-prefix">On
                  2019-07-12 19:34, Pär Lundö wrote: <br>
                </div>
                <blockquote>
                  <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Hi,
                  </div>
                  <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black"><br>
                  </div>
                  <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Thank
                    you so much for your quick responses! </div>
                  <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">It
                    is much appreciated. </div>
                  <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">I
                    don't have access to the cluster until next week, but
                    I’ll be sure to follow up on all of your suggestions
                    and get back to you next week. </div>
                  <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black"><br>
                  </div>
                  <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Have
                    a nice weekend! </div>
                  <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Best
                    regards </div>
                  <div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Palle
                    <br>
                    <br>
                    <hr> <b>From:</b> "slurm-users" <a
                      class="gmail-m_5257842815186023450moz-txt-link-rfc2396E"
href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank"
                      moz-do-not-send="true">
                      <slurm-users-bounces@lists.schedmd.com></a>
                    <br>
                    <b>Sent:</b> 12 juli 2019 17:37 <br>
                    <b>To:</b> "Slurm User Community List" <a
                      class="gmail-m_5257842815186023450moz-txt-link-rfc2396E"
                      href="mailto:slurm-users@lists.schedmd.com"
                      target="_blank" moz-do-not-send="true">
                      <slurm-users@lists.schedmd.com></a> <br>
                    <b>Subject:</b> Re: [slurm-users] Running pyMPI on
                    several nodes <br>
                    <br>
                  </div>
                  <div dir="ltr">
                    <div>Par, by 'poking around' Chris means to use
                      tools such as netstat and lsof. </div>
                    <div>Also I would look at ps -eaf --forest to make
                      sure there are no 'orphaned' jobs sitting on that
                      compute node. </div>
                    <div><br>
                    </div>
                    <div>Having said that though, I have a dim memory of
                      a classic PBSPro error message which says
                      something about a network connection, </div>
                    <div>but really means that you cannot open a remote
                      session on that compute server. </div>
                    <div><br>
                    </div>
                    <div>As an aside, you have checked that your
                      username exists on that compute server?      getent
                      passwd par    </div>
                    <div>Also that your home directory is mounted - or
                      something substituting for your home directory? </div>
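                    <div>A rough Python equivalent of that check, as a
                      sketch that assumes Python 3 on a Unix-like compute
                      node. The helper name check_user is illustrative.
                      It looks the user up through the system account
                      database and also reports whether the home
                      directory is present:</div>
                    <pre>
```python
import os
import pwd


def check_user(username):
    """Return (uid, home, home_exists) for username, or None if the user is unknown."""
    try:
        entry = pwd.getpwnam(username)
    except KeyError:
        return None
    return entry.pw_uid, entry.pw_dir, os.path.isdir(entry.pw_dir)


if __name__ == "__main__":
    # "par" is the username used in this thread; substitute your own.
    print(check_user("par"))
```
                    </pre>
                    <div>Running it on the head node and on each compute
                      node should give the same UID everywhere, with the
                      home directory reported as present.</div>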
                    <div><br>
                    </div>
                  </div>
                  <br>
                  <div class="gmail_quote">
                    <div class="gmail_attr" dir="ltr">On Fri, 12 Jul
                      2019 at 15:55, Chris Samuel < <a
                        href="mailto:chris@csamuel.org" target="_blank"
                        moz-do-not-send="true"> chris@csamuel.org</a>>
                      wrote: <br>
                    </div>
                    <blockquote class="gmail_quote" style="margin:0px
                      0px 0px 0.8ex;padding-left:1ex;border-left:1px
                      solid rgb(204,204,204)"> On 12/7/19 7:39 am, Pär
                      Lundö wrote: <br>
                      <br>
                      > Presumably, the first 8 tasks originate from
                      the first node (in this <br>
                      > case lxclient11), and the other node
                      (lxclient10) responds as <br>
                      > predicted. <br>
                      <br>
                      That looks right, it seems the other node has two
                      processes fighting <br>
                      over the same socket and that's breaking Slurm
                      there. <br>
                      <br>
                      > Is it necessary to have passwordless ssh
                      communication alongside the <br>
                      > munge authentication? <br>
                      <br>
                      No, srun doesn't need (or use) that at all. <br>
                      <br>
                      > In addition I checked the slurmctld-log from
                      both the server and client <br>
                      > and found something (noted in bold): <br>
                      <br>
                      This is from the slurmd log on the client from the
                      look of it. <br>
                      <br>
                      > *[2019-07-12T14:57:53.771][83.0]
                      task_p_pre_launch: Using sched affinity <br>
                      > for tasks lurm.pmix.83.0: Address already in
                      use[98]* <br>
                      > [2019-07-12T14:57:53.682][83.0] error:
                      lxclient[0] /pmix.server.c:386 <br>
                      > [pmix_stepd_init] mpi/pmix: ERROR:
                      pmixp_usock_create_srv <br>
                      > [2019-07-12T14:57:53.683][83.0] error: (null)
                      [0] /mpi_pmix:156 <br>
                      > [p_mpi_hook_slurmstepd_prefork] mpi/pmix:
                      ERROR: pmixp_stepd_init() failed <br>
                      <br>
                      That indicates that something else has grabbed the
                      socket it wants and <br>
                      that's why the setup of the MPI ranks on the
                      second node fails. <br>
                      <br>
                      You'll want to poke around there to see what's
                      using it. <br>
                      <br>
                      Best of luck! <br>
                      Chris <br>
                      -- <br>
                        Chris Samuel  :  <a
                        href="http://www.csamuel.org/" rel="noreferrer"
                        target="_blank" moz-do-not-send="true">
                        http://www.csamuel.org/</a>  :  Berkeley, CA,
                      USA <br>
                      <br>
                    </blockquote>
                  </div>
                </blockquote>
              </div>
            </blockquote>
          </div>
        </blockquote>
      </blockquote>
    </blockquote>
  </body>
</html>