<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hi Benson!</p>
<p>Yeah, it was via an NFS-share.</p>
<p>Best regards,<br>
Pälle<br>
</p>
<div class="moz-cite-prefix">On 2019-08-13 08:30, Benson Muite
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:2f0137ec-820f-6444-69f8-4dda2965ee06@emailplus.org">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<p>Hi Pälle!<br>
</p>
<p>Great. It would be helpful to know how they shared the etc
directory. NFS?</p>
<p>Benson<br>
</p>
<div class="moz-cite-prefix">On 8/13/19 9:25 AM, Pär Lundö wrote:<br>
</div>
<blockquote type="cite"
cite="mid:9c36714a-03c9-64eb-6af2-63ad2c55eae8@foi.se">
<p>Hi!</p>
<p>I have now had the chance to look into this matter more
thoroughly and it seems that the problem was due to the fact
that the nodes are diskless and shared some data (e.g.
"etc"-dir). I removed that dependency and mounted each node to
a unique set of folders, which resolved the issue. Presumably,
this can be done in other ways unknown to me, but it helped me
and I can now run multiple nodes via MPI.</p>
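<p>For anyone running into the same thing, the idea was roughly
the following (a sketch only; the server name and export paths are
placeholders, not my actual layout):</p>
<pre>
# /etc/fstab on lxclient10 (lxclient11 mounts its own copies)
nfsserver:/export/nodes/lxclient10/etc          /etc               nfs  defaults  0 0
nfsserver:/export/nodes/lxclient10/spool-slurmd /var/spool/slurmd  nfs  defaults  0 0
# i.e. each diskless node gets its own writable copy of the state
# directories instead of all nodes sharing one exported tree
</pre>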
<p>Thank you for your help!</p>
<p>Best regards,<br>
Pälle L<br>
</p>
<div class="moz-cite-prefix">On 2019-07-16 15:49, Benson Muite
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:6c81032b-5a45-3565-1df8-c9fc6a9192f2@emailplus.org">
<p>Hi,</p>
<p>Does a regular MPI program run on two nodes? For example
helloworld:</p>
<p><a class="moz-txt-link-freetext"
href="https://people.sc.fsu.edu/~jburkardt/c_src/hello_mpi/hello_mpi.c"
moz-do-not-send="true">https://people.sc.fsu.edu/~jburkardt/c_src/hello_mpi/hello_mpi.c</a><br>
</p>
<p><a class="moz-txt-link-freetext"
href="https://people.sc.fsu.edu/~jburkardt/py_src/hello_mpi/hello_mpi.py"
moz-do-not-send="true">https://people.sc.fsu.edu/~jburkardt/py_src/hello_mpi/hello_mpi.py</a><br>
</p>
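<p>Something along these lines would also do as a first test (a
minimal sketch assuming the mpi4py package is installed; it is not
the same code as the files above):</p>
<pre>
# hello_mpi.py - minimal multi-node MPI check (sketch, assumes mpi4py)
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("Hello from rank %d of %d on %s"
      % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))

# launched across two nodes with e.g.:
#   srun -N2 -n4 python3 hello_mpi.py
</pre>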
<p>Benson<br>
</p>
<div class="moz-cite-prefix">On 7/16/19 4:30 PM, Pär Lundö
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:8972685e424e478bbf460e64e82d3407@foi.se">
<style type="text/css">html { -webkit-text-size-adjust: none; -webkit-touch-callout: none;}</style>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Hi,
</div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Thank
you for your quick answer! </div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">I’ll
look into that, but they share the same hosts-file and the
DHCP-server sets their hostname. </div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;"><br>
</div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">However
I came across a setting in the slurm.conf-file, ”TmpFS”, and
there was a note regarding it in the MPI guide on the
Slurm webpage. I implemented the proposed changes but
still no luck. </div>
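<p>For reference, what I tried looked roughly like this (a sketch,
not my exact slurm.conf; the paths are only examples):</p>
<pre>
# slurm.conf (sketch) - on diskless nodes both directories should be
# node-local, not part of the shared filesystem
TmpFS=/tmp
SlurmdSpoolDir=/var/spool/slurmd
</pre>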
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;"><br>
</div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Best
regards, </div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Palle
<br>
<br>
<hr> <b>From:</b> "slurm-users" <a
class="moz-txt-link-rfc2396E"
href="mailto:slurm-users-bounces@lists.schedmd.com"
moz-do-not-send="true"><slurm-users-bounces@lists.schedmd.com></a>
<br>
<b>Sent:</b> 16 juli 2019 12:32 <br>
<b>To:</b> "Slurm User Community List" <a
class="moz-txt-link-rfc2396E"
href="mailto:slurm-users@lists.schedmd.com"
moz-do-not-send="true"><slurm-users@lists.schedmd.com></a>
<br>
<b>Subject:</b> Re: [slurm-users] Running pyMPI on several
nodes <br>
<br>
</div>
<div dir="ltr">
<div>srun: error: Application launch failed: Invalid node
name specified <br>
</div>
<div><br>
</div>
<div>Hearns Law. All batch system problems are DNS
problems. </div>
<div><br>
</div>
<div>Seriously though - check out your name resolution
both on the head node and the compute nodes. </div>
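<p>For example (a sketch only; the hostnames are the ones from
your messages):</p>
<pre>
# run on the head node and on each compute node
hostname -s                           # should match NodeName in slurm.conf
getent hosts lxclient10 lxclient11    # forward lookup via /etc/hosts or DNS
scontrol show node lxclient10 | grep -i nodehostname
</pre>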
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, 16 Jul 2019 at
08:49, Pär Lundö < <a href="mailto:par.lundo@foi.se"
moz-do-not-send="true"> par.lundo@foi.se</a>>
wrote: <br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>Hi,</p>
<p>I have now had the time to look at some of your
suggestions.</p>
<p>First I tried running "srun -N1 hostname" via an
sbatch-script, while having two nodes up and
running. <br>
"sinfo" yields that two nodes are up and idle prior
to submitting the sbatch-script.<br>
After submitting the job, I receive an error stating
that:</p>
<p>"srun: error: Task launch for 86.0 failed on node
lxclient11: Invalid node name specified.<br>
srun: error: Application launch failed: Invalid node
name specified<br>
srun: Job step aborted: Waiting up to 32 seconds for
job step to finish.<br>
srun: error: Timed out waiting for job step to
complete"</p>
<p><br>
</p>
<p>From the log file at the client I get a more
detailed error:<br>
" Launching batch job 86 for UID 1000 <br>
[86.batch] error: Invalid host_index -1 for job 86<br>
[86.batch] error: Host lxclient10 not in hostlist
lxclient11<br>
[86.batch] task_pre_launch: Using sched_affinity for
tasks<br>
rpc_launch_tasks: Invalid node list (lxclient10 not
in lxclient11)"<br>
</p>
<p>My two nodes are called lxclient10 and lxclient11.<br>
Why is my batch job launched with the UID 1000;
shouldn't it be launched via the slurm user (which in
my case has the UID 64030)? <br>
What is meant by the different nodes not being in
the node list? <br>
The two nodes and the server share the same setup of
IP-addresses in the "/etc/hosts"-file.</p>
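<p>(Roughly like this; the addresses and the server name below are
placeholders, not the real ones:)</p>
<pre>
# /etc/hosts shared by the server and both nodes (sketch)
192.168.1.10   lxclient10
192.168.1.11   lxclient11
192.168.1.1    lxserver
</pre>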
<p>-> This was resolved: lxclient10 was marked as
down. After getting it back up, submitting the same
sbatch-script resulted in no error.<br>
However, running it on two nodes I get an error: <br>
"srun: error: Job Step 88.0 aborted before step
completely launched.<br>
srun: error: Job step aborted: Waiting up to 32
seconds for job step to finish.<br>
srun: error: task 1 launch failed: Unspecified
error<br>
srun: error: lxclient10: task 0: Killed"</p>
<p>And in the slurmctld.log-file from the client I get
an error similar to that previously stated, that pmix
cannot bind the UNIX socket
/var/spool/slurmd/stepd.slurm.pmix.88.0: Address
already in use (98)</p>
<p>I ran the lsof command, but I don't really know
what I am looking for. If I grep for the different
node names, I can see that the two nodes have mounted
the NFS partition and that a link is established.</p>
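<p>(What I ran was along these lines; a sketch, and the socket path
is simply the one from the error above:)</p>
<pre>
lsof /var/spool/slurmd/stepd.slurm.pmix.88.0   # who holds the socket file?
ss -xlp | grep pmix                            # listening UNIX sockets
lsof | grep -E 'lxclient10|lxclient11'         # where the node names show up
</pre>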
<p>"As an aside, you have checked that your username
exists on that compute server? getent passwd par<br>
Also that your home directory is mounted - or
something substituting for your home directory?"<br>
Yes, the user slurm exists on both nodes and has
the same UID.<br>
<br>
"Have you tried <br>
<br>
<br>
srun -N# -n# mpirun python3 .... <br>
<br>
<br>
Perhaps you have no MPI environment being setup for
the processes? There was no "--mpi" flag in your
"srun" command and we don't know if you have a
default value for that or not." <br>
</p>
<p>In my slurm.conf-file I do specify
"MpiDefault=pmix" (and it can be seen in the logfile
that there is something wrong with pmix: the
address is already in use).<br>
</p>
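<p>Spelled out, the launch then amounts to something like this
(a sketch; the script name and task counts are placeholders):</p>
<pre>
srun -N2 -n16 --mpi=pmix python3 my_script.py
# with MpiDefault=pmix the --mpi flag can be omitted
</pre>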
<p>One thing that struck my mind now is that I run
these nodes as a pair of diskless nodes, which boot
and mount the same filesystem supplied by a server.
They run different PIDs for different processes,
which should not affect one another(?), right?</p>
<p><br>
</p>
<p>Best regards,</p>
<p>Palle<br>
</p>
<div
class="gmail-m_5257842815186023450moz-cite-prefix">On
2019-07-12 19:34, Pär Lundö wrote: <br>
</div>
<blockquote>
<div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Hi,
</div>
<div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black"><br>
</div>
<div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Thank
you so much for your quick responses! </div>
<div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">It
is much appreciated. </div>
<div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">I
don't have access to the cluster until next week,
but I'll be sure to follow up on all of your
suggestions and get back to you next week. </div>
<div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black"><br>
</div>
<div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Have
a nice weekend! </div>
<div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Best
regards </div>
<div dir="ltr"
style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Palle
<br>
<br>
<hr> <b>From:</b> "slurm-users" <a
class="gmail-m_5257842815186023450moz-txt-link-rfc2396E"
href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank"
moz-do-not-send="true">
<slurm-users-bounces@lists.schedmd.com></a>
<br>
<b>Sent:</b> 12 juli 2019 17:37 <br>
<b>To:</b> "Slurm User Community List" <a
class="gmail-m_5257842815186023450moz-txt-link-rfc2396E"
href="mailto:slurm-users@lists.schedmd.com"
target="_blank" moz-do-not-send="true">
<slurm-users@lists.schedmd.com></a> <br>
<b>Subject:</b> Re: [slurm-users] Running pyMPI on
several nodes <br>
<br>
</div>
<div dir="ltr">
<div>Par, by 'poking around' Chris means to use
tools such as netstat and lsof. </div>
<div>Also I would look at ps -eaf --forest to make
sure there are no 'orphaned' jobs sitting on
that compute node. </div>
<div><br>
</div>
<div>Having said that though, I have a dim memory
of a classic PBSPro error message which says
something about a network connection, </div>
<div>but really means that you cannot open a
remote session on that compute server. </div>
<div><br>
</div>
<div>As an aside, you have checked that your
username exists on that compute server?
getent passwd par </div>
<div>Also that your home directory is mounted - or
something substituting for your home directory?
</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div class="gmail_attr" dir="ltr">On Fri, 12 Jul
2019 at 15:55, Chris Samuel < <a
href="mailto:chris@csamuel.org"
target="_blank" moz-do-not-send="true">
chris@csamuel.org</a>> wrote: <br>
</div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;padding-left:1ex;border-left:1px
solid rgb(204,204,204)"> On 12/7/19 7:39 am, Pär
Lundö wrote: <br>
<br>
> Presumably, the first 8 tasks originates
from the first node (in this <br>
> case the lxclient11), and the other node
(lxclient10) response as <br>
> predicted. <br>
<br>
That looks right, it seems the other node has
two processes fighting <br>
over the same socket and that's breaking Slurm
there. <br>
<br>
> Is it neccessary to have passwordless ssh
communication alongside the <br>
> munge authentication? <br>
<br>
No, srun doesn't need (or use) that at all. <br>
<br>
> In addition I checked the slurmctld-log
from both the server and client <br>
> and found something (noted in bold): <br>
<br>
This is from the slurmd log on the client from
the look of it. <br>
<br>
> *[2019-07-12T14:57:53.771][83.0]
task_p_pre_launch: Using sched affinity <br>
> for tasks lurm.pmix.83.0: Address already
in use[98]* <br>
> [2019-07-12T14:57:53.682][83.0] error:
lxclient[0] /pmix.server.c:386 <br>
> [pmix_stepd_init] mpi/pmix: ERROR:
pmixp_usock_create_srv <br>
> [2019-07-12T14:57:53.683][83.0] error:
(null) [0] /mpi_pmix:156 <br>
> [p_mpi_hook_slurmstepd_prefork] mpi/pmix:
ERROR: pmixp_stepd_init() failed <br>
<br>
That indicates that something else has grabbed
the socket it wants and <br>
that's why the setup of the MPI ranks on the
second node fails. <br>
<br>
You'll want to poke around there to see what's
using it. <br>
<br>
Best of luck! <br>
Chris <br>
-- <br>
Chris Samuel : <a
href="http://www.csamuel.org/"
rel="noreferrer" target="_blank"
moz-do-not-send="true">
http://www.csamuel.org/</a> : Berkeley, CA,
USA <br>
<br>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<pre class="moz-signature" cols="72">
</pre>
</body>
</html>