<div dir="ltr"><div>srun: error: Application launch failed: Invalid node name specified<br></div><div><br></div><div>Hearns Law. All batch system problems are DNS problems. </div><div><br></div><div>Seriously though - check out your name resolution both on the head node and the compute nodes.</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 16 Jul 2019 at 08:49, Pär Lundö <<a href="mailto:par.lundo@foi.se">par.lundo@foi.se</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>Hi,</p>
<p>I have now had the time to look at some of your suggestions.</p>
    <p>First I tried running "srun -N1 hostname" via an sbatch script,
      with two nodes up and running. <br>
      "sinfo" shows that both nodes are up and idle before the
      sbatch script is submitted.<br>
      After submitting the job, I receive an error stating:</p>
<p>"srun: error: Task launch for 86.0 failed on node lxclient11:
Invalid node name specified.<br>
srun: error: Application launch failed: Invalid node name
specified<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.<br>
      srun: error: Timed out waiting for job step to complete"</p>
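    <p>(For reference, a minimal sketch of the kind of batch script being
      described - the job name and output path are illustrative assumptions,
      not taken from this thread:)</p>
    <pre>
#!/bin/bash
#SBATCH --job-name=hostname-test    # illustrative name only
#SBATCH --nodes=1
#SBATCH --output=hostname-%j.out

# launch one task of "hostname" inside the allocation
srun -N1 hostname
</pre>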
<p><br>
</p>
    <p>From the log file on the client I get a more detailed error:<br>
" Launching batch job 86 for UID 1000 <br>
[86.batch] error: Invalid host_index -1 for job 86<br>
[86.batch] error: Host lxclient10 not in hostlist lxclient11<br>
[86.batch] task_pre_launch: Using sched_affinity for tasks<br>
rpc_launch_tasks: Invalid node list (lxclient10 not in
lxclient11)"<br>
</p>
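    <p>(One way to cross-check this, sketched with job id 86 from the log
      above; the second command is run on each compute node:)</p>
    <pre>
scontrol show job 86 | grep -i nodelist   # the node list slurmctld allocated to the job
scontrol show slurmd                      # the hostname this node's slurmd registered with
</pre>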
    <p>My two nodes are called lxclient10 and lxclient11.<br>
      Why is my batch job launched with UID 1000 - shouldn't it be
      launched via the slurm user (which in my case has UID 64030)?
      <br>
      What does it mean that the different nodes are not in the
      node list? <br>
      The two nodes and the server share the same set of IP addresses
      in their "/etc/hosts" files.</p>
    <p>-> This was resolved: it turned out that lxclient10 was marked as down.
      After getting it back up, submitting the same sbatch script
      resulted in no error.<br>
      However, running it on two nodes I get an error: <br>
"srun: error: Job Step 88.0 aborted before step completely
launched.<br>
srun: error: Job step aborted: Waiting up to 32 seconds for job
step to finish.<br>
      srun: error: task 1 launch failed: Unspecified error<br>
srun: error: lxclient10: task 0: Killed"</p>
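    <p>(As an aside to the node having been marked down: to see why a node
      is down or drained, and since when - a sketch:)</p>
    <pre>
sinfo -R                          # nodes that are down/drained, with the recorded reason
scontrol show node lxclient10 | grep -iE 'state|reason'
</pre>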
    <p>And in the slurmctld.log file from the client I get an error
      similar to the one previously stated: pmix cannot bind the UNIX
      socket /var/spool/slurmd/stepd.slurm.pmix.88.0: Address already in
      use (98)</p>
    <p>I ran the lsof command, but I don't really know what I am looking
      for. If I grep for the different node names, I can see that the
      two nodes have mounted the NFS partition and that a link is
      established.</p>
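    <p>(What to look for - a sketch of commands run on the affected node;
      the socket path is the one from the error above:)</p>
    <pre>
ls -l /var/spool/slurmd/              # stale stepd.slurm.pmix.* sockets left from earlier steps?
lsof -U | grep pmix                   # UNIX-domain sockets with "pmix" in the path, and their owners
lsof /var/spool/slurmd/stepd.slurm.pmix.88.0 2>/dev/null   # anything holding this exact socket open?
</pre>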
<p>"As an aside, you have checked that your username exists on that
compue server? getent passwd par<br>
Also that your home directory is mounted - or something
substituting for your home directory?"<br>
Yes, the user slurm exists on both nodes and have the same uid.<br>
<br>
"Have you tried
<br>
<br>
<br>
srun -N# -n# mpirun python3 ....
<br>
<br>
<br>
Perhaps you have no MPI environment being setup for the processes?
There was no "--mpi" flag in your "srun" command and we don't
know if you have a default value for that or not.
<br>
</p>
"
    <p>In my slurm.conf file I do specify "MpiDefault=pmix" (and it
      can be seen in the log file that there is something wrong with
      pmix: the address is already in use).<br>
    </p>
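    <p>(To confirm which MPI plugin types the build supports and what the
      configured default is - a sketch:)</p>
    <pre>
srun --mpi=list                        # MPI plugin types this Slurm installation supports
scontrol show config | grep -i mpidefault
</pre>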
    <p>One thing that struck me now is that I run these nodes as a
      pair of diskless nodes, which boot and mount the same filesystem
      supplied by a server. They run different PIDs for different
      processes, which should not affect one another(?), right?</p>
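    <p>(A related thing worth checking - a sketch; if the slurmd spool
      directory sits on the shared root rather than on node-local storage,
      both nodes would be creating their stepd.slurm.pmix.* sockets in the
      same place, which could explain the "Address already in use" error:)</p>
    <pre>
scontrol show config | grep -i slurmdspooldir   # where slurmd keeps its state and stepd sockets
df -hT /var/spool/slurmd                        # run on each node: "nfs" here means the directory is shared
</pre>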
<p><br>
</p>
<p>Best regards,</p>
<p>Palle<br>
</p>
<div class="gmail-m_5257842815186023450moz-cite-prefix">On 2019-07-12 19:34, Pär Lundö wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr" style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Hi,
</div>
<div dir="ltr" style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black"><br>
</div>
<div dir="ltr" style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Thank
you so much for your quick responses! </div>
<div dir="ltr" style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">It
is much appreciated. </div>
<div dir="ltr" style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">I
dont have access to the cluster until next week, but I’ll be
sure to follow up on all of your suggestions and get back you
next week.
</div>
<div dir="ltr" style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black"><br>
</div>
<div dir="ltr" style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Have
a nice weekend! </div>
<div dir="ltr" style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Best
regards </div>
<div dir="ltr" style="font-size:11pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black">Palle
<br>
<br>
<hr>
<b>From:</b> "slurm-users"
<a class="gmail-m_5257842815186023450moz-txt-link-rfc2396E" href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank"><slurm-users-bounces@lists.schedmd.com></a> <br>
        <b>Sent:</b> 12 July 2019 17:37 <br>
<b>To:</b> "Slurm User Community List"
<a class="gmail-m_5257842815186023450moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com" target="_blank"><slurm-users@lists.schedmd.com></a> <br>
<b>Subject:</b> Re: [slurm-users] Running pyMPI on several nodes
<br>
<br>
</div>
<div dir="ltr">
      <div>Pär, by 'poking around' Chris means using tools such as
        netstat and lsof. </div>
      <div>Also, I would look at ps -eaf --forest to make sure there
        are no 'orphaned' jobs sitting on that compute node.
      </div>
<div><br>
</div>
<div>Having said that though, I have a dim memory of a classic
PBSPro error message which says something about a network
connection,
</div>
<div>but really means that you cannot open a remote session on
that compute server.
</div>
<div><br>
</div>
<div>As an aside, you have checked that your username exists on
        that compute server? getent passwd par
</div>
<div>Also that your home directory is mounted - or something
substituting for your home directory?
</div>
<div><br>
</div>
</div>
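      <p>(A concrete way to do that poking around - a sketch of commands to
        run on the compute node in question:)</p>
      <pre>
ps -eaf --forest                # orphaned slurmstepd or user processes left from earlier jobs?
ss -xlp | grep slurm            # UNIX sockets slurmd/slurmstepd are listening on, and their PIDs
netstat -anp | grep -i pmix     # equivalent view with netstat, if ss is not installed
</pre>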
<br>
<div class="gmail_quote">
<div class="gmail_attr" dir="ltr">On Fri, 12 Jul 2019 at 15:55,
Chris Samuel < <a href="mailto:chris@csamuel.org" target="_blank">
chris@csamuel.org</a>> wrote: <br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left:1px solid rgb(204,204,204)">On
12/7/19 7:39 am, Pär Lundö wrote: <br>
<br>
        > Presumably, the first 8 tasks originate from the first node (in this <br>
        > case the lxclient11), and the other node (lxclient10) responds as <br>
        > predicted. <br>
<br>
That looks right, it seems the other node has two processes
fighting <br>
over the same socket and that's breaking Slurm there. <br>
<br>
        > Is it necessary to have passwordless ssh communication
alongside the <br>
> munge authentication? <br>
<br>
No, srun doesn't need (or use) that at all. <br>
<br>
> In addition I checked the slurmctld-log from both the
server and client <br>
> and found something (noted in bold): <br>
<br>
This is from the slurmd log on the client from the look of it.
<br>
<br>
> *[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using
sched affinity <br>
> for tasks lurm.pmix.83.0: Address already in use[98]* <br>
> [2019-07-12T14:57:53.682][83.0] error: lxclient[0]
/pmix.server.c:386 <br>
> [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
<br>
> [2019-07-12T14:57:53.683][83.0] error: (null) [0]
/mpi_pmix:156 <br>
> [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR:
pmixp_stepd_init() failed <br>
<br>
That indicates that something else has grabbed the socket it
wants and <br>
that's why the setup of the MPI ranks on the second node
fails. <br>
<br>
You'll want to poke around there to see what's using it. <br>
<br>
Best of luck! <br>
Chris <br>
-- <br>
Chris Samuel : <a href="http://www.csamuel.org/" rel="noreferrer" target="_blank">
http://www.csamuel.org/</a> : Berkeley, CA, USA <br>
<br>
</blockquote>
</div>
</blockquote>
<pre class="gmail-m_5257842815186023450moz-signature" cols="72">--
Hälsningar, Pär
________________________________
Pär Lundö
Forskare
Avdelningen för Ledningssystem
FOI
Totalförsvarets forskningsinstitut
164 90 Stockholm
Besöksadress:
Olau Magnus väg 33, Linköping
Tel: +46 13 37 86 01
Mob: +46 734 447 815
Vxl: +46 13 37 80 00
<a class="gmail-m_5257842815186023450moz-txt-link-abbreviated" href="mailto:par.lundo@foi.se" target="_blank">par.lundo@foi.se</a>
<a class="gmail-m_5257842815186023450moz-txt-link-abbreviated" href="http://www.foi.se" target="_blank">www.foi.se</a></pre>
</div>
</blockquote></div>