<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    I have tried to find a network error but can't see anything. Every

    node I've tested has the same (and correct) view of things.<br>

    <br>

    <u>On node cn7:</u> (the problematic one)<br>

    <tt>em1: link/ether 50:9a:4c:79:31:4d inet 10.28.3.137/24</tt><br>

    <br>

    <u>On login machine:</u><br>

    <tt>[alex@li1 ~]$ host cn7</tt><tt><br>

    </tt><tt>cn7.ydesign.se has address 10.28.3.137</tt><tt><br>

    </tt><tt>[alex@li1 ~]$ arp cn7</tt><tt><br>

    </tt><tt>Address                  HWtype  HWaddress           Flags

      Mask            Iface</tt><tt><br>

    </tt><tt>cn7.ydesign.se           ether   50:9a:4c:79:31:4d  

      C                     em1</tt><br>

    <br>

    <u>On slurmctld machine:</u><br>

    <tt>[alex@cmgr1 ~]$ host cn7</tt><tt><br>

    </tt><tt>cn7.ydesign.se has address 10.28.3.137</tt><tt><br>

    </tt><tt>[alex@cmgr1 ~]$ arp cn7</tt><tt><br>

    </tt><tt>Address                  HWtype  HWaddress           Flags

      Mask            Iface</tt><tt><br>

    </tt><tt>cn7.ydesign.se           ether   50:9a:4c:79:31:4d  

      C                     em1</tt><br>

    <br>
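    The same comparison can be scripted when more machines need checking.
    A minimal Python sketch, assuming the output of <tt>arp cn7</tt> has
    already been collected from each machine as text. The function names
    <tt>mac_from_arp_output</tt> and <tt>hosts_agree</tt> are hypothetical
    helpers, not part of Slurm or net-tools:<br>
    <pre>
```python
import re

# Matches a colon-separated Ethernet address such as 50:9a:4c:79:31:4d.
MAC_RE = re.compile(r"^(?:[0-9a-f]{2}:){5}[0-9a-f]{2}$", re.IGNORECASE)


def mac_from_arp_output(output, hostname):
    """Return the hardware address listed for hostname in `arp HOST` output.

    Returns None when no row for hostname is present.
    """
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and rows for other hosts.
        if not fields or fields[0] != hostname:
            continue
        for field in fields[1:]:
            if MAC_RE.match(field):
                return field.lower()
    return None


def hosts_agree(outputs, hostname, expected_mac):
    """Map each machine name to True if its ARP row matches expected_mac.

    outputs is a dict of machine name to the text that machine printed
    for `arp HOST` (hypothetical input format, gathered by hand or ssh).
    """
    expected = expected_mac.lower()
    return {
        machine: mac_from_arp_output(text, hostname) == expected
        for machine, text in outputs.items()
    }
```
    </pre>
    Given the two outputs above and the address seen on em1 of cn7, this
    should report <tt>True</tt> for both li1 and cmgr1.<br>
    <br>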

    <br>

    Yes, I have seen your pages and must say that they have been pure

    gold on many occasions, thanks a lot Ole! But our cluster is still

    tiny and the whole cluster is located in its own network segment.

    The number of ARP entries is far from 512 (actually, more like ~30).<br>

    <br>
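    For reference, the entry count can be compared with the kernel
    threshold directly. A rough Python sketch, assuming a Linux host where
    <tt>/proc/net/arp</tt> is readable, and assuming that the 512 figure
    corresponds to the <tt>gc_thresh2</tt> sysctl (512 is its usual
    default, but that mapping is my assumption). The helper names are
    hypothetical:<br>
    <pre>
```python
from pathlib import Path


def count_arp_entries(proc_net_arp_text):
    """Count complete IPv4 neighbour entries in /proc/net/arp style text."""
    count = 0
    # The first line is the column header, so skip it.
    for row in proc_net_arp_text.splitlines()[1:]:
        fields = row.split()
        # Columns: IP address, HW type, Flags, HW address, Mask, Device.
        # Flags 0x0 marks an incomplete entry, which is not counted here.
        if len(fields) == 6 and fields[2] != "0x0":
            count += 1
    return count


def headroom(entry_count, threshold):
    """Return how many more entries fit before the threshold is reached."""
    return threshold - entry_count


def report(arp_path="/proc/net/arp",
           thresh_path="/proc/sys/net/ipv4/neigh/default/gc_thresh2"):
    """Summarise the local ARP table size against gc_thresh2.

    Assumption: gc_thresh2 is the limit that the 512 figure refers to.
    """
    count = count_arp_entries(Path(arp_path).read_text())
    threshold = int(Path(thresh_path).read_text())
    return (f"{count} ARP entries, threshold {threshold}, "
            f"headroom {headroom(count, threshold)}")
```
    </pre>
    Calling <tt>print(report())</tt> on li1 or cmgr1 would give the local
    numbers; with about 30 entries and a threshold of 512 the headroom is
    large.<br>
    <br>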

    What I don't understand is why sbatch works but srun does not.<br>

    Could this be an error in the state files, perhaps something that

    got corrupted when the node (cn7) unexpectedly died?<br>

    <br>

    Regards,<br>

    Alexander<br>

    <br>

    <br>

    <br>

    <div class="moz-cite-prefix">On 2019-05-29 at 15:12, Ole Holm Nielsen

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:0ab26468-c785-f2bc-93d5-140f57d436dd@fysik.dtu.dk">Hi

      Alexander,

      <br>

      <br>

      The error "can't find address for host cn7" would indicate a DNS

      problem.  What is the output of "host cn7" from the srun host li1?

      <br>

      <br>

      How many network devices are in your subnet?  It may be that the

      Linux kernel is doing "ARP cache thrashing" if the number of

      devices approaches 512.  What is the result of "arp cn7"?

      <br>

      <br>

      To fix ARP cache thrashing, see my Slurm Wiki page

      <br>

<a class="moz-txt-link-freetext" href="https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks">https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks</a>

      <br>

      <br>

      Best regards,

      <br>

      Ole

      <br>

      <br>

      On 5/29/19 3:00 PM, Alexander Åhman wrote:

      <br>

      <blockquote type="cite">Hi,

        <br>

        I have a very strange problem. The cluster has been working just

        fine until one node died and now I can't submit jobs to 2 of the

        nodes using srun from the login machine. Using sbatch works just

        fine and also if I use srun from the same host as slurmctld.

        <br>

        All the other nodes work just fine, as they always have; only 2

        nodes are experiencing this problem. Very strange...

        <br>

        <br>

        I have checked network connectivity and DNS, and both are OK. I

        can ping and ssh to all nodes just fine. All nodes are identical and

        using Slurm 18.08.

        <br>

        I also tried rebooting the 2 nodes and slurmctld, but the problem

        remains.

        <br>

        <br>

        [alex@li1 ~]$ srun -w cn7 hostname

        <br>

        srun: error: fwd_tree_thread: can't find address for host cn7,

        check slurm.conf

        <br>

        srun: error: Task launch for 1088816.0 failed on node cn7: Can't

        find an address, check slurm.conf

        <br>

        srun: error: Application launch failed: Can't find an address,

        check slurm.conf

        <br>

        srun: Job step aborted: Waiting up to 32 seconds for job step to

        finish.

        <br>

        srun: error: Timed out waiting for job step to complete

        <br>

        <br>

        [alex@li1 ~]$ srun -w cn6 hostname

        <br>

        cn6.ydesign.se

        <br>

        <br>

        What is this error "can't find address for host" about? I have

        searched the web but can't find any good information about what

        the problem is or what to do to resolve it.

        <br>

        <br>

        Any kind soul out there who knows what to do next?

        <br>

      </blockquote>

      <br>

    </blockquote>

    <br>

  </body>

</html>