<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>That looks like a DNS issue.</p>

    <p>Verify that all your nodes can resolve each other's names.</p>

    <p>Check /etc/resolv.conf, /etc/hosts and /etc/slurm/slurm.conf on

      the nodes (including head/login nodes) to ensure they all match.<br>

    </p>
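    <p>One way to run that check from each node is a small script along
      these lines. It is only a minimal sketch, not a Slurm tool: the names
      check_resolution and print_report are made up for illustration. It
      resolves each name several times and records whether the lookup
      succeeded and how long it took, so an intermittently slow or failing
      DNS server should show up as well.</p>

```python
import socket
import time


def check_resolution(hostnames, attempts=3, resolver=socket.getaddrinfo):
    """Resolve each hostname several times.

    Returns a dict mapping each name to a list of
    (ok, seconds, addresses) tuples, one per attempt.
    The resolver argument is an illustrative hook so the
    lookup function can be swapped out.
    """
    report = {}
    for name in hostnames:
        results = []
        for _ in range(attempts):
            start = time.monotonic()
            try:
                infos = resolver(name, None)
                # Each entry is (family, type, proto, canonname, sockaddr);
                # sockaddr[0] is the IP address as a string.
                addresses = sorted({info[4][0] for info in infos})
                ok = True
            except socket.gaierror:
                # The name could not be resolved on this attempt.
                addresses = []
                ok = False
            elapsed = time.monotonic() - start
            results.append((ok, elapsed, addresses))
        report[name] = results
    return report


def print_report(report):
    """Print one line per lookup attempt."""
    for name, results in report.items():
        for ok, elapsed, addresses in results:
            status = "ok" if ok else "FAILED"
            print(f"{name}: {status} in {elapsed * 1000:.1f} ms {addresses}")
```

    <p>For example, running
      print_report(check_resolution(["slurmaster", "node45"])) on the
      master and on a compute node, with the names replaced by the ones in
      your slurm.conf, should give the same addresses everywhere. A failed
      lookup, a slow one, or a different address on one host points to the
      configuration to fix.</p>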

    <p>Brian Andrus<br>

    </p>

    <p>On 2/1/2022 1:37 AM, Jeremy Fix wrote:<br>

    </p>

    <blockquote type="cite"

      cite="mid:2d3cb3bc-6a07-2952-8f20-dacbd6deb05f@centralesupelec.fr">


      <p>Hello everyone,</p>

      <p>We are facing a strange issue. At regular intervals, some compute
        nodes go from <b>idle</b> -&gt; <b>idle*</b> -&gt; <b>down</b>
        and then return to idle on their own. Our Slurm installation manages
        several nodes, and this state cycle appears only for some pools of nodes.<br>

      </p>

      <p>The log on the compute node shows:<br>

      </p>

      <p>[2022-02-01T09:41:11.381] error: Munge decode failed: Invalid

        credential<br>

        [2022-02-01T09:41:11.381] ENCODED: Thu Jan 01 01:00:00 1970<br>

        [2022-02-01T09:41:11.381] DECODED: Thu Jan 01 01:00:00 1970<br>

        [2022-02-01T09:41:11.381] error: slurm_receive_msg_and_forward:

        REQUEST_NODE_REGISTRATION_STATUS has authentication error:

        Invalid authentication credential<br>

        [2022-02-01T09:41:11.381] error: slurm_receive_msg_and_forward:

        Protocol authentication error<br>

        [2022-02-01T09:41:11.391] error: service_connection:

        slurm_receive_msg: Protocol authentication error<br>

        [2022-02-01T09:41:11.392] debug2: Finish processing RPC:

        RESPONSE_FORWARD_FAILED</p>

      <p>On the master, the only message we get, and only sometimes, is:<br>

      </p>


      <p>- slurmctld.log:[2022-02-01T10:00:04.456] agent/is_node_resp:

        node:node45 RPC:REQUEST_PING : Can't find an address, check

        slurm.conf</p>

      <p>On the Slurm master, the node IPs are not listed in /etc/hosts;
        they are resolved through the DNS servers set in /etc/resolv.conf.
        One hypothesis is that our DNS server is occasionally slow to respond. </p>

      <p>This happens at very regular intervals: exactly every 1h07 and,
        for some nodes, every 3 minutes.<br>

      </p>

      <p>We thought this might be due to munge, but:</p>

      <p>- <u>We tried resyncing the munge keys.</u><br>

      </p>

      <p>- <u>The time is correctly synchronized with an NTP server;</u>
        running date as root on both nodes returns the same date.<br>

      </p>

      <p>- <u>The munge uid/gid are correct:</u><br>

      </p>

      <p>root@node45:/var/log/slurm# ls -l /etc/munge/<br>

        -r-------- 1 munge munge 1024 janv. 27 18:49 munge.key</p>

      <p>- <u>We can encode and decode a credential successfully:</u><br>

      </p>

      <p>root@slurmaster:~$  munge -n | ssh node45 unmunge</p>

      <p>STATUS:           Success (0)<br>

        ENCODE_HOST:      node45 (127.0.1.1)<br>

        ENCODE_TIME:      2022-02-01 10:22:21 +0100 (1643707341)<br>

        DECODE_TIME:      2022-02-01 10:22:23 +0100 (1643707343)<br>

        TTL:              300<br>

        CIPHER:           aes128 (4)<br>

        MAC:              sha256 (5)<br>

        ZIP:              none (0)<br>

        UID:              .....<br>

        GID:              ......<br>

        LENGTH:           0<br>

        <br>

      </p>


      <p>Do you have any idea how to debug and, hopefully, solve this
        issue?<br>

      </p>

      <p>Thank you !</p>

      <p>Jeremy<br>

      </p>

    </blockquote>

  </body>

</html>