<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Hello everyone,</p>
<p>We are facing a weird issue. On a regular basis, some compute
nodes go from <b>idle</b> -&gt; <b>idle*</b> -&gt; <b>down</b>
and then loop back to idle on their own. Slurm manages several nodes,
and this state cycle appears only for some pools of nodes.<br>
</p>
<p>We get the following trace on the compute node:<br>
</p>
<p>[2022-02-01T09:41:11.381] error: Munge decode failed: Invalid
credential<br>
[2022-02-01T09:41:11.381] ENCODED: Thu Jan 01 01:00:00 1970<br>
[2022-02-01T09:41:11.381] DECODED: Thu Jan 01 01:00:00 1970<br>
[2022-02-01T09:41:11.381] error: slurm_receive_msg_and_forward:
REQUEST_NODE_REGISTRATION_STATUS has authentication error: Invalid
authentication credential<br>
[2022-02-01T09:41:11.381] error: slurm_receive_msg_and_forward:
Protocol authentication error<br>
[2022-02-01T09:41:11.391] error: service_connection:
slurm_receive_msg: Protocol authentication error<br>
[2022-02-01T09:41:11.392] debug2: Finish processing RPC:
RESPONSE_FORWARD_FAILED</p>
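<p>One detail in this trace that may matter: both the ENCODED and
DECODED times are printed as Thu Jan 01 01:00:00 1970. Assuming the
node's time zone is UTC+01:00 (the +0100 offset that unmunge reports
further down), that is simply Unix timestamp 0, which would be
consistent with no timestamp having been recovered from the credential
at all, rather than with clock skew. A minimal Python sketch to check
the conversion; <code>format_like_slurm</code> is a hypothetical helper
name, not part of Slurm or MUNGE:</p>
<pre>
```python
from datetime import datetime, timedelta, timezone

# Assumption: the node runs at UTC+01:00, as the "+0100" in the
# unmunge output suggests.
LOCAL_TZ = timezone(timedelta(hours=1))


def format_like_slurm(epoch_seconds):
    """Render a Unix timestamp in the layout of the ENCODED/DECODED lines."""
    moment = datetime.fromtimestamp(epoch_seconds, LOCAL_TZ)
    return moment.strftime("%a %b %d %H:%M:%S %Y")


# Timestamp 0 is the Unix epoch; compare the result with the trace.
print(format_like_slurm(0))
```
</pre>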
<p>On the master, the only thing we get, and only sometimes, is:<br>
</p>
<p>- slurmctld.log:[2022-02-01T10:00:04.456] agent/is_node_resp:
node:node45 RPC:REQUEST_PING : Can't find an address, check
slurm.conf</p>
<p>On the Slurm master, the node IP addresses are not listed in
/etc/hosts; they are resolved through the DNS servers configured in
/etc/resolv.conf. One hypothesis is that our DNS server is
occasionally slow to respond.</p>
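<p>To test the DNS hypothesis, one option is to time name lookups for
the affected nodes from the master and watch for slow or failed
answers. The sketch below is only an illustration in Python, built on
the standard <code>socket.getaddrinfo</code> call;
<code>time_lookup</code> and <code>probe</code> are hypothetical helper
names, and node45 is just the node from the log above:</p>
<pre>
```python
import socket
import time


def time_lookup(hostname):
    """Resolve hostname once; return (seconds_taken, addresses or None)."""
    start = time.perf_counter()
    try:
        infos = socket.getaddrinfo(hostname, None)
        # The first field of each socket address is the IP as a string.
        addresses = sorted({info[4][0] for info in infos})
    except OSError:
        # socket.gaierror (resolution failure) is a subclass of OSError.
        addresses = None
    return time.perf_counter() - start, addresses


def probe(hostnames, rounds=3, pause=1.0):
    """Resolve every hostname `rounds` times, sleeping between rounds."""
    results = []
    for round_index in range(rounds):
        for name in hostnames:
            elapsed, addresses = time_lookup(name)
            results.append((name, elapsed, addresses))
        if round_index != rounds - 1:
            time.sleep(pause)
    return results
```
</pre>
<p>For example, <code>probe(["node45"], rounds=10, pause=5.0)</code>
run on the master returns one (name, seconds, addresses) tuple per
lookup; an entry with a long duration or with None as addresses around
the time a node flips to idle* would support the hypothesis.</p>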
<p>This happens at very regular intervals: exactly every 1h07, and for
some nodes every 3 minutes.<br>
</p>
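<p>To pin down the period more precisely, the gaps between consecutive
failures can be computed from the log timestamps. A rough sketch,
assuming the log keeps the [YYYY-MM-DDTHH:MM:SS.mmm] prefix shown in
the trace above; <code>failure_times</code> and
<code>gaps_in_seconds</code> are hypothetical helper names:</p>
<pre>
```python
import re
from datetime import datetime

# Matches the leading "[2022-02-01T09:41:11.381]" stamp of a log line.
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\]")


def failure_times(lines, needle="Munge decode failed"):
    """Return the timestamps of all log lines that contain `needle`."""
    times = []
    for line in lines:
        match = STAMP.match(line)
        if match and needle in line:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            times.append(stamp)
    return times


def gaps_in_seconds(times):
    """Return the delay between each pair of consecutive timestamps."""
    pairs = zip(times, times[1:])
    return [(later - earlier).total_seconds() for earlier, later in pairs]
```
</pre>
<p>Feeding it the lines of the compute node's log, for instance
<code>gaps_in_seconds(failure_times(open(path)))</code> where path is
wherever slurmd logs on that node, should show whether the gaps
cluster around 180 s (3 minutes) and around 4020 s (1h07).</p>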
<p>We thought this might be due to munge, but:</p>
<p>- <u>We resynchronized the munge keys.</u><br>
</p>
<p><u>- The time is correctly synchronized with an NTP server;</u>
running date as root on both nodes returns the same date.<br>
</p>
<p><u>- The munge uid/gid are correct:</u><br>
</p>
<p>root@node45:/var/log/slurm# ls -l /etc/munge/<br>
-r-------- 1 munge munge 1024 janv. 27 18:49 munge.key</p>
<p>- <u>We can encode and decode a credential successfully:</u><br>
</p>
<p>root@slurmaster:~$ munge -n | ssh node45 unmunge</p>
<p>STATUS: Success (0)<br>
ENCODE_HOST: node45 (127.0.1.1)<br>
ENCODE_TIME: 2022-02-01 10:22:21 +0100 (1643707341)<br>
DECODE_TIME: 2022-02-01 10:22:23 +0100 (1643707343)<br>
TTL: 300<br>
CIPHER: aes128 (4)<br>
MAC: sha256 (5)<br>
ZIP: none (0)<br>
UID: .....<br>
GID: ......<br>
LENGTH: 0<br>
<br>
</p>
<p>Do you have any ideas on how to debug and, hopefully, solve this
issue?<br>
</p>
<p>Thank you !</p>
<p>Jeremy<br>
</p>
</body>
</html>