[slurm-users] Compute nodes cycling from idle to down on a regular basis ?
Jeremy.Fix at centralesupelec.fr
Tue Feb 1 09:37:41 UTC 2022
We are facing a weird issue. On a regular basis, some compute nodes go
from *idle* -> *idle** -> *down* and then loop back to idle on their
own. Our Slurm installation manages several pools of nodes, and this
state cycle appears only for some of them.
We get the following trace on the compute node:
[2022-02-01T09:41:11.381] error: Munge decode failed: Invalid credential
[2022-02-01T09:41:11.381] ENCODED: Thu Jan 01 01:00:00 1970
[2022-02-01T09:41:11.381] DECODED: Thu Jan 01 01:00:00 1970
[2022-02-01T09:41:11.381] error: slurm_receive_msg_and_forward:
REQUEST_NODE_REGISTRATION_STATUS has authentication error: Invalid
[2022-02-01T09:41:11.381] error: slurm_receive_msg_and_forward: Protocol
[2022-02-01T09:41:11.391] error: service_connection: slurm_receive_msg:
Protocol authentication error
[2022-02-01T09:41:11.392] debug2: Finish processing RPC:
On the master, the only thing we get, and only sometimes, is:
- slurmctld.log:[2022-02-01T10:00:04.456] agent/is_node_resp:
node:node45 RPC:REQUEST_PING : Can't find an address, check slurm.conf
On the Slurm master, the node IPs are not listed in /etc/hosts; they are
resolved through the DNS servers configured in /etc/resolv.conf. One
hypothesis we have is that our DNS server sometimes takes a long time to
respond.
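
One way to test the DNS hypothesis would be to time name resolution for an affected node in a loop on the master. The sketch below is only an illustration: the function names and the 0.5 s threshold are arbitrary choices, and a local resolver cache (nscd, systemd-resolved) could hide slow upstream answers.

```python
import socket
import time


def time_lookup(hostname):
    """Resolve hostname once; return (elapsed_seconds, addresses or None on failure)."""
    start = time.perf_counter()
    try:
        # getaddrinfo uses the system resolver, so it follows the same
        # /etc/hosts and /etc/resolv.conf configuration as other programs.
        infos = socket.getaddrinfo(hostname, None)
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addresses = None
    return time.perf_counter() - start, addresses


def probe(hostname, attempts=5, pause=1.0, slow_threshold=0.5):
    """Resolve hostname repeatedly; return the attempts that were slow or failed."""
    suspicious = []
    for i in range(attempts):
        elapsed, addresses = time_lookup(hostname)
        if addresses is None or elapsed > slow_threshold:
            suspicious.append((i, elapsed, addresses))
        if i < attempts - 1:
            time.sleep(pause)
    return suspicious
```

Calling something like `probe("node45", attempts=60, pause=5.0)` on the master and checking whether the returned list is non-empty around the time a node turns to idle* would show whether slow or failed lookups line up with the state changes.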
The state cycle recurs at very regular intervals: exactly every 1h07,
and for some nodes every 3 minutes.
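
To confirm the period from the logs rather than by eye, the gaps between consecutive authentication errors can be computed from the timestamps. This is a minimal sketch that assumes every relevant line starts with a timestamp in the same `[YYYY-MM-DDTHH:MM:SS.mmm]` form as the slurmd trace above; the function name is hypothetical.

```python
import re
from datetime import datetime

# Matches the leading "[2022-02-01T09:41:11.381]" stamp of a log line.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\]")


def failure_intervals(lines, needle="Munge decode failed"):
    """Return the gaps, in seconds, between consecutive lines containing needle."""
    times = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and needle in line:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
```

Feeding it the lines of the slurmd log of an affected node (the file set by SlurmdLogFile in slurm.conf) should give gaps close to 4020 s for the 1h07 cycle and 180 s for the 3-minute one if the errors really follow the state changes.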
We thought this might be due to munge, but:
- We tried to resync the munge keys.
- The time is correctly synchronized with an NTP server; calling date
as root on both nodes returns the same date.
- Munge uid/gid are correct:
root@node45:/var/log/slurm# ls -l /etc/munge/
-r-------- 1 munge munge 1024 janv. 27 18:49 munge.key
- We can encrypt/decrypt successfully:
root@slurmaster:~$ munge -n | ssh node45 unmunge
STATUS: Success (0)
ENCODE_HOST: node45 (127.0.1.1)
ENCODE_TIME: 2022-02-01 10:22:21 +0100 (1643707341)
DECODE_TIME: 2022-02-01 10:22:23 +0100 (1643707343)
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
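
The two epoch values in that output give the delay between encoding on one host and decoding on the other, which bounds the clock offset between them. Below is a small sketch for extracting it, assuming the field layout shown above; the function name is hypothetical.

```python
import re

# Captures the epoch seconds in parentheses at the end of the two time fields.
EPOCH_FIELD = re.compile(r"^(ENCODE_TIME|DECODE_TIME):.*\((\d+)\)\s*$")


def credential_delay(unmunge_output):
    """Return DECODE_TIME minus ENCODE_TIME, in seconds, from unmunge's text output."""
    epochs = {}
    for line in unmunge_output.splitlines():
        match = EPOCH_FIELD.match(line.strip())
        if match:
            epochs[match.group(1)] = int(match.group(2))
    return epochs["DECODE_TIME"] - epochs["ENCODE_TIME"]
```

For the output above this gives 1643707343 - 1643707341 = 2 seconds, which includes the ssh round trip and is far below the 300-second credential lifetime that, as far as I know, munge uses by default.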
Do you have any idea how to debug, and hopefully solve, this issue?
Thank you!