[slurm-users] Compute nodes cycling from idle to down on a regular basis?
Brian Andrus
toomuchit at gmail.com
Tue Feb 1 15:17:01 UTC 2022
That looks like a DNS issue.
Verify all your nodes are able to resolve the names of each other.
Check /etc/resolv.conf, /etc/hosts and /etc/slurm/slurm.conf on the
nodes (including head/login nodes) to ensure they all match.
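
A minimal sketch of such a check, to be run on each node (head/login nodes included) so the results can be compared. The names check_nodes and resolve are hypothetical helpers, not part of Slurm, and the node names are placeholders. Flagging loopback results is an assumption about what is worth looking at, prompted by the 127.0.1.1 address in the unmunge output quoted below.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def resolve(name: str) -> Optional[str]:
    """Return the first IPv4 address for name, or None if resolution fails."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def check_nodes(
    names: Iterable[str],
    resolver: Callable[[str], Optional[str]] = resolve,
    expected: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Map each problematic node name to a description; empty dict means no problem found."""
    problems: Dict[str, str] = {}
    for name in names:
        addr = resolver(name)
        if addr is None:
            problems[name] = "does not resolve"
        elif addr.startswith("127."):
            # Assumption: a compute node name should not map to a loopback address.
            problems[name] = f"resolves to loopback address {addr}"
        elif expected and name in expected and expected[name] != addr:
            problems[name] = f"resolves to {addr}, expected {expected[name]}"
    return problems
```

For example, check_nodes(["node45", "node46"]) returns an empty dict when both names resolve to non-loopback addresses. Passing the addresses you expect as the expected argument makes it easier to spot nodes whose /etc/hosts or DNS view differs from the others.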
Brian Andrus
On 2/1/2022 1:37 AM, Jeremy Fix wrote:
> Hello everyone,
>
> We are facing a weird issue. On a regular basis, some compute nodes go
> from *idle* -> *idle** -> *down* and then loop back to idle on their own.
> Slurm manages several nodes, and this state cycle appears only for
> some pools of nodes.
>
> We get the following trace on the compute node:
>
> [2022-02-01T09:41:11.381] error: Munge decode failed: Invalid credential
> [2022-02-01T09:41:11.381] ENCODED: Thu Jan 01 01:00:00 1970
> [2022-02-01T09:41:11.381] DECODED: Thu Jan 01 01:00:00 1970
> [2022-02-01T09:41:11.381] error: slurm_receive_msg_and_forward: REQUEST_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
> [2022-02-01T09:41:11.381] error: slurm_receive_msg_and_forward: Protocol authentication error
> [2022-02-01T09:41:11.391] error: service_connection: slurm_receive_msg: Protocol authentication error
> [2022-02-01T09:41:11.392] debug2: Finish processing RPC: RESPONSE_FORWARD_FAILED
>
> On the master, the only thing we get, and only sometimes, is:
>
> - slurmctld.log:[2022-02-01T10:00:04.456] agent/is_node_resp: node:node45 RPC:REQUEST_PING : Can't find an address, check slurm.conf
>
> On the Slurm master, the node IPs are not listed in /etc/hosts; they are
> resolved through the DNS servers configured in /etc/resolv.conf. One
> hypothesis is that our DNS server is sometimes slow to respond.
>
> This happens on a very regular basis: exactly every 1 h 07 min, and for
> some nodes every 3 minutes.
>
> We thought this might be due to munge, but:
>
> - We tried to resync the munge keys.
>
> - The time is correctly synchronized with an NTP server; calling
> date as root on both nodes returns the same date.
>
> - The munge uid/gid are correct:
>
> root@node45:/var/log/slurm# ls -l /etc/munge/
> -r-------- 1 munge munge 1024 janv. 27 18:49 munge.key
>
> - We can encrypt/decrypt successfully:
>
> root@slurmaster:~$ munge -n | ssh node45 unmunge
>
> STATUS: Success (0)
> ENCODE_HOST: node45 (127.0.1.1)
> ENCODE_TIME: 2022-02-01 10:22:21 +0100 (1643707341)
> DECODE_TIME: 2022-02-01 10:22:23 +0100 (1643707343)
> TTL: 300
> CIPHER: aes128 (4)
> MAC: sha256 (5)
> ZIP: none (0)
> UID: .....
> GID: ......
> LENGTH: 0
>
>
> Do you have any idea how to debug and hopefully solve this issue?
>
> Thank you!
>
> Jeremy
>