[slurm-users] Compute nodes cycling from idle to down on a regular basis ?

Jeremy Fix Jeremy.Fix at centralesupelec.fr
Tue Feb 1 19:28:50 UTC 2022


Brian, Bjorn, thank you for your answers.

- From every compute node, I checked that I could nslookup the other compute nodes as well as the Slurm master by hostname; that worked.
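Concretely, I ran something along these lines on each compute node (the node names below are just examples):

# check forward resolution of a few peer nodes and of the Slurm master
for h in kyle01 kyle46 slurmaster; do nslookup "$h"; done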

In the meantime, we identified other issues. Fixing them apparently solved the problem for part of the nodes (kyle[46-68]) but not for the others (kyle[01-45]):

1) We are migrating from a previous Slurm master to a new one and ... the old one still had its slurmctld running with the nodes listed. I think that explains the munge credential traces; they were almost certainly coming from the old master.
2) We had two network interfaces on the compute nodes, and it appears that DHCP requests were flip-flopping the IP address between the two interfaces. I'm not sure, but this unusual behaviour may have confused the Slurm master; we simply deactivated one of the two interfaces to prevent it from happening (see the sketch just below).
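As an extra safeguard against the address flip-flopping, one thing we are considering is pinning the node addresses in slurm.conf so that slurmctld does not depend on DNS/DHCP at all. A minimal sketch, with a made-up address and CPU count:

# slurm.conf: give the controller a fixed address for the node
NodeName=kyle01 NodeAddr=192.168.1.101 NodeHostname=kyle01 CPUs=32 State=UNKNOWN

With NodeAddr set explicitly, the controller contacts that IP directly instead of resolving the node name at run time.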

Unfortunately, even after fixing this (and restarting slurmctld and slurmd, and rebooting the compute nodes), we still have issues on 45 compute nodes, while the 20 others are now fine. The difference I notice in the slurmd logs on the compute nodes is:

- For the nodes still cycling through idle* -> drain, the last entry in the log is:

[2022-02-01T18:45:25.437] debug2: Finish processing RPC: REQUEST_NODE_REGISTRATION_STATUS

- For the nodes that now stay idle, the last entries in the log are:

[2022-02-01T18:45:25.477] debug2: Finish processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2022-02-01T19:18:45.835] debug3: in the service_connection
[2022-02-01T19:18:45.837] debug2: Start processing RPC: REQUEST_PING
[2022-02-01T19:18:45.837] debug2: Finish processing RPC: REQUEST_PING


So, the REQUEST_PING RPC never arrives on the draining nodes. On the Slurm master I see, for all of the drained nodes, a bunch of "RPC:REQUEST_PING : Can't find an address, check slurm.conf", followed by "Nodes kyle[01-45] not responding" and "error: Nodes kyle[01-45] not responding, setting DOWN".
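In case it helps, this is roughly what I am checking on the master for one of the affected nodes (kyle01 is just an example name):

# address and hostname the controller currently holds for the node
scontrol show node kyle01 | grep -E 'NodeAddr|NodeHostName'
# name resolution as seen from the master at this very moment
getent hosts kyle01

which should show whether the controller's idea of the address matches what the resolver returns.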

Sometimes they come back to life; in the Slurm master logs I then see, for instance:

[2022-02-01T19:52:06.941] Node kyle47 now responding
[2022-02-01T19:52:06.941] Node kyle46 now responding

Is there a timeout for waiting for a node to respond that might be too short? I actually do not see why these nodes would not be responding.
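For reference, I am dumping the controller-side timeouts with:

# timeout-related settings slurmctld is actually running with
scontrol show config | grep -i timeout

mainly to compare SlurmdTimeout and MessageTimeout against how long the pings seem to take, but I am not sure which of these, if any, is the culprit.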

Thank you for your help,

Jeremy.


>That looks like a DNS issue.
>
>Verify all your nodes are able to resolve the names of each other.

>Check /etc/resolv.conf, /etc/hosts and /etc/slurm/slurm.conf on the 
>nodes (including head/login nodes) to ensure they all match.

>Brian Andrus

On 2/1/2022 1:37 AM, Jeremy Fix wrote:

>Hello everyone,
>
>we are facing a weird issue. On a regular basis, some compute nodes go from *idle* -> *idle** -> *down* and loop back to idle on its own;
>The slurm manages several nodes and this state cycle appears only for some pools of nodes.
>
>We get a trace on the compute node as :
>
>[2022-02-01T09:41:11.381] error: Munge decode failed: Invalid credential
>[2022-02-01T09:41:11.381] ENCODED: Thu Jan 01 01:00:00 1970
>[2022-02-01T09:41:11.381] DECODED: Thu Jan 01 01:00:00 1970
>[2022-02-01T09:41:11.381] error: slurm_receive_msg_and_forward: REQUEST_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
>[2022-02-01T09:41:11.381] error: slurm_receive_msg_and_forward: Protocol authentication error
>[2022-02-01T09:41:11.391] error: service_connection: slurm_receive_msg: Protocol authentication error
>[2022-02-01T09:41:11.392] debug2: Finish processing RPC: RESPONSE_FORWARD_FAILED
>
>On the master, the only thing we get is , sometimes :
>
>- slurmctld.log:[2022-02-01T10:00:04.456] agent/is_node_resp: node:node45 RPC:REQUEST_PING : Can't find an address, check slurm.conf
>
>On the slurm master, the ips are not specified in /etc/hosts but by /etc/resolv.conf ; One hypothesis we have is that maybe our DNS server is taking some times, sometimes, to respond.
>
>This happens on a very regular basis, exactly every 1h07 and for some nodes every 3 minutes.
>
>We thought this might be due to munge but :
>
>- We tried to resync the munge keys.
>
>- The time is correctly synchronized with a ntp server ; calling date as root on both nodes return the same date
>
>- Munge uid/gid are correct :
>
>root at node45:/var/log/slurm# ls -l /etc/munge/
>-r-------- 1 munge munge 1024 janv. 27 18:49 munge.key
>
>- We can encrypt/decrypt successfully ;
>
>root at slurmaster:~$ munge -n | ssh node45 unmunge
>
>STATUS:           Success (0)
>ENCODE_HOST:      node45 (127.0.1.1)
>ENCODE_TIME:      2022-02-01 10:22:21 +0100 (1643707341)
>DECODE_TIME:      2022-02-01 10:22:23 +0100 (1643707343)
>TTL:              300
>CIPHER:           aes128 (4)
>MAC:              sha256 (5)
>ZIP:              none (0)
>UID:              .....
>GID:              ......
>LENGTH:           0
>
>Do you have any idea on how to debug and hopefully solve that issue ?
>
>Thank you !
>
>Jeremy


