[slurm-users] Nodes not responding... how does slurm track it?

Wed May 15 08:54:36 UTC 2019

On 5/15/19 12:34 AM, Barbara Krašovec wrote:
> It could be a problem with ARP cache.
> 
> If the number of devices approaches 512, there is a kernel limitation in dynamic
> ARP-cache size and it can result in the loss of connectivity between nodes.

We have 162 compute nodes, a dozen or so file servers, head node, transfer node,
and not much else.  Despite significant tinkering I never got a DNS lookup
(forward or reverse), ping, nslookup, dig, ssh, or telnet to the
slurmd/slurmctld ports to fail.

All the while, /var/log/syslog kept complaining that DNS lookups were failing
for the same 20 nodes.

From what I can tell, slurm builds a tree of hosts to check, and each compute
node in that tree checks on 20 or so hosts.  slurmd appears to cache the DNS
results (which is good), but it also seems to cache the DNS non-results.  So
even while I'm logged into a node verifying that both DNS servers can look up
all the down hosts forward and backward, syslog keeps complaining about
failures in DNS lookups.

What's worse is that this still caused problems even when that node was put in
drain mode.  So all 20+ hosts (of 160) would bounce between online (alloc/idle)
and offline (alloc*/idle*).  If a host got unlucky and failed a few checks in a
row, it would time out, be marked down, and have all its jobs killed.

This is with slurm 18.08.7 that I compiled for Ubuntu LTS 18.04.

> The garbage collector will run if the number of entries in the cache is less
> than 128, by default:

I checked the problematic host (the one that frequently complained that 20 hosts
had no DNS) and it had 116 arp entries.
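
For anyone who wants to repeat the ARP check, here is a rough sketch that counts
the rows in /proc/net/arp and prints the kernel's gc_thresh values next to the
count.  The function names are mine, and it counts every row (including
incomplete entries), so treat the number as an upper bound.

```python
from pathlib import Path


def count_arp_entries(arp_table_text: str) -> int:
    """Count rows in text formatted like /proc/net/arp (first line is a header)."""
    lines = [ln for ln in arp_table_text.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


def read_gc_thresholds(base: Path = Path("/proc/sys/net/ipv4/neigh/default")) -> dict:
    """Read gc_thresh1..3 if present; returns an empty dict on non-Linux hosts."""
    thresholds = {}
    for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3"):
        path = base / name
        if path.exists():
            thresholds[name] = int(path.read_text().strip())
    return thresholds


if __name__ == "__main__":
    arp_path = Path("/proc/net/arp")
    if arp_path.exists():
        print("ARP entries:", count_arp_entries(arp_path.read_text()))
        print("Thresholds:", read_gc_thresholds())
```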

[ snipped much useful sysctl info ]

> Or just insert in /etc/sysctl.con

Many thanks, useful stuff that I'll keep in my notes.  In this case though I
think the slurm "tree" is improperly caching the absence of DNS records.

I checked for a single host and:
bigmem1# cat /var/log/syslog| grep c6-66 |grep "May 14"| wc -l
51
root@bigmem1:/var/log/slurm-llnl# cat /var/log/syslog| grep c6-66 |grep "May 14"| tail -1
May 14 23:30:22 bigmem1 slurmd[46951]: error: forward_thread: can't find address for host c6-66, check slurm.conf
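
To see all affected hosts at once rather than grepping for one, something like
the following sketch could tally the forward_thread errors per host.  The
regular expression is based only on the single log line shown above, so it is
an assumption that other slurmd versions format the message the same way.  Note
that it counts only this specific error, while the grep above counts every line
mentioning c6-66.

```python
import re
from collections import Counter

# Pattern derived from the one syslog line quoted above (an assumption for
# other slurm versions).
LOOKUP_ERROR = re.compile(
    r"slurmd\[\d+\]: error: forward_thread: can't find address for host (\S+?),"
)


def count_lookup_failures(lines, day_prefix=None):
    """Return a Counter of host name -> number of lookup-failure log lines.

    day_prefix, e.g. "May 14", restricts the count to lines starting with it.
    """
    counts = Counter()
    for line in lines:
        if day_prefix and not line.startswith(day_prefix):
            continue
        match = LOOKUP_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "May 14 23:30:22 bigmem1 slurmd[46951]: error: forward_thread: "
        "can't find address for host c6-66, check slurm.conf",
    ]
    for host, n in count_lookup_failures(sample, "May 14").most_common():
        print(host, n)
```

On a real node you would pass the open /var/log/syslog file object instead of
the sample list.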

So despite /etc/resolv.conf pointing directly to two name servers that could
look up c6-66 -> 10.17.6.66 and 10.17.6.66 -> c6-66, bigmem1 kept telling the
slurm controller that c6-66 didn't exist.  During that time I could ssh, telnet,
dig, and nslookup c6-66 from bigmem1.
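
The manual forward and reverse checks can be scripted across a list of hosts.
Below is a minimal sketch; check_host is a name I made up, and it goes through
the system resolver (so /etc/hosts and nsswitch count too), whereas dig and
nslookup query the DNS servers directly.  A fresh process passing this check
says nothing about what a long-running daemon has cached.

```python
import socket


def check_host(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name -> address -> name and report whether the round trip matches.

    forward/reverse are injectable so the function can be exercised offline.
    """
    result = {"name": name, "address": None, "reverse_name": None, "ok": False}
    try:
        address = forward(name)
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return result
    result["address"] = address
    try:
        reverse_name = reverse(address)[0]
    except OSError:
        return result
    result["reverse_name"] = reverse_name
    # Compare short names only, since the reverse lookup may return an FQDN.
    short = lambda host: host.split(".")[0].lower()
    result["ok"] = short(reverse_name) == short(name)
    return result


if __name__ == "__main__":
    for host in ["localhost"]:  # replace with the node names from slurm.conf
        print(check_host(host))
```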

I suspect bigmem1 was assigned its part of the slurm node check tree last
Wednesday when we provisioned those nodes.  The entries might well have been put
into slurm before they were put into DNS (managed by cobbler).  bigmem1 would
then have cached those negative records on Wednesday and kept informing the
slurm controller that the hosts didn't exist.
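
To make the suspicion concrete, here is a toy model of the behaviour I think I
was seeing.  This is NOT slurm's code; LookupCache and its parameters are
invented for illustration.  It shows how a cache that stores failed lookups
with no expiry keeps answering "no such host" after the record appears, and how
a negative TTL would let it recover without a restart.

```python
import time


class LookupCache:
    """Toy name cache (hypothetical, not slurm's implementation).

    resolve(name) returns an address string, or None when the lookup fails.
    negative_ttl=None means failed lookups are remembered forever.
    """

    def __init__(self, resolve, negative_ttl=None, clock=time.monotonic):
        self._resolve = resolve
        self._negative_ttl = negative_ttl
        self._clock = clock
        self._entries = {}  # name -> (address or None, time stored)

    def lookup(self, name):
        now = self._clock()
        if name in self._entries:
            address, stored_at = self._entries[name]
            if address is not None:
                return address  # positive results are always reused
            if self._negative_ttl is None or now - stored_at < self._negative_ttl:
                return None  # cached miss is still considered valid
        address = self._resolve(name)
        self._entries[name] = (address, now)
        return address


if __name__ == "__main__":
    dns = {}  # stands in for the real name servers
    cache = LookupCache(dns.get)  # no negative TTL
    print(cache.lookup("c6-66"))  # host not in DNS yet
    dns["c6-66"] = "10.17.6.66"  # record added later
    print(cache.lookup("c6-66"))  # the stale miss is still served
```

In this model only discarding the cache (the equivalent of restarting the
process or rebooting) or setting negative_ttl clears the stale miss.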

A reboot of bigmem1 fixed the problem.