[slurm-users] Nodes not responding... how does slurm track it?

Wed May 15 10:42:42 UTC 2019

Hi;

Do not think of "the number of devices" as "the number of servers". Any 
device that has a MAC address and is connected to one of your nodes' local 
networks counts as a device. For example, if your BMC ports 
(iLO, iDRAC, etc.) are connected to one of the networks of your nodes, 
that doubles the number of devices.
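
As a rough way to see how many devices a node actually talks to, you can 
count the distinct MAC addresses in its neighbour table and compare that 
with the kernel thresholds. This is only a sketch: count_macs is a helper 
name made up for this mail, and it assumes Linux with iproute2, where 
"ip neigh show" prints the MAC address after the word "lladdr".

```shell
# Count distinct MAC addresses in "ip neigh show" style output read from stdin.
# Entries without an lladdr field (e.g. FAILED or INCOMPLETE) are ignored.
count_macs() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "lladdr") print $(i + 1) }' \
        | sort -u | wc -l | tr -d ' '
}

# On a live node (assumed usage, not run here):
#   ip -4 neigh show | count_macs
#   sysctl net.ipv4.neigh.default.gc_thresh1 \
#          net.ipv4.neigh.default.gc_thresh2 \
#          net.ipv4.neigh.default.gc_thresh3
```

If the count is anywhere near gc_thresh2 or gc_thresh3, the ARP cache is 
worth a closer look; if it is far below gc_thresh1, it probably is not the 
cause.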

To test the ARP issue, you can keep pinging from the slurmctld server to 
the problematic node, and from the problematic node to the slurmctld 
server. Continuous pinging will keep the node in the ARP table.
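
A minimal sketch of such a ping test is below. The names watch_peer, PING 
and INTERVAL are invented for this example, and it assumes the iputils 
ping found on Linux (-c for the count, -W for the reply timeout in 
seconds). Run it on the slurmctld server against the node, and on the node 
against the server.

```shell
# Ping a peer once per interval, log a timestamped line to stderr on every
# missed reply, and print the total number of misses at the end.
# PING can be overridden, so the loop can be tried without network access.
PING=${PING:-ping}

watch_peer() {
    peer=$1      # host name or address to ping
    rounds=$2    # how many pings to send
    failures=0
    i=0
    while [ "$i" -lt "$rounds" ]; do
        if ! "$PING" -c 1 -W 1 "$peer" >/dev/null 2>&1; then
            failures=$((failures + 1))
            echo "$(date '+%F %T') no reply from $peer" >&2
        fi
        i=$((i + 1))
        sleep "${INTERVAL:-1}"
    done
    echo "$failures"
}

# Example (assumed host name): watch_peer c6-66 600
```

If the node stops flapping between idle and idle* while this runs, and 
starts again after you stop it, the ARP cache is a likely suspect.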

Ahmet M.

On 15.05.2019 11:54, Bill Broadley wrote:
> On 5/15/19 12:34 AM, Barbara Krašovec wrote:
>> It could be a problem with ARP cache.
>>
>> If the number of devices approaches 512, there is a kernel limitation in dynamic
>> ARP-cache size and it can result in the loss of connectivity between nodes.
> We have 162 compute nodes, a dozen or so file servers, head node, transfer node,
> and not much else.  Despite significant tinkering I never got a DNS lookup
> (forward or reverse), ping, nslookup, dig, ssh, or telnet to the
> slurmd/slurmctld ports to fail.
>
> All while /var/log/syslog was complaining that DNS wasn't working for the same
> 20 nodes.
>
> From what I can tell slurm builds a tree of hosts to check, and a compute node
> checks 20 or so hosts.  From what I can tell slurmd caches the DNS results
> (which is good), but also caches the DNS non-results.  So even while I'm logged
> into a node verifying that both DNS servers can look up all the down hosts
> forward and backward, syslog is still complaining often about failures in DNS
> lookups.
>
> What's worse is this still caused problems even when that node was put in drain
> mode.  So all 20+ hosts (of 160) would bounce between online (alloc/idle) and
> offline (alloc*/idle*).  If a host got unlucky and had a few failures in a row,
> it would time out, be marked down, and have all its jobs killed.
>
> This is with slurm 18.08.7 that I compiled for Ubuntu LTS 18.04.
>
>> The garbage collector will not run if the number of entries in the cache is
>> less than 128, by default:
> I checked the problematic host (the one that frequently complained that 20 hosts
> had no DNS) and it had 116 arp entries.
>
> [ snipped much useful sysctl info ]
>
>> Or just insert in /etc/sysctl.con
> Many thanks, useful stuff that I'll keep in my notes.  In this case though I
> think the slurm "tree" is improperly caching the absence of DNS records.
>
> I checked for a single host and:
> bigmem1# cat /var/log/syslog| grep c6-66 |grep "May 14"| wc -l
> 51
> root@bigmem1:/var/log/slurm-llnl# cat /var/log/syslog| grep c6-66 |grep "May 14"| tail -1
> May 14 23:30:22 bigmem1 slurmd[46951]: error: forward_thread: can't find address for host c6-66, check slurm.conf
>
> So despite having /etc/resolv.conf point directly to two name servers that could
> look up c6-66 -> 10.17.6.66 or 10.17.6.66 -> c6-66, it kept telling the slurm
> controller that c6-66 didn't exist.  During that time bigmem1 could ssh, telnet,
> dig, and nslookup to c6-66.
>
> I suspect bigmem1 was assigned the slurm node check tree last Wednesday when we
> provisioned those nodes.  The entries might well have been put into slurm before
> they were put into DNS (managed by cobbler).  bigmem1 then cached those negative
> records from Wednesday on and kept informing the slurm controller that the
> hosts didn't exist.
>
> A reboot of bigmem1 fixed the problem.
>
>