[slurm-users] Nodes not responding... how does slurm track it?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Wed May 15 12:33:33 UTC 2019
On 15-05-2019 09:34, Barbara Krašovec wrote:
> It could be a problem with the ARP cache.
>
> If the number of devices on the network approaches 512, the kernel's default
> limits on the dynamic ARP cache come into play, and the resulting garbage
> collection can cause a loss of connectivity between nodes.
This is something every cluster owner should be aware of, and which may
not be widely known. There are some notes about this in my Slurm Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
/Ole
> The garbage collector will not run if the number of entries in the cache is
> less than 128, by default:
>
> *gc_thresh1*
> The minimum number of entries to keep in the ARP cache. The garbage
> collector will not run if there are fewer than this number of entries in
> the cache. Defaults to 128.
> *gc_thresh2*
> The soft maximum number of entries to keep in the ARP cache. The garbage
> collector will allow the number of entries to exceed this for 5 seconds
> before collection will be performed. Defaults to 512.
> *gc_thresh3*
> The hard maximum number of entries to keep in the ARP cache. The garbage
> collector will always run if there are more than this number of entries
> in the cache. Defaults to 1024.
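>
> A quick way to see the thresholds currently in effect on a node is to query
> them with sysctl (a sketch; the same keys exist under net.ipv6 for IPv6):
>
> sysctl net.ipv4.neigh.default.gc_thresh1
> sysctl net.ipv4.neigh.default.gc_thresh2
> sysctl net.ipv4.neigh.default.gc_thresh3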
>
> You can check the current number of entries with:
>
> cat /proc/net/arp | wc -l
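>
> Note that /proc/net/arp includes one header line in its output. A sketch of
> an equivalent check with the standard iproute2 tool, which prints no header:
>
> ip -4 neigh show | wc -l
> ip -6 neigh show | wc -l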
>
> If this is the case, increase the ARP cache garbage collection parameters:
>
> # Run the garbage collector less often (the default interval is 30 seconds)
> sysctl -w net.ipv4.neigh.default.gc_interval=3600
> sysctl -w net.ipv6.neigh.default.gc_interval=3600
>
> # Keep ARP cache entries longer before they are considered stale
> sysctl -w net.ipv4.neigh.default.gc_stale_time=3600
> sysctl -w net.ipv6.neigh.default.gc_stale_time=3600
>
> # Raise the garbage collection thresholds
> sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
> sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
> sysctl -w net.ipv4.neigh.default.gc_thresh3=8192
>
> sysctl -w net.ipv6.neigh.default.gc_thresh1=2048
> sysctl -w net.ipv6.neigh.default.gc_thresh2=4096
> sysctl -w net.ipv6.neigh.default.gc_thresh3=8192
>
> Or just put them in /etc/sysctl.conf to make them persistent across reboots.
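>
> For example, a sketch of an equivalent drop-in file with the threshold values
> above (the file name is only an example; apply it with "sysctl --system" or
> at the next reboot):
>
> # /etc/sysctl.d/99-arp-cache.conf
> net.ipv4.neigh.default.gc_thresh1 = 2048
> net.ipv4.neigh.default.gc_thresh2 = 4096
> net.ipv4.neigh.default.gc_thresh3 = 8192
> net.ipv6.neigh.default.gc_thresh1 = 2048
> net.ipv6.neigh.default.gc_thresh2 = 4096
> net.ipv6.neigh.default.gc_thresh3 = 8192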
>
> Cheers,
>
> Barbara
>
> On 5/15/19 5:02 AM, Bill Broadley wrote:
>> My latest addition to a cluster results in the same group of nodes periodically getting listed as
>> "not-responding" and usually (but not always) recovering.
>>
>> I increased logging up to debug3 and see messages like:
>> [2019-05-14T17:09:25.247] debug: Spawning ping agent for
>> bigmem[1-9],bm[1,7,9-13],c6-[66-77,87-88,92-97],c8-[62-63],c9-[65,74-75],c10-[18,66-74,87-97],c11-[71-77,86-87,89-93,95-96]
>>
>> And more problematic:
>> [2019-05-14T17:09:26.248] error: Nodes bm13,c6-[66-77,87-88,92-97] not responding
>>
>> Out of 200 nodes, it's almost always those 20. Forward DNS (on both name servers), reverse DNS (on
>> both name servers), netmask, and /etc/hosts seem fine. It's possible the network hardware has a
>> problem, but some nodes on the same switch always work and some don't. I guess it could be an ARP
>> table overflow or something similar.
>>
>> Despite a fair bit of testing, I've not been able to get any DNS lookup or connection request
>> between the slurm controller and compute nodes, or compute to compute, to fail.
>>
>> From googling and searching the ticket system, it seems like slurm builds a tree and then asks nodes
>> to check on the status of other nodes. From what I can tell, if a node is listed as healthy but can't
>> contact 20 other nodes, those 20 nodes are listed as not responding. If that happens for longer
>> than the timeout, then all of those nodes go down.
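>>
>> (As far as I understand, the fanout of that tree and the not-responding
>> timeout come from the TreeWidth and SlurmdTimeout parameters in slurm.conf;
>> a quick sketch of how to check the values in effect:)
>>
>> scontrol show config | grep -E 'TreeWidth|SlurmdTimeout'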
>>
>> So my questions:
>> 1) can I run this "spawning ping agent" myself to help debug it?
>> 2) can I get puppet to print out this tree so I can figure out which node
>> cannot contact the nodes that are being listed as down?
>> 3) is there any other way to track down which node tried and failed to
>> contact the not-responding nodes?
>>