[slurm-users] Nodes not responding... how does slurm track it?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Wed May 15 12:33:33 UTC 2019
On 15-05-2019 09:34, Barbara Krašovec wrote:
> It could be a problem with the ARP cache.
>
> If the number of devices on the network approaches 512, the kernel's default
> limits on the dynamic ARP cache come into play, and the resulting garbage
> collection can cause a loss of connectivity between nodes.
This is something every cluster owner should be aware of, and which may
not be widely known. There are some notes about this in my Slurm Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
/Ole
> The garbage collector will not run if the number of entries in the cache is
> less than 128, by default:
>
> *gc_thresh1*
> The minimum number of entries to keep in the ARP cache. The garbage
> collector will not run if there are fewer than this number of entries in
> the cache. Defaults to 128.
> *gc_thresh2*
> The soft maximum number of entries to keep in the ARP cache. The garbage
> collector will allow the number of entries to exceed this for 5 seconds
> before collection will be performed. Defaults to 512.
> *gc_thresh3*
> The hard maximum number of entries to keep in the ARP cache. The garbage
> collector will always run if there are more than this number of entries
> in the cache. Defaults to 1024.
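>
> A quick way to see the thresholds currently in effect on a node is to query
> them with sysctl (a sketch; the same keys exist under net.ipv6 for IPv6):
>
> sysctl net.ipv4.neigh.default.gc_thresh1
> sysctl net.ipv4.neigh.default.gc_thresh2
> sysctl net.ipv4.neigh.default.gc_thresh3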
>
> You can check the current number of entries with:
>
> cat /proc/net/arp | wc -l
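>
> Note that /proc/net/arp includes one header line in its output. A sketch of
> an equivalent check with the standard iproute2 tool, which prints no header:
>
> ip -4 neigh show | wc -l
> ip -6 neigh show | wc -l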
>
> If this is the case, increase the ARP cache garbage collection parameters:
>
> # Run the garbage collector less often (the default interval is 30 seconds)
> sysctl -w net.ipv4.neigh.default.gc_interval=3600
> sysctl -w net.ipv6.neigh.default.gc_interval=3600
>
> # Keep ARP cache entries longer before they are considered stale
> sysctl -w net.ipv4.neigh.default.gc_stale_time=3600
> sysctl -w net.ipv6.neigh.default.gc_stale_time=3600
>
> # Raise the garbage collection thresholds
> sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
> sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
> sysctl -w net.ipv4.neigh.default.gc_thresh3=8192
>
> sysctl -w net.ipv6.neigh.default.gc_thresh1=2048
> sysctl -w net.ipv6.neigh.default.gc_thresh2=4096
> sysctl -w net.ipv6.neigh.default.gc_thresh3=8192
>
> Or just put them in /etc/sysctl.conf to make them persistent across reboots.
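>
> For example, a sketch of an equivalent drop-in file with the threshold values
> above (the file name is only an example; apply it with "sysctl --system" or
> at the next reboot):
>
> # /etc/sysctl.d/99-arp-cache.conf
> net.ipv4.neigh.default.gc_thresh1 = 2048
> net.ipv4.neigh.default.gc_thresh2 = 4096
> net.ipv4.neigh.default.gc_thresh3 = 8192
> net.ipv6.neigh.default.gc_thresh1 = 2048
> net.ipv6.neigh.default.gc_thresh2 = 4096
> net.ipv6.neigh.default.gc_thresh3 = 8192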
>
> Cheers,
>
> Barbara
>
> On 5/15/19 5:02 AM, Bill Broadley wrote:
>> My latest addition to a cluster results in the same group of nodes periodically getting listed as
>> "not-responding" and usually (but not always) recovering.
>>
>> I increased logging up to debug3 and see messages like:
>> [2019-05-14T17:09:25.247] debug: Spawning ping agent for
>> bigmem[1-9],bm[1,7,9-13],c6-[66-77,87-88,92-97],c8-[62-63],c9-[65,74-75],c10-[18,66-74,87-97],c11-[71-77,86-87,89-93,95-96]
>>
>> And more problematic:
>> [2019-05-14T17:09:26.248] error: Nodes bm13,c6-[66-77,87-88,92-97] not responding
>>
>> Out of 200 nodes, it's almost always those 20. Forward DNS (on both name servers), reverse DNS (on
>> both name servers), netmask, and /etc/hosts seem fine. It's possible the network hardware has a
>> problem, but some nodes on the same switch always work and some don't. I guess it could be an ARP
>> table overflow or something similar.
>>
>> Despite a fair bit of testing, I've not been able to get any DNS lookup or connection request
>> between the slurm controller and compute nodes, or compute to compute, to fail.
>>
>> From googling and searching the ticket system, it seems like slurm builds a tree and then asks nodes
>> to check on the status of other nodes. From what I can tell, if a node is listed as healthy but can't
>> contact 20 other nodes, those 20 nodes are listed as not responding. If that happens for longer
>> than the timeout, then all of those nodes go down.
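>>
>> (As far as I understand, the fanout of that tree and the not-responding
>> timeout come from the TreeWidth and SlurmdTimeout parameters in slurm.conf;
>> a quick sketch of how to check the values in effect:)
>>
>> scontrol show config | grep -E 'TreeWidth|SlurmdTimeout'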
>>
>> So my questions:
>> 1) can I run this "spawning ping agent" myself to help debug it?
>> 2) can I get puppet to print out this tree so I can figure out which node
>> cannot contact the nodes that are being listed as down?
>> 3) is there any other way to track down which node tried and failed to
>> contact the not-responding nodes?
>>