[slurm-users] Nodes not responding... how does slurm track it?

Barbara Krašovec barbara.krasovec at ijs.si
Wed May 15 07:34:09 UTC 2019


It could be a problem with the ARP cache.

If the number of devices on the network approaches 512, you can hit the kernel's
limit on the dynamic ARP cache size, which can result in the loss of connectivity
between nodes.

By default, the garbage collector will not run if the number of entries in the
cache is less than 128:

*gc_thresh1*
The minimum number of entries to keep in the ARP cache. The garbage
collector will not run if there are fewer than this number of entries in
the cache. Defaults to 128.
*gc_thresh2*
The soft maximum number of entries to keep in the ARP cache. The garbage
collector will allow the number of entries to exceed this for 5 seconds
before collection will be performed. Defaults to 512.
*gc_thresh3*
The hard maximum number of entries to keep in the ARP cache. The garbage
collector will always run if there are more than this number of entries
in the cache. Defaults to 1024.

You can check the current number of entries with:

cat /proc/net/arp | wc -l

(subtract one for the header line).
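
If the table really is overflowing, the kernel usually also logs it (the exact
wording differs between kernel versions), so something like this is worth a look:

# Look for neighbour-table overflow messages in the kernel log
dmesg | grep -i 'table overflow'

# Watch the entry count over time (ip neigh prints one line per entry)
watch -n 10 'ip -4 neigh show | wc -l'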

If this is the case, increase the ARP cache garbage collection parameters:

# Run the garbage collector less often (default interval is 30 seconds)
sysctl -w net.ipv4.neigh.default.gc_interval=3600
sysctl -w net.ipv6.neigh.default.gc_interval=3600

# Keep entries longer before they are considered stale (default is 60 seconds)
sysctl -w net.ipv4.neigh.default.gc_stale_time=3600
sysctl -w net.ipv6.neigh.default.gc_stale_time=3600

# Raise the garbage collector thresholds
sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
sysctl -w net.ipv4.neigh.default.gc_thresh3=8192

sysctl -w net.ipv6.neigh.default.gc_thresh1=2048
sysctl -w net.ipv6.neigh.default.gc_thresh2=4096
sysctl -w net.ipv6.neigh.default.gc_thresh3=8192

Or make the changes persistent by putting them in /etc/sysctl.conf.
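
For example, a persistent version could look something like this (the drop-in
file name is just an example; /etc/sysctl.conf itself works just as well):

# Example: /etc/sysctl.d/90-arp-cache.conf (hypothetical file name)
net.ipv4.neigh.default.gc_thresh1 = 2048
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192
net.ipv6.neigh.default.gc_thresh1 = 2048
net.ipv6.neigh.default.gc_thresh2 = 4096
net.ipv6.neigh.default.gc_thresh3 = 8192

# Reload without rebooting
sysctl --system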

Cheers,

Barbara

On 5/15/19 5:02 AM, Bill Broadley wrote:
> My latest addition to a cluster results in a group of the same nodes periodically getting listed as
> "not-responding" and usually (but not always) recovering.
>
> I increased logging up to debug3 and see messages like:
> [2019-05-14T17:09:25.247] debug:  Spawning ping agent for
> bigmem[1-9],bm[1,7,9-13],c6-[66-77,87-88,92-97],c8-[62-63],c9-[65,74-75],c10-[18,66-74,87-97],c11-[71-77,86-87,89-93,95-96]
>
> And more problematic:
> [2019-05-14T17:09:26.248] error: Nodes bm13,c6-[66-77,87-88,92-97] not responding
>
> Out of 200 nodes, it's almost always those 20.  Forward DNS (on both name servers), reverse DNS (on
> both name servers), netmask, and /etc/hosts seem fine.  It's possible the network hardware has a
> problem, but some nodes on the same switch always work, some don't.  Guess it could be an arp table
> overflow or similar.
>
> Despite a fair bit of testing, I've not been able to get any DNS lookup or connection request
> between the slurm controller and compute nodes, or compute to compute, to fail.
>
> From googling and searching the ticket system it seems like slurm builds a tree, then asks nodes to
> check on the status of other nodes.  From what I can tell, if a node is listed as healthy but can't
> contact 20 other nodes, those 20 nodes are listed as not-responding.  If that happens for longer
> than the timeout, then all those nodes go down.
>
> So my questions:
> 1) can I run this "spawning ping agent" myself to help debug it?
> 2) can I get puppet to print out this tree so I can figure out which node
>    cannot contact the nodes that are being listed as down?
> 3) is there any other way to track down which node tried and failed to
>    contact the not-responding nodes?
>