[slurm-users] Nodes not responding... how does slurm track it?

Bill Broadley bill at cse.ucdavis.edu
Wed May 15 03:02:16 UTC 2019


My latest addition to the cluster results in the same group of nodes periodically getting listed as
"not-responding" and usually (but not always) recovering.

I increased logging up to debug3 and see messages like:
[2019-05-14T17:09:25.247] debug:  Spawning ping agent for
bigmem[1-9],bm[1,7,9-13],c6-[66-77,87-88,92-97],c8-[62-63],c9-[65,74-75],c10-[18,66-74,87-97],c11-[71-77,86-87,89-93,95-96]

And more problematic:
[2019-05-14T17:09:26.248] error: Nodes bm13,c6-[66-77,87-88,92-97] not responding

Out of 200 nodes, it's almost always the same 20.  Forward DNS (on both name servers), reverse DNS (on
both name servers), netmask, and /etc/hosts all look fine.  It's possible the network hardware has a
problem, but some nodes on the same switch always work and some don't.  My guess is it could be an
ARP table overflow or something similar.
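
A quick way to sanity-check that guess is to compare the kernel's neighbor-table limits against the
number of current entries.  This is just a rough sketch reading the standard Linux sysctl paths,
nothing slurm-specific:

#!/usr/bin/env python3
# Rough check for the ARP-table-overflow theory: compare the number of
# neighbor entries against the kernel's gc_thresh limits.
import glob

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

# Kernel neighbor-table limits (gc_thresh3 is the hard cap).
for path in sorted(glob.glob("/proc/sys/net/ipv4/neigh/default/gc_thresh*")):
    print(path, read_int(path))

# Current number of ARP/neighbor entries (minus the header line).
with open("/proc/net/arp") as f:
    entries = sum(1 for _ in f) - 1
print("current neighbor entries:", entries)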

Despite a fair bit of testing, I've not been able to get any DNS lookup or connection request to fail,
whether between the slurm controller and the compute nodes or compute to compute.
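
For reference, this is roughly the kind of loop I mean.  It's only a sketch: it assumes slurmd is
listening on the default port 6818 (i.e. SlurmdPort unchanged) and takes a flat list of hostnames
rather than slurm's bracketed ranges:

#!/usr/bin/env python3
# Rough connectivity check: resolve each node and try a plain TCP connect to
# the slurmd port.  Assumes the default SlurmdPort of 6818.
import socket
import sys

SLURMD_PORT = 6818

for host in sys.argv[1:]:
    try:
        addr = socket.gethostbyname(host)                  # forward DNS
        with socket.create_connection((addr, SLURMD_PORT), timeout=5):
            print(f"{host} ({addr}): ok")
    except OSError as e:
        print(f"{host}: FAILED ({e})")

Running that from the slurmctld host and again from one of the affected compute nodes (e.g. against
bm13 and the c6 nodes from the log above) has not turned up a single failure so far.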

From googling and searching the ticket system it seems like slurm builds a tree, then asks nodes to
check on the status of other nodes.  From what I can tell, if a node is listed as healthy but can't
contact 20 other nodes, those 20 nodes get listed as not-responding.  If that persists for longer
than the timeout (SlurmdTimeout, I assume), then all of those nodes go down.
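
To spell out my (possibly wrong) understanding of the tree, here it is as code.  This is purely an
illustration of a TreeWidth-style fanout over a made-up list of 200 node names, not slurm's actual
routing code; 50 is the default TreeWidth as I understand it:

#!/usr/bin/env python3
# Illustration only: split a node list into at most TREE_WIDTH blocks, so the
# controller contacts the first node of each block directly and that node
# forwards to (and reports status for) the rest of its block.
TREE_WIDTH = 50  # slurm.conf TreeWidth; 50 is the default as I understand it

def fanout(nodes, width=TREE_WIDTH):
    per_block = -(-len(nodes) // width)  # ceiling division
    blocks = [nodes[i:i + per_block] for i in range(0, len(nodes), per_block)]
    return {block[0]: block[1:] for block in blocks}

# Hypothetical node names, standing in for the real 200 nodes.
nodes = ["node%03d" % i for i in range(200)]
for parent, children in fanout(nodes).items():
    print(parent, "->", children)

If something like that is what slurm does, a problem with a single forwarding node would explain why
a whole block of otherwise-healthy nodes shows up as not-responding together.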

So my questions:
1) can I run this "spawning ping agent" myself to help debug it?
2) can I get slurm to print out this tree so I can figure out which node
   cannot contact the nodes that are being listed as down?
3) is there any other way to track down which node tried and failed to
   contact the not-responding nodes?


