[slurm-users] 2 nodes being randomly set to "not responding"

Wed Jul 21 20:30:43 UTC 2021

Hi all,

We have a single Slurm cluster that spans several architectures and
compute clusters, all talking to one slurmctld. The slurmctld host is
dual-homed on two different networks. Two nodes sit by themselves on
"network 2", while all of the other nodes are on "network 1". Those two
nodes stay online for a short period of time, but then slurmctld marks them
as down and not responding. 10 to 20 minutes later they come back online,
and the cycle repeats. There are no firewalls involved anywhere in the
network.

I found a mailing list post from 2018 in which someone mentioned that the
slurmd daemons all expect to be able to talk to each other, and that
segmenting some nodes off from others can cause this flapping behavior. He
suggested setting TreeWidth to 1 to force each slurmd to communicate only
with slurmctld directly. I gave that a shot, and unfortunately it seemed to
make all of the other nodes unreachable! :-)
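
To make the concern concrete, here is a toy model of tree-style message
fan-out and of which forwarding hops would have to cross between the two
networks. It is only a sketch under my own assumptions, not Slurm's actual
forwarding code: the node names, the splitting rule, and the helper names
(fanout_edges, cross_network_edges) are all made up for illustration.

```python
def fanout_edges(sender, nodes, width):
    """Return (forwarder, receiver) pairs for a simplified tree fan-out.

    Assumption: the sender splits `nodes` into at most `width` contiguous
    groups, contacts the first node of each group, and that node forwards
    to the rest of its group in the same way.
    """
    edges = []
    if not nodes:
        return edges
    size = -(-len(nodes) // width)  # ceiling division: nodes per group
    for i in range(0, len(nodes), size):
        head, rest = nodes[i], nodes[i + 1:i + size]
        edges.append((sender, head))
        edges.extend(fanout_edges(head, rest, width))
    return edges


def cross_network_edges(edges, network_of):
    """Return the hops between compute nodes on different networks.

    Hops that start at the controller are skipped, because the controller
    is dual-homed and can reach both networks.
    """
    return [
        (src, dst)
        for src, dst in edges
        if src != "slurmctld" and network_of[src] != network_of[dst]
    ]


# Hypothetical layout: six nodes on network 1, two nodes on network 2.
nodes = ["n1", "n2", "n3", "n4", "n5", "n6", "m1", "m2"]
network_of = {name: (2 if name.startswith("m") else 1) for name in nodes}

if __name__ == "__main__":
    for width in (2, len(nodes)):
        edges = fanout_edges("slurmctld", nodes, width)
        print(width, cross_network_edges(edges, network_of))
```

In this model, any hop where a network 1 node is asked to forward to a
network 2 node (or the reverse) is one that cannot succeed in our setup,
which would match the flapping we see. Whether the real forwarding tree is
built this way is exactly the part I am unsure about.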

Is there a way to configure our setup so that we can have a properly
dual-homed slurmctld without requiring every node to be reachable by every
other node?