[slurm-users] Debugging communication problems

Gerhard Strangar g.s at arcor.de
Tue Aug 4 17:26:16 UTC 2020


Hi,

I'm experiencing a connectivity problem and I'm out of ideas as to why
it is happening. I'm running slurmctld on a multihomed host.

(10.9.8.0/8) - master - (10.11.12.0/8)
There is no routing between these two subnets.
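
A side note on the notation above: taken literally, a /8 mask covers the
whole 10.0.0.0/8 range, so both prefixes as written describe the same
network. The small sketch below uses Python's standard ipaddress module
to show this. It only normalizes the two strings from the diagram; the
real masks on the interfaces may well be longer (for example /24, which
is purely an assumption here), in which case the two networks would be
distinct.

```python
import ipaddress

# The two prefixes exactly as written in the diagram above.
# strict=False lets ip_network() accept host bits and mask them off.
net_a = ipaddress.ip_network("10.9.8.0/8", strict=False)
net_b = ipaddress.ip_network("10.11.12.0/8", strict=False)

print(net_a)                  # normalized form of the first prefix
print(net_b)                  # normalized form of the second prefix
print(net_a.overlaps(net_b))  # True if the two ranges share addresses

# For comparison only: the same addresses with an assumed /24 mask.
alt_a = ipaddress.ip_network("10.9.8.0/24")
alt_b = ipaddress.ip_network("10.11.12.0/24")
print(alt_a.overlaps(alt_b))
```

If the masks really are /8 on both interfaces, the kernel's choice of
outgoing interface for a given slurmd address would be worth checking.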

So far, all slurmds resided in the first subnet and worked fine. I added
some in the second subnet and they keep changing into the DOWN state. I
checked the "last slurmd control message" and sometimes it is overdue by
20 minutes or more, with a configured slurmd timeout of 5 minutes. A
tcpdump showed that slurmctld isn't even trying to connect to the
slurmds during that time. I haven't found any packet loss yet, and the
redundant DNS servers both resolve the host names properly. slurmctld
just reports a communications error for the ping request, while the
slurmds are running and all hosts are idle.
What reasons can there be for slurmctld not contacting the slurmds? Or
is it more likely that the reply gets lost on its way?
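
One way to separate "slurmctld never connects" from "the reply gets
lost" is to probe the slurmd port directly from the master, independent
of slurmctld. The following is a minimal sketch, not a Slurm tool: the
function name can_connect and the optional source_ip parameter are made
up for illustration, and 6818 is assumed to be the port because it is
Slurm's default SlurmdPort (slurm.conf may override it).

```python
import socket

# Slurm's default SlurmdPort; an assumption, check slurm.conf.
SLURMD_PORT = 6818

def can_connect(host, port=SLURMD_PORT, timeout=3.0, source_ip=None):
    """Return True if a TCP connection to host:port can be established.

    source_ip, if given, binds the outgoing socket to that local address,
    which on a multihomed host forces a specific source address.
    """
    source = (source_ip, 0) if source_ip else None
    try:
        with socket.create_connection((host, port), timeout=timeout,
                                      source_address=source):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False
```

Called from the master as, say, can_connect("node42") or with source_ip
set to the master's address in the second subnet (the host name is a
placeholder), it shows whether a plain TCP handshake to a slurmd works
at the moment the node is reported as not responding.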

Gerhard


