As we've scaled up our Slurm usage, we've noticed that short bursts of even a moderate number of DNS lookup failures are enough to regularly stall slurmctld:
_xgetaddrinfo: getaddrinfo(runner-t35knco5d-project-54-concurrent-0:37251) failed
...this has a cascading effect where, while stalled, the controller can't always communicate with nodes:
error: Error connecting, bad data: family = 0, port = 0
...and the controller will immediately mark the nodes as unhealthy and kill jobs:
slurmctld: Killing JobId=3120751 on failed node slurm-0f6cacdc1
The reason for the DNS failures is not an unreliable DNS server or network, but rather that the jobs are submitted by containers that don't have resolvable hostnames. This traditionally hasn't disrupted functionality, but we've noticed that if 8-10 jobs all terminate at the same time (the submitter container SIGTERMs the srun process), the controller can easily be overloaded for several seconds, despite having significant free system resources. gdb confirms the process is hanging on DNS. We can also see "Socket timed out on send/recv operation" from clients attempting to interact with the controller during the issue.
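For reference, this is roughly how we caught it in the act (command shown for illustration); attaching during a stall shows threads sitting in getaddrinfo():

    # attach to the running controller during a stall and dump all thread backtraces
    gdb -batch -ex 'thread apply all bt' -p $(pidof slurmctld)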
slurm 24.11.0
RHEL 8.10 kernel 4.18.0-553.58.1.el8_10.x86_64
We're looking into ways to make our ephemeral job-submitter containers resolvable in DNS to prevent lookup failures (either by giving them resolvable hostnames, or by blackholing the records to 0.0.0.0 to allow for fast local failure on the slurmctld server). However, it does seem unusual for a handful of bad DNS lookups to cause so much disruption in slurmctld.
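The blackhole variant would just be static entries on the slurmctld host, something like the following (name copied from the log above; this assumes the runner hostnames can be enumerated or pre-generated, which we haven't confirmed yet):

    # /etc/hosts on the slurmctld host: resolve known submitter names locally to
    # 0.0.0.0 so getaddrinfo() returns immediately instead of waiting on DNS
    0.0.0.0   runner-t35knco5d-project-54-concurrent-0   # one line per runner hostname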
Is this a known weak point of ctld? The slurmctld host is a single-purpose 16 vCPU / 30 GB EC2 instance with minimal load. We have ~150 nodes, and all nodes have valid IPs in slurm.conf to remove the need for ctld to perform lookups for nodes, but apparently there is still a need to look up the submit host as well, and we can reliably reproduce these cascading failures.
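For context, the node entries already look roughly like this (the address is a placeholder, not our real value), which is why we expected ctld to have no reason to touch DNS at all:

    # slurm.conf: NodeAddr pins each node to an IP so slurmctld never resolves
    # node hostnames (address shown is illustrative)
    NodeName=slurm-0f6cacdc1 NodeAddr=10.0.12.34 State=UNKNOWN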
Another possibility might be to extend SlurmdTimeout to something very long and hope that the controller recovers from its stall quickly enough to prevent it from marking nodes as unhealthy and killing jobs, but it's not clear whether that will have any effect, since the first occurrence of "error: Error connecting, bad data: family = 0, port = 0" immediately drains nodes and kills jobs.
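Concretely, that would just be bumping the timeout in slurm.conf, e.g. (the value is a guess at "long enough", not something we've validated):

    # slurm.conf: raise SlurmdTimeout well above the longest observed stall so a
    # briefly-stalled controller doesn't mark nodes down (value illustrative)
    SlurmdTimeout=600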
Thanks