[slurm-users] slurmctld hanging
samuel_fulcomer at brown.edu
Thu Jul 28 13:49:00 UTC 2022
We ran into this with 20.02, and mitigated it with some kernel tuning. From our sysctl configuration:
net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192
# prevent neighbour (aka ARP) table overflow...
net.ipv4.neigh.default.gc_thresh1 = 30000
net.ipv4.neigh.default.gc_thresh2 = 32000
net.ipv4.neigh.default.gc_thresh3 = 32768
net.ipv4.neigh.default.mcast_solicit = 9
net.ipv4.neigh.default.ucast_solicit = 9
net.ipv4.neigh.default.gc_stale_time = 86400
net.ipv4.neigh.eth0.mcast_solicit = 9
net.ipv4.neigh.eth0.ucast_solicit = 9
net.ipv4.neigh.eth0.gc_stale_time = 86400
# enable the TCP selective acknowledgement (SACK) algorithm
net.ipv4.tcp_sack = 1
# work around TIME_WAIT buildup by allowing socket reuse
net.ipv4.tcp_tw_reuse = 1
# and since all traffic is local
net.ipv4.tcp_fin_timeout = 20
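A sketch of one way to make settings like these persistent (the drop-in filename is my own choice, and both steps require root; only a subset of the values above is shown):

```shell
# Write the tuning values to a sysctl drop-in file (filename is arbitrary),
# then reload all sysctl configuration. Requires root.
cat <<'EOF' | sudo tee /etc/sysctl.d/90-slurm-net-tuning.conf >/dev/null
net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 20
EOF
sudo sysctl --system
```

Values set this way survive reboots, unlike ad-hoc `sysctl -w` changes.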
We have a 16-bit (/16) cluster network, so the ARP settings date to that.
tcp_sack is more of a legacy setting from when some kernels didn't enable it
by default. You would likely see large numbers of connections in TIME_WAIT if
you ran "netstat -a" during the periods when you're seeing the hangs. Our
workaround settings seem to have mitigated that.
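As a quick check, `ss` can count TIME_WAIT sockets directly (a sketch; 6817 is slurmctld's default port, so adjust the filter if your SlurmctldPort differs):

```shell
# Count TIME_WAIT sockets involving the slurmctld port (6817 by default).
# A large, growing count during a hang points at connection churn.
ss -tan state time-wait '( sport = :6817 or dport = :6817 )' | tail -n +2 | wc -l

# Or count all TIME_WAIT sockets system-wide:
ss -tan state time-wait | tail -n +2 | wc -l
```

Re-running this every few seconds while the hang is in progress shows whether the count keeps climbing.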
On Thu, Jul 28, 2022 at 9:29 AM byron <lbgpublic at gmail.com> wrote:
> We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally
> (3 times in 2 months) have slurmctld hanging so we get the following
> message when running sinfo
> “slurm_load_jobs error: Socket timed out on send/recv operation”
> It only seems to happen when one of our users runs a job that submits a
> short-lived job every second for 5 days (up to 90,000 in a day), although
> that could be a red herring.
> There is nothing to be found in the slurmctld log.
> Can anyone suggest how to even start troubleshooting this? Without
> anything in the logs I don't know where to start.