[slurm-users] slurmctld hanging

Thu Jul 28 13:49:00 UTC 2022

Hi Byron,

We ran into this with 20.02, and mitigated it with some kernel tuning. From
our sysctl.conf:

net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192

# prevent neighbour (aka ARP) table overflow...

net.ipv4.neigh.default.gc_thresh1 = 30000
net.ipv4.neigh.default.gc_thresh2 = 32000
net.ipv4.neigh.default.gc_thresh3 = 32768
net.ipv4.neigh.default.mcast_solicit = 9
net.ipv4.neigh.default.ucast_solicit = 9
net.ipv4.neigh.default.gc_stale_time = 86400
net.ipv4.neigh.eth0.mcast_solicit = 9
net.ipv4.neigh.eth0.ucast_solicit = 9
net.ipv4.neigh.eth0.gc_stale_time = 86400

# enable selective ack algorithm
net.ipv4.tcp_sack = 1

# workaround TIME_WAIT
net.ipv4.tcp_tw_reuse = 1
# and since all traffic is local
net.ipv4.tcp_fin_timeout = 20
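
To check whether values like these are actually live on a controller, a minimal Python sketch along the following lines could compare a sysctl.conf fragment against /proc/sys. The helper names (parse_sysctl_conf, read_live, diff_sysctl) are hypothetical. It assumes a Linux host where each key maps to a path under /proc/sys by replacing dots with slashes, which holds for the keys above but not for interface names that themselves contain dots.

```python
from pathlib import Path
from typing import Callable, Dict, Optional, Tuple


def parse_sysctl_conf(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines, skipping blank lines and # or ; comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def read_live(key: str) -> Optional[str]:
    """Read the running value from /proc/sys; None if the key is unreadable."""
    # Assumption: dots in the key map directly to path separators.
    path = Path("/proc/sys") / key.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None


def diff_sysctl(
    wanted: Dict[str, str],
    reader: Callable[[str], Optional[str]] = read_live,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {key: (wanted, live)} for every key whose live value differs."""
    mismatches: Dict[str, Tuple[str, Optional[str]]] = {}
    for key, value in wanted.items():
        live = reader(key)
        if live != value:
            mismatches[key] = (value, live)
    return mismatches
```

Calling diff_sysctl(parse_sysctl_conf(open("/etc/sysctl.conf").read())) on the controller would list any setting that has not been applied (for example because "sysctl -p" was never run after editing the file). The reader argument exists only so the comparison can be exercised without touching /proc.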

We have a 16-bit (/16) cluster network, so the ARP settings date from that.
tcp_sack is more of a legacy setting from when some kernels didn't enable it
by default.

You would likely see a large number of connections in TIME_WAIT if you ran
"netstat -a" during the periods when you're seeing the hangs. Our workaround
settings seem to have mitigated that.
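
As an alternative to eyeballing netstat output, a rough Python sketch like the one below could tally socket states from the text of /proc/net/tcp. The function name count_tcp_states is hypothetical, and the port in the usage note (6817, the default SlurmctldPort) is an assumption about your configuration. Note that /proc/net/tcp covers IPv4 only; IPv6 sockets are listed in /proc/net/tcp6.

```python
from collections import Counter
from typing import Optional

# Kernel TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp: str, local_port: Optional[int] = None) -> Counter:
    """Count sockets per TCP state, optionally only for one local port."""
    counts: Counter = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:1AA1" (hex address, hex port)
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port is not None and port != local_port:
            continue
        counts[TCP_STATES.get(fields[3].upper(), fields[3])] += 1
    return counts
```

Run on the slurmctld host during a hang, something like count_tcp_states(open("/proc/net/tcp").read(), local_port=6817)["TIME_WAIT"] would give a single number to compare before and after applying the settings above.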

On Thu, Jul 28, 2022 at 9:29 AM byron <lbgpublic at gmail.com> wrote:

> Hi
>
> We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally
> (3 times in 2 months) have slurmctld hanging so we get the following
> message when running sinfo
>
> “slurm_load_jobs error: Socket timed out on send/recv operation”
>
> It only seems to happen when one of our users runs a job that submits a
> short-lived job every second for 5 days (up to 90,000 in a day), although
> that could be a red herring.
>
> There is nothing to be found in the slurmctld log.
>
> Can anyone suggest how to even start troubleshooting this?  Without
> anything in the logs I don't know where to start.
>
> Thanks
>
>