<div dir="ltr">Hi Byron,<div><br></div><div>We ran into this with Slurm 20.02 and mitigated it with some kernel tuning. From our sysctl.conf:</div><div><br></div><div>net.core.somaxconn = 2048<br>net.ipv4.tcp_max_syn_backlog = 8192<br><br><br># prevent neighbour (aka ARP) table overflow...<br><br>net.ipv4.neigh.default.gc_thresh1 = 30000<br>net.ipv4.neigh.default.gc_thresh2 = 32000<br>net.ipv4.neigh.default.gc_thresh3 = 32768<br>net.ipv4.neigh.default.mcast_solicit = 9<br>net.ipv4.neigh.default.ucast_solicit = 9<br>net.ipv4.neigh.default.gc_stale_time = 86400<br>net.ipv4.neigh.eth0.mcast_solicit = 9<br>net.ipv4.neigh.eth0.ucast_solicit = 9<br>net.ipv4.neigh.eth0.gc_stale_time = 86400<br><br># enable selective ack algorithm<br>net.ipv4.tcp_sack = 1<br><br># workaround TIME_WAIT<br>net.ipv4.tcp_tw_reuse = 1<br># and since all traffic is local<br>net.ipv4.tcp_fin_timeout = 20<br></div><div><br></div><div><br></div><div>We have a 16-bit cluster network, so the ARP settings date from that. tcp_sack is more of a legacy setting from when some kernels didn't enable it by default. </div><div><br></div><div>You would likely see tons of connections in TIME_WAIT if you ran "netstat -a" during periods when you're seeing the hangs. 
Our workaround settings seem to have mitigated that.</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jul 28, 2022 at 9:29 AM byron &lt;<a href="mailto:lbgpublic@gmail.com">lbgpublic@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi <br></div><div><br></div><div>We recently upgraded Slurm from 19.05.7 to 20.11.9, and now slurmctld occasionally hangs (3 times in 2 months), so we get the following message when running sinfo:</div><div><br></div><div><span style="color:rgb(34,34,34);font-family:Arial,Helvetica,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;display:inline;float:none">“slurm_load_jobs error: Socket timed out on send/recv operation”</span></div><div><span style="color:rgb(34,34,34);font-family:Arial,Helvetica,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;display:inline;float:none"><br></span></div><div><span style="color:rgb(34,34,34);font-family:Arial,Helvetica,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;display:inline;float:none">It only seems to happen 
when one of our users runs a job that submits a short-lived job every second for 5 days (up to 90,000 in a day), although that could be a red herring. <br></span></div><div><span style="color:rgb(34,34,34);font-family:Arial,Helvetica,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;display:inline;float:none"><br></span></div><div><span style="color:rgb(34,34,34);font-family:Arial,Helvetica,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;display:inline;float:none">There is nothing to be found in the slurmctld log.</span></div><div><span style="color:rgb(34,34,34);font-family:Arial,Helvetica,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;display:inline;float:none"><br></span></div><div><span style="color:rgb(34,34,34);font-family:Arial,Helvetica,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;display:inline;float:none">Can anyone suggest how to even start troubleshooting this? 
Without anything in the logs I don't know where to start.</span></div><div><span style="color:rgb(34,34,34);font-family:Arial,Helvetica,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;display:inline;float:none"><br></span></div><div><span style="color:rgb(34,34,34);font-family:Arial,Helvetica,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;display:inline;float:none">Thanks</span></div><div><span style="color:rgb(34,34,34);font-family:Arial,Helvetica,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;display:inline;float:none"><br></span></div></div>
</blockquote></div>