<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Hoping someone can help point me towards some tweaks to help prevent denial-of-service issues.<div class=""><blockquote type="cite" class=""><font face="Menlo" class="">sbatch: error: Batch job submission failed: Socket timed out on send/recv operation<br class=""></font></blockquote><div class=""><br class=""></div><div class="">The root cause is understood: the shared storage backing the slurmctld’s StateSaveLocation was impacted, leading to an increase in write latency.</div><div class="">Then, with a large enough avalanche of job submissions, the RPCs would stack up and stop responding.</div><div class=""><br class=""></div><div class="">I’ve been running well with some tweaks sourced from the “high-throughput” <a href="https://slurm.schedmd.com/high_throughput.html" class="">guide</a>. 
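Since the failure mode is write latency to the StateSaveLocation, one interim sanity check is to time fsync'd writes on that filesystem, which roughly mimics what slurmctld does on every state save. A minimal sketch in Python; the path in the comment is an assumption, so point it at your actual StateSaveLocation:

```python
import os
import time

def probe_write_latency(path, size=1 << 20, samples=3):
    """Time size-byte writes followed by fsync in the given directory,
    roughly mimicking slurmctld's synchronous state-save writes."""
    probe = os.path.join(path, "latency_probe.%d" % os.getpid())
    payload = b"\0" * size
    latencies = []
    for _ in range(samples):
        start = time.monotonic()
        with open(probe, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the write to stable storage
        latencies.append(time.monotonic() - start)
    os.unlink(probe)
    return latencies

# Hypothetical path -- substitute your real StateSaveLocation:
# print(probe_write_latency("/var/spool/slurmctld"))
```

On healthy local storage the samples should come back in single-digit milliseconds; if they jump to hundreds of milliseconds during a submission avalanche, that would line up with the RPC backlog described above.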
</div><div class=""><br class=""></div><div class=""><blockquote type="cite" class=""><div class=""><font face="Menlo" class="">SchedulerParameters=max_rpc_cnt=400,\</font></div><div class=""><font face="Menlo" class="">sched_min_interval=50000,\</font></div><div class=""><font face="Menlo" class="">sched_max_job_start=300,\</font></div><div class=""><font face="Menlo" class="">batch_sched_delay=6</font></div></blockquote><blockquote type="cite" class=""><font face="Menlo" class="">KillWait=30<br class="">MessageTimeout=30</font><br class=""></blockquote><br class=""></div><div class="">I’m assuming I was running into batch_sched_delay because, looking at sdiag after the fact, REQUEST_SUBMIT_BATCH_JOB was <i class="">averaging</i> 0.2s, and its total time was 5.5h out of the 16h8m18s covered by the sdiag sample.</div></div><div class=""><blockquote type="cite" class=""><div class=""><font face="Menlo" class="">*******************************************************</font></div><div class=""><font face="Menlo" class="">sdiag output at Thu Jan 25 11:08:18 2024 (1706198898)</font></div><div class=""><font face="Menlo" class="">Data since Wed Jan 24 19:00:00 2024 (1706140800)</font></div><div class=""><font face="Menlo" class="">*******************************************************</font></div></blockquote><blockquote type="cite" class=""> REQUEST_SUBMIT_BATCH_JOB ( 4003) count:98400 ave_time:201442 total_time:19821991013</blockquote><br class=""></div><div class="">Currently on 22.05.8, but hoping to get to 23.02.7 soon™, and I think this could possibly resolve the issue well enough, if I’m reading the <a href="https://slurm.schedmd.com/archive/slurm-23.02-latest/news.html" class="">release notes</a> correctly?</div><div class=""><br class=""></div><div class=""></div><blockquote type="cite" class=""><div class=""><font face="Menlo" class="">HIGHLIGHTS</font></div><div class=""><font face="Menlo" class="">==========</font></div><div class=""><font face="Menlo" class=""> -- slurmctld 
- Add new RPC rate limiting feature. This is enabled through</font></div><div class=""><font face="Menlo" class=""> SlurmctldParameters=rl_enable, otherwise disabled by default.</font></div></blockquote><div class=""><br class=""></div><div class=""><blockquote type="cite" class=""><a href="https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable" class=""><font face="Menlo" class="">rl_enable</font></a></blockquote><blockquote type="cite" class=""><font face="Menlo" class="">Enable per-user RPC rate-limiting support. Client-commands will be told to back off and sleep for a second once the limit has been reached. This is implemented as a "token bucket", which permits a certain degree of "bursty" RPC load from an individual user before holding them to a steady-state RPC load established by the refill period and rate.</font></blockquote><br class=""></div><div class="">But given that the hardware seems to be well over-provisioned (CPU never drops below 5% idle), it feels like there is more optimization to squeeze out here in the interim that I’m missing, and I’m hoping to get a better overall understanding in the process.</div><div class="">I scrape the DBD Agent queue size from sdiag every 30s, and the largest value I saw was 115, which is much higher than normal but still well below MaxDBDMsgs, whose <i class="">minimum</i> value is 10000.</div><div class=""><span style="font-style: normal;" class=""><br class=""></span></div><div class="">I would really hope that I didn’t hit the 30s MessageTimeout value, but I guess that’s on the table as well, not knowing whether that would trigger an sbatch submission failure like this.</div><div class=""><span style="font-style: normal;" class=""><br class=""></span></div><div class="">Just moving the max_rpc_cnt value up seems like an easy button, but it also seems like it could have some adverse effects on backfill scheduling, and may offer diminishing returns for actually keeping RPCs 
flowing?</div><div class=""><blockquote type="cite" class=""><font face="Menlo" class="">Setting max_rpc_cnt to more than 256 will be only useful to let backfill continue scheduling work after locks have been yielded (i.e. each 2 seconds) if there are a maximum of MAX(max_rpc_cnt/10, 20) RPCs in the queue. i.e. max_rpc_cnt=1000, the scheduler will be allowed to continue after yielding locks only when there are less than or equal to 100 pending RPCs. </font></blockquote><br class=""></div><div class="">Obviously, fixing the storage is the real solution, but I’m hoping there may be more goodness to unlock here, even if it’s as simple as “upgrade to 23.02”.</div><div class=""><br class=""></div><div class="">Appreciate any insight,</div><div class="">Reed</div></body></html>
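P.S. For anyone eyeballing their own sdiag output against the numbers above: the RPC times sdiag reports are in microseconds. A small Python sketch of the unit conversion (the parsing regex is an assumption about sdiag's exact line format):

```python
import re

def parse_rpc_line(line):
    """Convert one sdiag RPC-statistics line into seconds and hours.
    sdiag reports ave_time and total_time in microseconds."""
    m = re.search(
        r"(\S+)\s*\(\s*(\d+)\)\s*count:(\d+)\s*ave_time:(\d+)\s*total_time:(\d+)",
        line,
    )
    if not m:
        raise ValueError("unrecognized sdiag RPC line: %r" % line)
    name, _msg_type, count, ave_us, total_us = m.groups()
    return {
        "rpc": name,
        "count": int(count),
        "ave_s": int(ave_us) / 1e6,        # microseconds -> seconds
        "total_h": int(total_us) / 3.6e9,  # microseconds -> hours
    }

line = ("REQUEST_SUBMIT_BATCH_JOB ( 4003) count:98400 "
        "ave_time:201442 total_time:19821991013")
print(parse_rpc_line(line))  # ave_s is ~0.20s, total_h is ~5.51h
```

Which is where the "averaging 0.2s, 5.5h total" figures in the message come from.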