Hoping someone can help point me towards some tweaks to help prevent denial-of-service issues.
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
The root cause is understood: the shared storage for the slurmctlds was impacted, which increased write latency to the StateSaveLocation. Then, with a large enough avalanche of job submissions, the RPCs would stack up and the controller would stop responding.
I’ve been running well with some tweaks sourced from the “high-throughput” guide https://slurm.schedmd.com/high_throughput.html.
SchedulerParameters=max_rpc_cnt=400,\
sched_min_interval=50000,\
sched_max_job_start=300,\
batch_sched_delay=6
KillWait=30
MessageTimeout=30
I’m assuming that I was running into batch_sched_delay because, looking at sdiag after the fact, REQUEST_SUBMIT_BATCH_JOB was averaging 0.2s per RPC, with a total time of 5.5h out of the 16h8m18s covered by the sdiag sample.
sdiag output at Thu Jan 25 11:08:18 2024 (1706198898)
Data since      Wed Jan 24 19:00:00 2024 (1706140800)
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:98400 ave_time:201442 total_time:19821991013
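For anyone wanting to check the arithmetic, here is a rough Python sketch of how I get from that sdiag line to the 0.2s and 5.5h figures. It assumes the sdiag times are in microseconds; parse_rpc_stat is just a helper name I made up for illustration.

```python
import re

# Line copied from the sdiag output above; times are assumed to be microseconds.
line = "REQUEST_SUBMIT_BATCH_JOB ( 4003) count:98400 ave_time:201442 total_time:19821991013"


def parse_rpc_stat(text):
    """Extract the key:value integer fields (count, ave_time, total_time) from one RPC line."""
    fields = dict(re.findall(r"(\w+):(\d+)", text))
    return {key: int(value) for key, value in fields.items()}


stats = parse_rpc_stat(line)

# Sample window: "sdiag output at" epoch minus "Data since" epoch, in seconds.
window_s = 1706198898 - 1706140800

ave_s = stats["ave_time"] / 1e6            # average seconds per submit RPC
total_h = stats["total_time"] / 1e6 / 3600  # cumulative hours spent in this RPC type
share = stats["total_time"] / 1e6 / window_s

print(f"ave {ave_s:.2f}s, total {total_h:.1f}h of {window_s / 3600:.1f}h ({share:.0%})")
```

The window works out to 58098 seconds, which matches the 16h8m18s quoted above.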
Currently on 22.05.8, but hoping to get to 23.02.7 soon™, and I think this could possibly resolve the issue well enough, if I’m reading the release notes correctly (https://slurm.schedmd.com/archive/slurm-23.02-latest/news.html):
HIGHLIGHTS
-- slurmctld - Add new RPC rate limiting feature. This is enabled through SlurmctldParameters=rl_enable, otherwise disabled by default.
rl_enable (https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable)
Enable per-user RPC rate-limiting support. Client-commands will be told to back off and sleep for a second once the limit has been reached. This is implemented as a "token bucket", which permits a certain degree of "bursty" RPC load from an individual user before holding them to a steady-state RPC load established by the refill period and rate.
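To make sure I understand the "token bucket" idea, here is a toy Python sketch of the general technique. This is not Slurm's implementation; the class, its parameter names, and the numbers are all hypothetical and only meant to show a burst being allowed before the steady-state rate takes over.

```python
class TokenBucket:
    """Toy token bucket. Not Slurm's code; all names and values are hypothetical."""

    def __init__(self, bucket_size, refill_rate, refill_period):
        self.bucket_size = bucket_size      # maximum burst a user can spend at once
        self.refill_rate = refill_rate      # tokens added back per refill period
        self.refill_period = refill_period  # length of one refill period, in seconds
        self.tokens = bucket_size           # start with a full bucket
        self.last_refill = 0.0

    def allow(self, now):
        """Return True if an RPC arriving at time `now` (seconds) may proceed."""
        periods = int((now - self.last_refill) // self.refill_period)
        if periods > 0:
            # Refill for every whole period elapsed, capped at the bucket size.
            self.tokens = min(self.bucket_size, self.tokens + periods * self.refill_rate)
            self.last_refill += periods * self.refill_period
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False  # caller would be told to back off and retry


bucket = TokenBucket(bucket_size=5, refill_rate=2, refill_period=1.0)
burst = [bucket.allow(now=0.0) for _ in range(7)]  # 7 RPCs arrive at once
later = [bucket.allow(now=1.0) for _ in range(3)]  # 3 more one second later
print(burst)
print(later)
```

With these made-up numbers the first five RPCs of the burst pass and the last two are refused; a second later only two tokens have been refilled, so two of the next three pass.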
But given that the hardware seems to be well over-provisioned (CPU never drops below 5% idle), it feels like there is more room to squeeze some optimization out of here that I’m missing in the interim, and I’m hoping to get a better overall understanding in the process. I scrape the DBD Agent queue size from sdiag every 30s, and the largest value I saw was 115, which is much higher than normal but should be well below MaxDBDMsgs, whose minimum value is 10000.
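For context on the scrape, this is roughly what the check looks like as a Python sketch. The sample text is a single line I built from the peak value above rather than real captured output, and dbd_queue_size is a hypothetical helper name.

```python
import re

MAX_DBD_MSGS_FLOOR = 10000  # minimum MaxDBDMsgs value, as mentioned above


def dbd_queue_size(sdiag_text):
    """Return the DBD Agent queue size found in sdiag text, or None if absent."""
    match = re.search(r"DBD Agent queue size:\s*(\d+)", sdiag_text)
    return int(match.group(1)) if match else None


# Hypothetical sample line using the peak value I observed.
sample = "DBD Agent queue size: 115\n"
size = dbd_queue_size(sample)
print(f"{size} queued, {size / MAX_DBD_MSGS_FLOOR:.2%} of the minimum MaxDBDMsgs")
```

So even the worst sample I saw was barely over 1% of the smallest possible MaxDBDMsgs.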
I would really hope that I didn’t hit the 30s MessageTimeout value, but I guess that’s on the table as well; I don’t know whether that would trigger an sbatch submission failure like that.
Just moving the max_rpc_cnt value up seems like an easy button, but it also seems like it could have some adverse effects on backfill scheduling, and may bring diminishing returns for actually keeping RPCs flowing?
Setting max_rpc_cnt to more than 256 will be only useful to let backfill continue scheduling work after locks have been yielded (i.e. each 2 seconds) if there are a maximum of MAX(max_rpc_cnt/10, 20) RPCs in the queue. i.e. max_rpc_cnt=1000, the scheduler will be allowed to continue after yielding locks only when there are less than or equal to 100 pending RPCs.
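As I read that passage, the threshold works out as in this small Python sketch. The function name is mine, and the integer division is an assumption about how the max_rpc_cnt/10 term is evaluated.

```python
def backfill_resume_threshold(max_rpc_cnt):
    """Pending-RPC ceiling under which backfill may continue after yielding locks.

    Follows MAX(max_rpc_cnt/10, 20) from the quoted text; integer division is assumed.
    """
    return max(max_rpc_cnt // 10, 20)


for value in (150, 256, 400, 1000):
    print(value, backfill_resume_threshold(value))
```

With my current max_rpc_cnt=400 that would mean backfill only resumes once there are 40 or fewer pending RPCs, and the floor of 20 applies to anything at or below 200.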
Obviously, fixing the storage is the real solution, but I’m hoping that there may be more goodness to unlock, even if it is as simple as “upgrade to 23.02”.
Appreciate any insight,
Reed