[slurm-users] Rate Limiting of RPC calls
Paul Edmon
pedmon at cfa.harvard.edu
Wed Feb 10 01:08:11 UTC 2021
We've hit this several times before. The tricks we've used to deal with
it are:
1. Being on the latest release: A lot of work has gone into improving
RPC throughput, so if you aren't running the latest 20.11 release I highly
recommend upgrading. 20.02 was also pretty good in this respect.
2. max_rpc_cnt/defer: I recommend using either of these SchedulerParameters
settings, as they give the scheduler more time to breathe (see the slurm.conf
sketch after this list).
3. Make sure your MySQL settings are such that your DB is fully cached in
memory and not hitting disk (see the my.cnf sketch after this list). I also
recommend running your DB on the same server as your ctld; we've found
that this can improve throughput.
4. We put a caching version of squeue in place, which gives users
almost-live data rather than live data. This additional buffer layer
helps cut down traffic. It's something we rolled in-house, backed by a
database that refreshes every 30 seconds (a rough sketch of the idea
follows this list).
5. Recommend that users submit jobs lasting more than 10 minutes and use
job arrays instead of looping over sbatch (see the array example after
this list). This will reduce thrashing.
Those are my recommendations for how to deal with this.
-Paul Edmon-
On 2/9/2021 7:59 PM, Kota Tsuyuzaki wrote:
> Hello guys,
>
> In our cluster, a new incoming member sometimes accidentally generates too many Slurm RPC calls (sbatch, sacct, etc.), and then slurmctld,
> slurmdbd, and mysql may become overloaded.
> To prevent such a situation, I'm looking for something like an RPC rate limit for users. Does Slurm support such a rate-limit feature?
> If not, is there a way to conserve Slurm server-side resources?
>
> Best,
> Kota
>
> --------------------------------------------
> 露崎 浩太 (Kota Tsuyuzaki)
> kota.tsuyuzaki.pc at hco.ntt.co.jp
> NTT Software Innovation Center
> Distributed Processing Platform Technology Project
> 0422-59-2837
> ---------------------------------------------