[slurm-users] Rate Limiting of RPC calls
Paul Edmon
pedmon at cfa.harvard.edu
Wed Feb 10 01:08:11 UTC 2021
We've hit this several times before. The tricks we've used to deal with
it are:
1. Being on the latest release: A lot of work has gone into improving
RPC throughput, so if you aren't running the latest 20.11 release I highly
recommend upgrading. 20.02 was also pretty good in this respect.
2. max_rpc_cnt/defer: I recommend using either of these SchedulerParameters
settings, as they give the scheduler more time to breathe (see the slurm.conf
sketch after this list).
3. Make sure your MySQL settings are such that your DB is fully cached in
memory and not hitting disk (see the my.cnf sketch after this list). I also
recommend running your DB on the same server as your ctld; we've found
that this can improve throughput.
4. We put a caching version of squeue in place, which gives users
almost-live data rather than live data. This additional buffer layer
helps cut down traffic. It's something we rolled in-house, backed by a
database that refreshes every 30 seconds (a rough sketch of the idea
follows this list).
5. Recommend that users submit jobs lasting more than 10 minutes and use
job arrays instead of looping over sbatch (see the array example after
this list). This will reduce thrashing.
Those are my recommendations for how to deal with this.
-Paul Edmon-
On 2/9/2021 7:59 PM, Kota Tsuyuzaki wrote:
> Hello guys,
>
> In our cluster, a new incoming member sometimes accidentally generates too many Slurm RPC calls (sbatch, sacct, etc.), and then slurmctld,
> slurmdbd, and mysql may become overloaded.
> To prevent such a situation, I'm looking for something like an RPC rate limit for users. Does Slurm support such a rate-limit feature?
> If not, is there a way to conserve Slurm server-side resources?
>
> Best,
> Kota
>
> --------------------------------------------
> 露崎 浩太 (Kota Tsuyuzaki)
> kota.tsuyuzaki.pc at hco.ntt.co.jp
> NTT Software Innovation Center
> Distributed Processing Platform Technology Project
> 0422-59-2837
> ---------------------------------------------