[slurm-users] How to throttle sinfo/squeue/scontrol show so they don't throttle slurmctld
pedmon at cfa.harvard.edu
Mon Aug 17 20:00:28 UTC 2020
We've seen this in our shop. Our solutions have been:
1. Use defer or max_rpc_cnt to slow down the scheduler so it can catch
up with RPCs.
2. Target specific chatty users and tell them to knock it off. sdiag is
your friend for this. We also repeatedly tell users not to ping the
scheduler more often than once a minute.
3. We built a caching DB for squeue that gives users almost live data
instead of live data. So when they run squeue, it hits an external
database rather than the scheduler itself.
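
To illustrate the idea behind item 3, here is a minimal sketch of a time-based cache in front of squeue. It is not the caching database described above, only an in-process stand-in for the same principle: at most one real squeue call per interval, with everyone else served the last snapshot. The names CachedCommand, run_squeue and cached_squeue, and the 60-second TTL, are assumptions made up for this example.

```python
import subprocess
import time
from typing import Callable, Optional


class CachedCommand:
    """Serve a command's output from memory until it is older than ttl seconds."""

    def __init__(self, fetch: Callable[[], str], ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # function that performs the real query
        self._ttl = ttl              # maximum age of a snapshot, in seconds
        self._clock = clock          # injectable clock, handy for testing
        self._value: Optional[str] = None
        self._stamp = 0.0

    def get(self) -> str:
        now = self._clock()
        # Refresh only if nothing is cached yet or the snapshot has expired.
        if self._value is None or now - self._stamp >= self._ttl:
            self._value = self._fetch()
            self._stamp = now
        return self._value


def run_squeue() -> str:
    # Hypothetical fetcher: the only place that actually contacts slurmctld.
    result = subprocess.run(["squeue"], capture_output=True, text=True, check=True)
    return result.stdout


# Nothing is executed until the first .get() call.
cached_squeue = CachedCommand(run_squeue, ttl=60.0)
```

Calling cached_squeue.get() repeatedly within a minute produces a single RPC to the scheduler. A shared setup would keep the snapshot in a database or file that all users read, rather than in one process's memory.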
Also making sure you are on the latest version of Slurm is highly
recommended as there are numerous performance improvements.
For something straight out of the box, though, I would look at
defer/max_rpc_cnt, as that will help the scheduler cope with high RPC load.
On 8/17/2020 2:30 PM, Ransom, Geoffrey M. wrote:
> We are having performance issues with slurmctld (delayed
> sinfo/squeue results, socket timeouts for multiple sbatch calls,
> jobs/nodes sitting in COMP state for an extended period of time).
> We just fully switched to Slurm from Univa and I think our problem is
> users putting a lot of “scontrol show” calls (maybe squeue/sinfo as
> well) in large batches of jobs and essentially DoS-ing our scheduler.
> Is there a built in way to throttle “squeue/sinfo/scontrol show”
> commands in a reasonable manner so one user can’t do something dumb
> running jobs that keep calling these commands in bulk?
> If I need to make something up to verify this I am thinking about
> making a wrapper script around these commands that locks a shared
> temp file on the local disk (to avoid NFS locking issues) of each
> machine and then sleeps for 5 seconds before calling the real command
> and releasing the lock. At least this way a user will only run 1 copy
> per machine that their jobs land on instead of 1 per CPU core per machine.
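
The wrapper idea in the quoted message could be sketched roughly as follows. This is an untested-in-production illustration, not a recommended deployment: the function names, the lock file location under /tmp (assumed to be node-local disk), and the per-user file naming are all assumptions made for this example. It uses flock(2) through Python's fcntl module, so it is Unix-only.

```python
import fcntl
import os
import subprocess
import sys
import time


def default_lock_path() -> str:
    # One lock file per user, on what is assumed to be node-local /tmp,
    # so a user's jobs on this machine serialize against each other only.
    return f"/tmp/slurm-query-{os.getuid()}.lock"


def run_throttled(cmd, lock_path=None, delay=5.0, sleep=time.sleep) -> int:
    """Take an exclusive lock, wait, run cmd, then release the lock."""
    path = lock_path or default_lock_path()
    with open(path, "a") as lock_file:
        # Blocks until no other process on this machine holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            sleep(delay)  # the 5-second pause from the quoted proposal
            return subprocess.run(cmd).returncode
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: throttled.py /usr/bin/squeue -u someuser
    sys.exit(run_throttled(sys.argv[1:]))
```

With this in place, N concurrent callers on one node complete one at a time, at least `delay` seconds apart, so the scheduler sees at most one query per user per node every few seconds. Users can still bypass it by calling the real binaries directly, so it only guards against accidental abuse.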