[slurm-users] How to throttle sinfo/squeue/scontrol show so they don't throttle slurmctld
Ransom, Geoffrey M.
Geoffrey.Ransom at jhuapl.edu
Mon Aug 17 18:30:15 UTC 2020
We are having performance issues with slurmctld (delayed sinfo/squeue results, socket timeouts for multiple sbatch calls, jobs/nodes sitting in COMP state for an extended period of time).
We just fully switched to Slurm from Univa, and I think our problem is users putting a lot of "scontrol show" calls (maybe squeue/sinfo as well) in large batches of jobs and essentially DoS-ing our scheduler.
Is there a built-in way to throttle "squeue/sinfo/scontrol show" commands in a reasonable manner, so that one user can't do something dumb like running jobs that keep calling these commands in bulk?
If I need to make something up to verify this, I am thinking about making a wrapper script around these commands that locks a shared temp file on the local disk of each machine (to avoid NFS locking issues), then sleeps for 5 seconds before calling the real command and releasing the lock. At least this way a user will only run 1 copy at a time per machine that their jobs land on, instead of 1 per CPU core per machine.
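A minimal sketch of such a wrapper, in Python, might look like the following. It is only an illustration of the lock-sleep-run idea described above: the lock file path, the directory holding the real Slurm binaries, and the function names are all assumptions, and only the 5-second delay comes from the description. It relies on flock(2) via the standard fcntl module, so it is Linux/Unix only.

```python
import fcntl
import os
import subprocess
import sys
import time

# Assumed values for illustration; adjust for the actual site.
LOCK_PATH = "/tmp/slurm_query.lock"  # must be on local disk, not NFS
DELAY_SECONDS = 5.0                  # delay mentioned in the post
REAL_BIN_DIR = "/usr/bin"            # assumed location of the real binaries


def _open_lock(lock_path):
    """Open the lock file read-only, creating it if it does not exist yet."""
    try:
        return os.open(lock_path, os.O_RDONLY)
    except FileNotFoundError:
        return os.open(lock_path, os.O_RDONLY | os.O_CREAT, 0o644)


def throttled_run(argv, lock_path=LOCK_PATH, delay=DELAY_SECONDS):
    """Take an exclusive per-machine lock, sleep, then run argv.

    Returns the exit status of the wrapped command.
    """
    fd = _open_lock(lock_path)
    try:
        # Blocks until no other wrapper on this machine holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        time.sleep(delay)
        return subprocess.call(argv)
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)


def main(argv):
    """Run the real command that has the same name as this wrapper."""
    name = os.path.basename(argv[0])
    # REAL_BIN_DIR must not contain the wrapper itself, or this would recurse.
    real_cmd = os.path.join(REAL_BIN_DIR, name)
    return throttled_run([real_cmd] + list(argv[1:]))
```

To use it as a drop-in, the file would end with `sys.exit(main(sys.argv))` and be installed under the names squeue, sinfo and scontrol in a directory that comes before the real binaries on the users' PATH. Because every wrapper on a machine waits on the same lock, calls are serialized to roughly one per 5 seconds per node, at the cost of making each interactive query slower as well.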