[slurm-users] How to throttle sinfo/squeue/scontrol show so they don't throttle slurmctld

Ransom, Geoffrey M. Geoffrey.Ransom at jhuapl.edu
Mon Aug 17 18:30:15 UTC 2020


Hello
    We are having performance issues with slurmctld (delayed sinfo/squeue results, socket timeouts for multiple sbatch calls, jobs/nodes sitting in COMP state for an extended period of time).

We just fully switched to Slurm from Univa, and I think our problem is users putting a lot of "scontrol show" calls (maybe squeue/sinfo as well) in large batches of jobs and essentially DoS-ing our scheduler.

Is there a built-in way to throttle "squeue/sinfo/scontrol show" commands in a reasonable manner, so that one user can't do something dumb like running jobs that keep calling these commands in bulk?

If I need to put something together myself to verify this, I am thinking about writing a wrapper script around these commands that locks a shared temp file on the local disk of each machine (to avoid NFS locking issues), then sleeps for 5 seconds before calling the real command and releasing the lock. At least this way a user will only run one copy at a time per machine that their jobs land on, instead of one per CPU core per machine.
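
To make the wrapper idea concrete, here is a rough sketch of what I have in mind. It is untested against our cluster and assumes Linux and Python 3. The lock path, the function names and the default delay are placeholders I made up, not anything Slurm provides.

```python
import fcntl
import os
import subprocess
import time

# Hypothetical lock file on node-local disk (not NFS).
LOCK_PATH = "/tmp/slurm_query.lock"
# Delay held under the lock before the real command runs.
DELAY_SECONDS = 5


def open_lock(lock_path):
    """Open the shared lock file read-only, creating it if it is missing."""
    try:
        return os.open(lock_path, os.O_RDONLY)
    except FileNotFoundError:
        # World-readable so other users' wrappers can open and lock it too.
        return os.open(lock_path, os.O_RDONLY | os.O_CREAT, 0o644)


def run_throttled(argv, lock_path=LOCK_PATH, delay=DELAY_SECONDS):
    """Run argv while holding an exclusive per-machine lock.

    Returns the exit code of the wrapped command.
    """
    fd = open_lock(lock_path)
    try:
        # Blocks until no other wrapper on this machine holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        time.sleep(delay)
        return subprocess.call(argv)
    finally:
        # Closing the descriptor releases the flock.
        os.close(fd)
```

A wrapper installed in place of squeue could then end with something like sys.exit(run_throttled(["/usr/bin/squeue"] + sys.argv[1:])), where /usr/bin/squeue is only my guess at where the real binary would be moved. Because the sleep happens while the lock is held, each machine would send at most one query every 5 seconds, no matter how many cores are running jobs.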

Thanks.