[slurm-users] How to throttle sinfo/squeue/scontrol show so they don't throttle slurmctld

Mon Aug 17 19:06:49 UTC 2020

The slurm scheduler only locks out user requests when specific data
structures are locked due to modification, or potential modification.
So, the most effective technique is to limit the time windows in which
that happens: keep traversal of the main scheduling loop efficient
(the list itself may be modified during it), and allow longer
intervals for periods when state may be in flux from resources, such
as node slurmd to controller slurmctld RPCs.

First, please become familiar with the sdiag command and its output.
There is a huge unknown in any answer that anyone not on your systems
has. Specifically, we don't know the job mixture that is submitted to
your clusters. A predictable pattern of job submissions is ideal,
because you will be able to optimize slurm's parameters. This
doesn't necessarily mean a constant pattern. It may mean that you
recognize that your daytime load is different from your nighttime
load, which is different from your end-of-semester load. Then you can
influence user behavior so that jobs belonging to those different
loads leave strong hints, through the use of reservations, QOS,
partitions or specific accounts, that can be recognized by the
scheduler.
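
As a starting point for reading sdiag, here is a minimal Python sketch
that groups the "key: value" lines of sdiag output by section. The
sample text is illustrative only, not output from a real cluster; the
layout and the assumption that cycle times are reported in microseconds
should be checked against the sdiag man page for your Slurm version.
The function names are hypothetical helpers, not part of Slurm.

```python
import re

# Illustrative sample only (assumption): real sdiag output has more fields
# and its layout can differ between Slurm versions.
SAMPLE_SDIAG = """\
Server thread count:  3
Agent queue size:     0
Jobs submitted: 120

Main schedule statistics (microseconds):
    Last cycle:   1520
    Mean cycle:   980
    Last queue length: 450

Backfilling stats
    Total backfilled jobs (since last slurm start): 37
    Last cycle: 160000000
    Mean cycle: 150000000
    Last depth cycle: 450
"""

# Matches lines of the form "Some key:   12345" with a purely numeric value.
_KEY_VALUE = re.compile(r"^([^:]+):\s*(\d+)$")


def parse_sdiag(text):
    """Return {section: {key: int}} for the integer-valued lines of sdiag."""
    stats = {"general": {}}
    section = "general"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _KEY_VALUE.match(line)
        if match:
            stats[section][match.group(1).strip()] = int(match.group(2))
        elif line.endswith(":") or ":" not in line:
            # Treat a line with no value as a section heading.
            section = line.rstrip(":")
            stats.setdefault(section, {})
        # Anything else (dates, per-RPC rows) is ignored in this sketch.
    return stats


def backfill_mean_cycle_seconds(stats):
    """Mean backfill cycle in seconds, assuming sdiag reports microseconds."""
    return stats["Backfilling stats"]["Mean cycle"] / 1_000_000


stats = parse_sdiag(SAMPLE_SDIAG)
print("server threads:", stats["general"]["Server thread count"])
print("mean backfill cycle (s):", backfill_mean_cycle_seconds(stats))
```

On a real system the text would come from running sdiag itself, for
example via subprocess, and would be sampled repeatedly over the
different load periods described above.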

These presentations provide guidance:
  https://slurm.schedmd.com/SUG14/sched_tutorial.pdf <-- start here
  https://slurm.schedmd.com/SLUG19/Troubleshooting.pdf
There are also relevant tickets in bugs.schedmd.com where you may
compare your cluster's characteristics, role, job load, etc with
similar reported situations. Since the submitted job load {type,
characteristics, frequency} dominates the scheduler behavior, it is not
possible to provide a set of one-size-fits-all guidelines. Look for
ways that your site is similar to the use case in the tickets.

In our particular case, we did the following:
1) We found that the average # of jobs in the queue could be traversed
in 2.5 minutes, so we increased bf_interval to ~4 minutes (to handle
variability of the load).
2) There was almost always a small job that could be backfill-scheduled.
3) We limited bf_max_job_user_part to ~30, depending on whether
our users used --time-min and minnodes, which together make for
efficient backfill scheduling. The limit means that even if a small #
of users take advantage of the backfill scheduler, they don't appear
to take over the machine. From a scheduling perspective, a few users
dominating isn't a bad thing, but it makes for many headaches for user
support folks who have to answer why specific users can dominate the
system.
4) Set bf_max_job_test= and default_queue_depth= so that the full
queue can be traversed, but so that there's a cutoff if the queue is
huge and there are too many potential backfillable jobs; set
bf_continue with these limits so jobs near the bottom of the queue
don't become starved.
5) Measure the # of active RPCs, both with sdiag and with ncat or
similar tools.
    Consider increasing max_rpc_cnt based on these measurements.
6) Some of the guidelines in the high frequency/high throughput
guidance may be helpful, esp. increasing SOMAXCONN and other TCP
tuning. I would suggest caution when changing these, as they obviously
affect many different subsystems in addition to slurm. Unless you have
a dedicated slurm controller and scheduler node, you are increasing
risk and variability by making these changes.
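
The arithmetic in steps 1-4 can be sketched in Python as follows. This
is only an illustration under stated assumptions: the 60% headroom is
my reading of "2.5 minutes, so ~4 minutes", the two queue-depth values
are placeholders (the numbers we used are not given above), and the
helper names are hypothetical. It also assumes bf_interval is given in
seconds on the SchedulerParameters line of slurm.conf.

```python
def bf_interval_with_headroom(traversal_seconds, headroom_percent=60):
    """Measured queue traversal time plus a safety margin, in whole seconds."""
    scaled = traversal_seconds * (100 + headroom_percent)
    return -(-scaled // 100)  # integer ceiling division


def scheduler_parameters(options):
    """Render a dict as a slurm.conf SchedulerParameters line.

    A value of True renders as a bare flag (e.g. bf_continue).
    """
    parts = []
    for name, value in options.items():
        if value is True:
            parts.append(name)
        else:
            parts.append(f"{name}={value}")
    return "SchedulerParameters=" + ",".join(parts)


# PLACEHOLDER (assumption): pick a cutoff from your own queue lengths.
QUEUE_CUTOFF_PLACEHOLDER = 1000

options = {
    "bf_interval": bf_interval_with_headroom(150),  # 2.5 min measured
    "bf_max_job_user_part": 30,
    "bf_max_job_test": QUEUE_CUTOFF_PLACEHOLDER,
    "default_queue_depth": QUEUE_CUTOFF_PLACEHOLDER,
    "bf_continue": True,
}
print(scheduler_parameters(options))
```

The point of generating the line rather than hand-editing it is that
the interval can be recomputed whenever sdiag shows the traversal time
drifting with the job mix.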

For jobs sitting in COMP state, you may need to look at what is in
your epilog, or encourage other behaviors in job scripts,
applications or caching layers. Are jobs sitting in COMP state because
there's a lot of dirty I/O to be flushed? Does your epilog do the
equivalent of (fsync(F_DATASYNC))? Are the applications not syncing
their data during the job run? Some tuning of end-of-job timeouts
could be done in slurm, but this seems more of a symptom of
misbalanced caching and applications. Look for jobs stuck in, say,
Lustre I/O or network I/O and spikes in I/O right as jobs finish.
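
To make "syncing their data during the job run" concrete, here is a
minimal Python sketch of an application that flushes and fsyncs its
output every few writes, so that less dirty data is left to flush when
the job ends. The function name and the sync_every parameter are
hypothetical, and how much this helps on Lustre or NFS depends on the
filesystem client, so treat it as an illustration rather than a fix.

```python
import os


def write_with_periodic_sync(path, chunks, sync_every=10):
    """Append chunks to path, forcing data to storage every sync_every writes."""
    if sync_every < 1:
        raise ValueError("sync_every must be at least 1")
    with open(path, "ab") as out:
        for count, chunk in enumerate(chunks, start=1):
            out.write(chunk)
            if count % sync_every == 0:
                out.flush()             # push Python's buffer to the OS
                os.fsync(out.fileno())  # ask the OS to write it out now
        # Final sync so nothing is left dirty when the job exits.
        out.flush()
        os.fsync(out.fileno())
```

A small sync_every trades throughput for a shorter flush at job end;
the right value depends on the write pattern.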

Hope this helps,
-Steve

On Mon, Aug 17, 2020 at 12:31 PM Ransom, Geoffrey M.
<Geoffrey.Ransom at jhuapl.edu> wrote:
>
>
>
> Hello
>
>     We are having performance issues with slurmctld (delayed sinfo/squeue results, socket timeouts for multiple sbatch calls, jobs/nodes sitting in COMP state for an extended period of time).
>
>
>
> We just fully switched to Slurm from Univa and I think our problem is users putting a lot of “scontrol show” calls (maybe squeue/sinfo as well) in large batches of jobs and essentially DoS-ing our scheduler.
>
>
>
> Is there a built-in way to throttle “squeue/sinfo/scontrol show” commands in a reasonable manner so one user can’t do something dumb running jobs that keep calling these commands in bulk?
>
>
>
> If I need to make something up to verify this I am thinking about making a wrapper script around these commands that locks a shared temp file on the local disk (to avoid NFS locking issues) of each machine and then sleeps for 5 seconds before calling the real command and releasing the lock. At least this way a user will only run 1 copy per machine that their jobs land on instead of 1 per CPU core per machine.
>
>
>
> Thanks.
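
The wrapper idea described in the quoted message could be sketched in
Python roughly as below. It is not a tested production wrapper: the
lock path and function names are hypothetical, it relies on flock from
the Unix-only fcntl module, and it assumes the lock file lives on a
local filesystem as the message proposes.

```python
import fcntl
import os
import subprocess
import time


def open_lock_file(lock_path):
    """Open the shared lock file read-only, creating it if it is missing."""
    try:
        return os.open(lock_path, os.O_RDONLY)
    except FileNotFoundError:
        return os.open(lock_path, os.O_RDONLY | os.O_CREAT, 0o644)


def run_throttled(command, lock_path="/tmp/slurm_query.lock", delay_seconds=5):
    """Run command while holding an exclusive per-machine lock.

    Only one caller on this machine proceeds at a time; each one waits
    delay_seconds before invoking the real command. Returns its exit code.
    """
    fd = open_lock_file(lock_path)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        time.sleep(delay_seconds)
        return subprocess.run(command).returncode
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

A wrapper installed ahead of the real binary in PATH might call
something like run_throttled(["/usr/bin/squeue"] + sys.argv[1:]), with
the real path being site-specific. Note that every caller is
serialized, so many cores calling at once will queue up behind the
5-second delay.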