[slurm-users] Tuning the backfill scheduler

Michael Gutteridge michael.gutteridge at gmail.com
Thu Oct 11 05:54:11 MDT 2018


We've run into similar problems with backfill (though apparently not at the
scale you've got).  We have a number of users who will drop 5,000+ jobs at
once; as you've indicated, this can play havoc with backfill.

Two of the newer* parameters for the backfill scheduler that have been a real
help for us are "bf_max_job_assoc" and "bf_max_job_user".  These limit the
number of jobs the scheduler considers per association and per user,
respectively.
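As a sketch, they go in SchedulerParameters in slurm.conf alongside whatever
options you already have; the limits below are illustrative values I made up,
not recommendations:

```
SchedulerParameters=bf_continue,bf_max_job_user=50,bf_max_job_assoc=100
```

Once a user (or association) hits the limit, their remaining jobs are simply
skipped for that backfill pass, so one user's flood can't starve everyone else.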


- Michael

*I think these are newer; I don't actually know when they were added (I'm
currently on 17.11.5)

On Wed, Oct 10, 2018 at 6:08 PM Richard Feltstykket <
rafeltstykket at ucdavis.edu> wrote:

> Hello list,
> My cluster usually has a pretty heterogeneous job load and spends a lot of
> its time memory-bound.  Occasionally I have users who submit 100k+ short,
> low resource jobs.  Despite having several thousand free cores and enough
> RAM to run the jobs, the backfill scheduler would never backfill them.  It
> turns out that there were a number of factors: They were deep down in the
> queue, from an account with low priority, and there were a lot of them for
> the scheduler to consider.  After a bunch of tuning, the backfill scheduler
> parameters I finally settled on are:
> SchedulerParameters=defer,bf_continue,bf_interval=20,bf_resolution=600,bf_yield_interval=1000000,sched_min_interval=2000000,bf_max_time=600,bf_max_job_test=1000000
> After implementing these changes the backfill scheduler began to
> successfully schedule these jobs on the cluster.  While the cluster has a
> deep queue, the load on the slurmctld host can get pretty high.  However, no
> users have reported issues with the responsiveness of the various Slurm commands
> and the backup controller has never taken over either.  Changes have been
> in place for a month or so with no ill effects that I have observed.
> While I was troubleshooting I was definitely combing the list archives for
> other people's tuning suggestions, so I figured I would post a message here
> for posterity as well as see if anyone has similar settings or feedback
> :-).
> Cheers,
> Richard
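For readability, here are Richard's settings broken out one per line, with
descriptions paraphrased from the slurm.conf man page (verify the details
against the documentation for your Slurm version):

```
defer                       # don't try to schedule each job at submit time; wait for a pass
bf_continue                 # let backfill resume where it left off after releasing locks
bf_interval=20              # start a backfill pass every 20 seconds
bf_resolution=600           # compute backfill start times at 600-second resolution
bf_yield_interval=1000000   # release locks every 1,000,000 usec (1 s) during a pass
sched_min_interval=2000000  # at least 2,000,000 usec (2 s) between main scheduler runs
bf_max_time=600             # cap a single backfill pass at 600 seconds
bf_max_job_test=1000000     # consider up to 1,000,000 jobs per backfill pass
```

The coarse bf_resolution and the very large bf_max_job_test are what let the
scheduler get through a 100k-job queue, at the cost of less precise start-time
estimates and higher slurmctld load.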
