[slurm-users] Tuning the backfill scheduler

Wed Oct 10 19:04:35 MDT 2018

Hello list,

My cluster usually has a fairly heterogeneous job load and spends much of its time memory bound.  Occasionally I have users who submit 100k+ short, low-resource jobs.  Despite having several thousand free cores and enough RAM to run them, the backfill scheduler would never backfill these jobs.  It turned out that several factors were involved: the jobs were deep in the queue, they came from an account with low priority, and there were a lot of them for the scheduler to consider.  After a bunch of tuning, the backfill scheduler parameters I finally settled on are:

SchedulerParameters=defer,bf_continue,bf_interval=20,bf_resolution=600,bf_yield_interval=1000000,sched_min_interval=2000000,bf_max_time=600,bf_max_job_test=1000000
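
For anyone reading that line later, here is a small Python sketch that splits it into individual options and prints the time-based values with units. The unit table reflects my reading of the slurm.conf man page (seconds for bf_interval, bf_resolution and bf_max_time; microseconds for bf_yield_interval and sched_min_interval) and is an assumption to check against your Slurm version. The helper names parse_scheduler_parameters and describe are made up for illustration and are not part of Slurm.

```python
# Units are assumed from the slurm.conf man page; verify for your Slurm version.
UNITS = {
    "bf_interval": "s",
    "bf_resolution": "s",
    "bf_max_time": "s",
    "bf_yield_interval": "us",
    "sched_min_interval": "us",
    "bf_max_job_test": "jobs",
}

LINE = (
    "SchedulerParameters=defer,bf_continue,bf_interval=20,bf_resolution=600,"
    "bf_yield_interval=1000000,sched_min_interval=2000000,bf_max_time=600,"
    "bf_max_job_test=1000000"
)


def parse_scheduler_parameters(line):
    """Split a SchedulerParameters line into {name: value}.

    Bare flags such as "defer" map to True. Values are assumed to be
    integers, which holds for the line above but not for every option.
    """
    _, _, value = line.partition("=")  # drop the leading "SchedulerParameters="
    params = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, raw = item.partition("=")
        params[name] = int(raw) if sep else True
    return params


def describe(params):
    """Return one human-readable line per option, converting microseconds to seconds."""
    rows = []
    for name, value in params.items():
        unit = UNITS.get(name)
        if value is True:
            rows.append(f"{name}: enabled")
        elif unit == "us":
            rows.append(f"{name}: {value} us ({value / 1_000_000:g} s)")
        elif unit:
            rows.append(f"{name}: {value} {unit}")
        else:
            rows.append(f"{name}: {value}")
    return rows


if __name__ == "__main__":
    for row in describe(parse_scheduler_parameters(LINE)):
        print(row)
```

Read this way, the line asks the backfill scheduler to start a cycle every 20 seconds, plan in 600-second slots, spend up to 600 seconds per cycle, yield locks once a second, and consider up to 1,000,000 jobs, while the main scheduler runs at most once every 2 seconds.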

After implementing these changes the backfill scheduler began to successfully schedule these jobs on the cluster.  While the cluster has a deep queue, the load on the slurmctld host can get pretty high.  However, no users have reported issues with the responsiveness of the various Slurm commands, and the backup controller has never taken over.  The changes have been in place for a month or so with no ill effects that I have observed.

While I was troubleshooting I combed the list archives for other people's tuning suggestions, so I figured I would post a message here for posterity and to see whether anyone has similar settings or feedback :-).

Cheers,
Richard