[slurm-users] Backfill Scheduling

Tue Jun 27 06:10:31 UTC 2023

Hi Reed,

Reed Dier <reed.dier at focusvq.com> writes:

> Hoping this will be an easy one for the community.
>
> The priority schema was recently reworked for our cluster, with only
> PriorityWeightQOS and PriorityWeightAge contributing to the priority
> value, while PriorityWeightAssoc, PriorityWeightFairshare,
> PriorityWeightJobSize, and PriorityWeightPartition are now set to 0,
> and PriorityFavorSmall set to NO.
> The cluster is fairly loaded right now, with a big backlog of work (~250 running jobs, ~40K pending jobs).
> The majority of these jobs are arrays, which runs the pending job count up quickly.
>
> What I’m trying to figure out is:
> The next highest priority job array in the queue is waiting on resources, everything else on priority, which makes sense.
> However, there is a good portion of the cluster unused, seemingly
> dammed by the next up job being large, while there are much smaller
> jobs behind it that could easily fit into the available resources
> footprint.
>
> Is this an issue with the relative FIFO nature of the priority scheduling currently with all of the other factors disabled,
> or since my queue is fairly deep, is this due to bf_max_job_test being
> the default 100, and it can’t look deep enough into the queue to find
> a job that will fit into what is unoccupied?

It could be that bf_max_job_test is too low.  On our system some users
think it is a good idea to submit lots of jobs with identical resource
requirements by writing a loop around sbatch.  Such jobs use up the
bf_max_job_test limit very quickly.  We therefore increased the limit to
1000 and are trying to persuade users to use job arrays instead of
home-grown loops.  This seems to work OK[1].
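
As a sketch of the setting in question: bf_max_job_test is one of the
options of SchedulerParameters in slurm.conf.  The value 1000 is simply
what we use, and any options your site already has in
SchedulerParameters would need to stay in the same comma-separated list.

```
# slurm.conf (fragment)
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_test=1000
```

After editing slurm.conf, 'scontrol reconfigure' should make the
scheduler pick up the new value.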

Cheers,

Loris

> PriorityType=priority/multifactor
> SchedulerType=sched/backfill
>
> Hoping to know where I might want to swing my hammer next, without whacking the wrong setting
>
> Appreciate any advice,
> Reed
>

Footnotes:

[1] One problem we still have to address is that we don't have an
    array-enabled version of the 'subgXX' script for the quantum
    chemistry program Gaussian.  This is a Perl script which parses the
    input for the program, generates a job script and submits it.  An
    array-enabled version would have to stipulate a specific mapping
    between the array task ID and the way the input files are
    organised.  We are currently not sure about the best way to do this
    in a suitably generic way.
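
    One possible mapping, sketched here only as an illustration and not
    something we have implemented: list the input files in a manifest,
    one per line, and let each array task pick the line matching its
    task ID.  The helper name input_for_task and the manifest name
    inputs.txt are hypothetical; SLURM_ARRAY_TASK_ID is the variable
    Slurm sets for each array task.

```shell
#!/bin/bash
#SBATCH --array=1-3   # hypothetical range: one task per manifest line

# Hypothetical helper: print the N-th line (1-based) of a manifest
# file that lists one input file per line.
input_for_task() {
    local manifest=$1 task_id=$2
    sed -n "${task_id}p" "$manifest"
}

# Inside an array job, Slurm sets SLURM_ARRAY_TASK_ID for each task.
# The manifest name inputs.txt is an assumption for illustration.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ] && [ -r inputs.txt ]; then
    input=$(input_for_task inputs.txt "$SLURM_ARRAY_TASK_ID")
    echo "task ${SLURM_ARRAY_TASK_ID} would process: ${input}"
fi
```

    Outside a Slurm job the script only defines the helper and does
    nothing else.  The echo line is a stand-in for whatever actually
    runs the program on the selected input file.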

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin