[slurm-users] Backfill Scheduling

Wed Jun 28 05:43:32 UTC 2023

Hi Reed,

Reed Dier <reed.dier at focusvq.com> writes:

>  On Jun 27, 2023, at 1:10 AM, Loris Bennett <loris.bennett at fu-berlin.de> wrote:
>
>  Hi Reed,
>
>  Reed Dier <reed.dier at focusvq.com> writes:
>
>  Is this an issue with the relative FIFO nature of the priority scheduling currently with all of the other factors disabled,
>  or since my queue is fairly deep, is this due to bf_max_job_test being
>  the default 100, and it can’t look deep enough into the queue to find
>  a job that will fit into what is unoccupied?
>
>  It could be that bf_max_job_test is too low.  On our system some users
>  think it is a good idea to submit lots of jobs with identical resource
>  requirements by writing a loop around sbatch.  Such jobs will exhaust
>  the bf_max_job_test very quickly.  Thus we increased the limit to 1000
>  and try to persuade users to use job arrays instead of home-grown loops.
>  This seems to work OK[1].
>
>  Cheers,
>
>  Loris
>
>  -- 
>  Dr. Loris Bennett (Herr/Mr)
>  ZEDAT, Freie Universität Berlin
>
> Thanks Loris,
> I think this will be the next knob to turn, and your experience gives me a bit more confidence in that, as we too have many such identical jobs.
>
>  On Jun 26, 2023, at 9:10 PM, Brian Andrus <toomuchit at gmail.com> wrote:
>
>  Reed,
>
>  You may want to look at the timelimit aspect of the job(s).
>
>  For one to 'squeeze in', it needs to be able to finish before the resources in use are expected to become available.
>
>  Consider:
>  Job A is running on 2 nodes of a 3 node cluster. It will finish in 1 hour.
>  Pending job B will run for 2 hours and needs 2 nodes, but only 1 is free, so it waits.
>  Pending job C (with a lower priority) needs 1 node for 2 hours. Hmm, well it won't finish before the time job B is expected to start, so it waits.
>  Pending job D (with even lower priority) needs 1 node for 30 minutes. That can squeeze in before the additional node for Job B is expected to be
>  available, so it runs on the idle node.
>
>  Brian Andrus
>
> Thanks Brian,
>
> Our layout is a bit less exciting, in that none of these are >1 node per job.
> So the blocking out nodes for job:node Tetris isn’t really at play here.
> The timing however is something I may turn an eye towards.
> Most jobs have a “sanity” time limit applied, in that it is not so much an expected time limit, but rather an “if it goes this long, something obviously went
> awry and we shouldn’t keep holding on to resources” limit.
> So it’s a bit hard to quantify the timing portion, but I haven’t looked into Slurm’s estimates of when it thinks the next task will start, etc.
>
> The pretty simplistic example at play here is that there are nodes that are ~50-60% loaded for CPU and memory.
> The next job up is a “whale” job that wants a ton of resources, cpu and/or memory, but down the line there is a job with 2 cpu’s and 2 gb of memory
> that can easily slot in to the unused resources.
>
> So my thinking was that the job_test list may be too short to actually get that far down the queue to see that it could shove that job into some holes.

You might also want to look at increasing bf_window to the maximum time
limit, as suggested in 'man slurm.conf'.  If backfill is not looking far
enough into the future to know whether starting a job early will
negatively impact a 'whale', then that 'whale' could potentially wait
indefinitely.  This is what happened on our system when we had a maximum
runtime of 14 days but the 1 day default for bf_window.  With both set
to 14 days the problem was solved.
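
Since bf_window is given in minutes, the value has to be converted from
the maximum time limit.  A minimal sketch of that arithmetic, assuming a
14-day maximum runtime and the bf_max_job_test=1000 mentioned earlier
(the helper name scheduler_parameters is made up for illustration and is
not part of Slurm):

```python
def scheduler_parameters(max_time_limit_days: int, bf_max_job_test: int) -> str:
    """Build a SchedulerParameters line for slurm.conf (illustrative helper only)."""
    # bf_window is specified in minutes, so convert the maximum time limit.
    bf_window_minutes = max_time_limit_days * 24 * 60
    return (
        f"SchedulerParameters=bf_window={bf_window_minutes},"
        f"bf_max_job_test={bf_max_job_test}"
    )


# Example values from this thread: 14-day maximum runtime, 1000 jobs tested.
print(scheduler_parameters(14, 1000))
# prints: SchedulerParameters=bf_window=20160,bf_max_job_test=1000
```

Any options already present in SchedulerParameters would need to be
merged into that line by hand, and 'scontrol reconfigure' should pick up
the change.  If I read 'man slurm.conf' correctly, it also advises
raising bf_resolution when bf_window is increased, to keep the backfill
overhead down.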

Cheers,

Loris

> I’ll report back any findings after testing Loris’s suggestions.
>
> Appreciate everyone’s help and suggestions,
> Reed
>
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin