[slurm-users] Backfill Scheduling

Tue Jun 27 14:39:09 UTC 2023

> On Jun 27, 2023, at 1:10 AM, Loris Bennett <loris.bennett at fu-berlin.de> wrote:
> 
> Hi Reed,
> 
> Reed Dier <reed.dier at focusvq.com <mailto:reed.dier at focusvq.com>> writes:
> 
>> Is this an issue with the relative FIFO nature of the priority scheduling currently with all of the other factors disabled,
>> or since my queue is fairly deep, is this due to bf_max_job_test being
>> the default 100, and it can’t look deep enough into the queue to find
>> a job that will fit into what is unoccupied?
> 
> It could be that bf_max_job_test is too low.  On our system some users
> think it is a good idea to submit lots of jobs with identical resource
> requirements by writing a loop around sbatch.  Such jobs will exhaust
> the bf_max_job_test very quickly.  Thus we increased the limit to 1000
> and try to persuade users to use job arrays instead of home-grown loops.
> This seems to work OK[1].
> 
> Cheers,
> 
> Loris
> 
> -- 
> Dr. Loris Bennett (Herr/Mr)
> ZEDAT, Freie Universität Berlin

Thanks Loris,
I think this will be the next knob to turn, and your experience gives me a bit more confidence in that, as we too have many such identical jobs.

> On Jun 26, 2023, at 9:10 PM, Brian Andrus <toomuchit at gmail.com> wrote:
> 
> Reed,
> 
> You may want to look at the timelimit aspect of the job(s).
> 
> For one to 'squeeze in', it needs to be able to finish before the resources in use are expected to become available.
> 
> Consider:
> Job A is running on 2 nodes of a 3 node cluster. It will finish in 1 hour.
> Pending job B will run for 2 hours and needs 2 nodes, but only 1 is free, so it waits.
> Pending job C (with a lower priority) needs 1 node for 2 hours. Hmm, well it won't finish before the time job B is expected to start, so it waits.
> Pending job D (with even lower priority) needs 1 node for 30 minutes. That can squeeze in before the additional node for Job B is expected to be available, so it runs on the idle node.
> 
> Brian Andrus
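
For anyone following along, the rule in Brian's example can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not Slurm's actual backfill algorithm: the `Job` class and `backfill_now` function are hypothetical names, time is in minutes, and it conservatively assumes the idle node counts toward the blocked job's reservation, so a lower-priority job may start only if its time limit ends before the blocked job's expected start.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int    # nodes requested
    minutes: int  # time limit (for a running job: time remaining)


def backfill_now(total_nodes, running, pending):
    """Return the names of pending jobs that could start right now.

    running: jobs already on the cluster (minutes = time left).
    pending: waiting jobs, highest priority first.
    """
    running = list(running)  # work on a copy, leave the caller's list alone
    free = total_nodes - sum(job.nodes for job in running)
    started = []
    blocked_until = None  # expected start time of the first job that must wait

    for job in pending:
        if job.nodes > free:
            if blocked_until is None:
                # Release running jobs in completion order until the
                # blocked job would have enough nodes.
                available = free
                for r in sorted(running, key=lambda r: r.minutes):
                    available += r.nodes
                    if available >= job.nodes:
                        blocked_until = r.minutes
                        break
            continue
        # Start the job only if it cannot delay the blocked job.
        if blocked_until is None or job.minutes <= blocked_until:
            free -= job.nodes
            started.append(job.name)
            running.append(job)
    return started


# Brian's scenario: 3 nodes, job A holds 2 of them for another hour.
running = [Job("A", nodes=2, minutes=60)]
pending = [Job("B", 2, 120), Job("C", 1, 120), Job("D", 1, 30)]
print(backfill_now(3, running, pending))  # ['D']
```

B is blocked until A's nodes free up at the 60-minute mark, C's 2-hour limit overruns that point, and only D's 30-minute limit fits in the gap.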

Thanks Brian,

Our layout is a bit less exciting, in that none of these are >1 node per job.
So blocking out nodes for job:node Tetris isn’t really at play here.
The timing however is something I may turn an eye towards.
Most jobs have a “sanity” time limit applied, in that it is not so much an expected time limit, but rather an “if it goes this long, something obviously went awry and we shouldn’t keep holding on to resources” limit.
So it’s a bit hard to quantify the timing portion, but I haven’t looked into Slurm’s estimates of when it thinks the next task will start, etc.

The pretty simplistic example at play here is that there are nodes that are ~50-60% loaded for CPU and memory.
The next job up is a “whale” job that wants a ton of resources, CPU and/or memory, but down the line there is a job with 2 CPUs and 2 GB of memory that can easily slot into the unused resources.

So my thinking was that the job_test list may be too short to actually get that far down the queue to see that it could shove that job into some holes.
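
To make that concrete, here is a toy Python sketch of the depth limit. It is an assumption-laden illustration, not how Slurm is implemented: `first_fit_within_depth` is a hypothetical helper, jobs are reduced to (CPUs, memory in GB) pairs, time limits are ignored, and the queue contents and free resources are made-up numbers.

```python
def first_fit_within_depth(queue, free_cpus, free_mem_gb, max_job_test):
    """Return the queue index of the first job that fits in the free
    resources, looking at no more than max_job_test jobs from the top.

    queue: list of (cpus, mem_gb) tuples, highest priority first.
    Returns None if nothing within the tested depth fits.
    """
    for index, (cpus, mem_gb) in enumerate(queue[:max_job_test]):
        if cpus <= free_cpus and mem_gb <= free_mem_gb:
            return index
    return None


# 150 identical "whale" jobs ahead of one small 2 CPU / 2 GB job.
queue = [(32, 256)] * 150 + [(2, 2)]

# With a depth of 100 the small job is never examined.
print(first_fit_within_depth(queue, free_cpus=8, free_mem_gb=16, max_job_test=100))   # None

# With a depth of 1000 it is found at position 150.
print(first_fit_within_depth(queue, free_cpus=8, free_mem_gb=16, max_job_test=1000))  # 150
```

If the real scheduler behaves anything like this, a long run of identical jobs at the head of the queue would use up the whole test budget before the small job is ever considered.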

I’ll report back any findings after testing Loris’s suggestions.

Appreciate everyone’s help and suggestions,
Reed