Hoping someone may be able to help demystify the scheduler's preemption decisions for me.

I'm trying to better understand the scheduler's behavior as it pertains to preemption, so that scheduling becomes more predictable for our users.
Slurm version is 24.11.5, OS is Ubuntu 22.04.

I have a fairly simple lo/hi partition setup, with the same set of nodes assigned to both:
PartitionName=partition-lo           Nodes=foo[00..NN]      Default=YES         MaxTime=INFINITE    OverSubscribe=FORCE:1   State=UP
PartitionName=partition-hi           Nodes=foo[00..NN]      Default=YES         MaxTime=INFINITE    OverSubscribe=NO        State=UP        PreemptMode=OFF

And then I have two QOSs per partition:
            Name   Priority  GraceTime                Preempt PreemptMode UsageFactor MaxJobsPU     MaxTRESPA MaxJobsPA
---------------- ---------- ---------- ---------------------- ----------- ----------- --------- ------------- ---------
qos-stateless-lo          1   00:00:00                            requeue    1.000000
 qos-stateful-lo          1   00:00:00                            suspend    1.000000        NN       cpu=NNN       NNN
 qos-stateful-hi          5   00:00:00 qos-state[ful,less]-lo     cluster    1.000000                 cpu=NNN
qos-stateless-hi          5   00:00:00 qos-state[ful,less]-lo     cluster    1.000000
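
(That table is roughly the output of sacctmgr with a format string matching those column headers, i.e. something like:

    sacctmgr show qos format=Name,Priority,GraceTime,Preempt,PreemptMode,UsageFactor,MaxJobsPU,MaxTRESPA,MaxJobsPA

with the actual limits redacted to NN/NNN.)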

The general way it works out is that stateful jobs typically spawn stateless jobs.
When preempted, stateful jobs get suspended, while stateless jobs get requeued.
There are also some general guard rails to keep stateful jobs from clogging the queue, starving stateless jobs, and creating a deadlock, but that's not the specific issue here.

SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
SchedulerParameters=max_rpc_cnt=500,\
sched_min_interval=50000,\
sched_max_job_start=300,\
batch_sched_delay=6
PriorityType=priority/multifactor
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
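
(The above is straight out of slurm.conf; the values the running slurmctld is actually using can be cross-checked with something like:

    scontrol show config | grep -iE "preempt|^sched"

)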

The workload is rather "high throughput", so a few of the settings above were influenced by the Slurm High Throughput Computing guide.

What I end up seeing is that it can sometimes, but not always, take a while for lo jobs to be preempted by hi jobs. An example is below, but in case the mailing list eats images, here is a link to it: https://imgur.com/a/7xVFC8a
This is a rather low-resolution view, as it is just a scraper running at 5-minute increments, but the blue filled area is running "hi" jobs, while the yellow is running "lo" jobs.
The second image (dashed lines) shows pending jobs for the same partitions.
Oddly, this specific instance did not show any preemption events in the slurmctld logs, and users/admins were a bit perplexed as to why it dragged on for so long without preemption kicking in.

I was considering digging deeper into this to try to better understand and predict preemption decisions; roughly what I had in mind:
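- Watching the pending hi jobs' reason/priority while lo jobs keep running, something like:

    squeue -t PD -p partition-hi -o "%.12i %.10Q %.14r %q"
    sprio -l

- Temporarily turning up slurmctld's scheduler logging the next time it happens (DebugFlag name from memory):

    scontrol setdebugflags +Backfill
    scontrol setdebug debug2
    (reproduce, then revert)
    scontrol setdebugflags -Backfill
    scontrol setdebug info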

Hopefully someone can point me to some nuggets of information around this?
Appreciate any pointers,
Reed