Hoping someone may be able to help demystify some questions around the scheduler and preemption decisions.
I'm trying to better understand the scheduler's behavior as it pertains to preemption, with the goal of making scheduling more predictable. Slurm version is 24.11.5, on Ubuntu 22.04.
I have a fairly simple lo/hi partition pair, with the same set of nodes assigned to both:
PartitionName=partition-lo Nodes=foo[00..NN] Default=YES MaxTime=INFINITE OverSubscribe=FORCE:1 State=UP
PartitionName=partition-hi Nodes=foo[00..NN] Default=YES MaxTime=INFINITE OverSubscribe=NO State=UP PreemptMode=OFF
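(In case it helps anyone reproducing this, the node names above are placeholders, and the live definitions can be sanity-checked with scontrol:)

scontrol show partition partition-lo
scontrol show partition partition-hi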
And then I have two QOSes per partition:
Name              Priority  GraceTime  Preempt                  PreemptMode  UsageFactor  MaxJobsPU  MaxTRESPA  MaxJobsPA
qos-stateless-lo         1   00:00:00                           requeue         1.000000
qos-stateful-lo          1   00:00:00                           suspend         1.000000         NN    cpu=NNN        NNN
qos-stateful-hi          5   00:00:00  qos-state[ful,less]-lo   cluster         1.000000              cpu=NNN
qos-stateless-hi         5   00:00:00  qos-state[ful,less]-lo   cluster         1.000000
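(For reference, the hi QOSes were set up along these lines; the limit values are redacted/placeholder, so I'm only showing the preemption-relevant bits:)

sacctmgr modify qos qos-stateful-hi set priority=5 preempt=qos-stateful-lo,qos-stateless-lo preemptmode=cluster
sacctmgr modify qos qos-stateless-hi set priority=5 preempt=qos-stateful-lo,qos-stateless-lo preemptmode=cluster

# and the table above comes from:
sacctmgr show qos format=name,priority,gracetime,preempt,preemptmode,usagefactor,maxjobspu,maxtrespa,maxjobspa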
The general way it works out is that stateful jobs typically spawn stateless jobs. When preempted, stateful jobs get suspended, while stateless jobs get requeued. There are also some general guard rails to keep stateful jobs from clogging the queue, preventing stateless jobs from scheduling, and creating a deadlock, but that's not the specific issue here.
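To make that concrete, the submission pattern is roughly the following (script names here are placeholders, not our actual wrappers):

# stateful "parent" job: lo partition under the suspend QOS
sbatch --partition=partition-lo --qos=qos-stateful-lo stateful_parent.sh

# stateless workers it spawns: requeue QOS, so preemption requeues rather than suspends
sbatch --partition=partition-lo --qos=qos-stateless-lo stateless_worker.sh

# higher-priority work goes to the hi partition/QOS and should trigger preemption of the above
sbatch --partition=partition-hi --qos=qos-stateless-hi urgent_work.sh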
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
SchedulerParameters=max_rpc_cnt=500,\
sched_min_interval=50000,\
sched_max_job_start=300,\
batch_sched_delay=6
PriorityType=priority/multifactor
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
Workload is rather "high throughput", so a few settings were influenced by that https://slurm.schedmd.com/high_throughput.html guide.
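(To rule out config drift between slurm.conf and the running daemon, the effective values on the controller can be confirmed with something like:)

scontrol show config | grep -Ei 'schedulertype|schedulerparameters|selecttype|prioritytype|preempttype|preemptmode'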
What I end up seeing is that it can sometimes, but not always, take a while for lo jobs to be preempted by hi jobs. An example is below, but in case the mailing list eats images, here is a link: https://imgur.com/a/7xVFC8a It's a rather low-resolution view, since it's just a scraper running on 5-minute increments, but the blue filled area is "hi" jobs running and the yellow is "lo" jobs running. The second image (dashed lines) shows pending jobs for the same partitions. Oddly, this specific instance did not show any preemption events in the slurmctld logs, and users/admins were a bit perplexed as to why this dragged on for so long without preemption kicking in.
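For what it's worth, the extra visibility I know how to get is roughly the following; the log path is whatever SlurmctldLogFile points at (the path below is just an example):

# scheduler and backfill cycle statistics (cycle times, queue depth reached, etc.)
sdiag

# temporarily turn up detail for the backfill and gang-scheduling code paths
scontrol setdebugflags +Backfill
scontrol setdebugflags +Gang
scontrol setdebug debug2

# watch for preemption-related messages while reproducing the behavior
grep -i preempt /var/log/slurm/slurmctld.log

# put things back afterwards
scontrol setdebugflags -Backfill
scontrol setdebugflags -Gang
scontrol setdebug info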
To try to better understand and predict preemption decisions, I was considering looking deeper into: default_queue_depth, partition_job_depth, sched_interval, sched_min_interval, and defer.
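If it helps frame the question, the kind of change I have in mind is along these lines; the added values are placeholders I would still need to test, not recommendations:

# slurm.conf: current SchedulerParameters plus hypothetical queue-depth knobs
SchedulerParameters=max_rpc_cnt=500,\
sched_min_interval=50000,\
sched_max_job_start=300,\
batch_sched_delay=6,\
default_queue_depth=1000,\
partition_job_depth=500

# then pick up the change without restarting slurmctld:
scontrol reconfigure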
Hopefully someone can point me to some nuggets of information around this. Appreciate any pointers,

Reed