Hoping someone may be able to help demystify some questions around the scheduler and preemption decisions.
I'm trying to better understand the scheduler's behavior as it pertains to preemption, with the goal of making scheduling more predictable. Slurm version is 24.11.5, on Ubuntu 22.04.
I have a fairly simple lo/hi partition pair, with the same set of nodes assigned to both:
PartitionName=partition-lo Nodes=foo[00..NN] Default=YES MaxTime=INFINITE OverSubscribe=FORCE:1 State=UP
PartitionName=partition-hi Nodes=foo[00..NN] Default=YES MaxTime=INFINITE OverSubscribe=NO State=UP PreemptMode=OFF
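(In case it helps anyone reproducing this, the node names above are placeholders, and the live definitions can be sanity-checked with scontrol:)

scontrol show partition partition-lo
scontrol show partition partition-hi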
And then I have two QOSes per partition:
Name              Priority  GraceTime  Preempt                  PreemptMode  UsageFactor  MaxJobsPU  MaxTRESPA  MaxJobsPA
qos-stateless-lo         1   00:00:00                           requeue         1.000000
qos-stateful-lo          1   00:00:00                           suspend         1.000000         NN    cpu=NNN        NNN
qos-stateful-hi          5   00:00:00  qos-state[ful,less]-lo   cluster         1.000000              cpu=NNN
qos-stateless-hi         5   00:00:00  qos-state[ful,less]-lo   cluster         1.000000
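(For reference, the hi QOSes were set up along these lines; the limit values are redacted/placeholder, so I'm only showing the preemption-relevant bits:)

sacctmgr modify qos qos-stateful-hi set priority=5 preempt=qos-stateful-lo,qos-stateless-lo preemptmode=cluster
sacctmgr modify qos qos-stateless-hi set priority=5 preempt=qos-stateful-lo,qos-stateless-lo preemptmode=cluster

# and the table above comes from:
sacctmgr show qos format=name,priority,gracetime,preempt,preemptmode,usagefactor,maxjobspu,maxtrespa,maxjobspa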
The general way it works out is that stateful jobs typically spawn stateless jobs. When preempted, stateful jobs get suspended, while stateless jobs get requeued. There are also some general guard rails to keep stateful jobs from clogging the queue, preventing stateless jobs from scheduling, and creating a deadlock, but that's not the specific issue here.
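To make that concrete, the submission pattern is roughly the following (script names here are placeholders, not our actual wrappers):

# stateful "parent" job: lo partition under the suspend QOS
sbatch --partition=partition-lo --qos=qos-stateful-lo stateful_parent.sh

# stateless workers it spawns: requeue QOS, so preemption requeues rather than suspends
sbatch --partition=partition-lo --qos=qos-stateless-lo stateless_worker.sh

# higher-priority work goes to the hi partition/QOS and should trigger preemption of the above
sbatch --partition=partition-hi --qos=qos-stateless-hi urgent_work.sh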
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
SchedulerParameters=max_rpc_cnt=500,\
sched_min_interval=50000,\
sched_max_job_start=300,\
batch_sched_delay=6
PriorityType=priority/multifactor
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
Workload is rather "high throughput", so a few settings were influenced by that https://slurm.schedmd.com/high_throughput.html guide.
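(To rule out config drift between slurm.conf and the running daemon, the effective values on the controller can be confirmed with something like:)

scontrol show config | grep -Ei 'schedulertype|schedulerparameters|selecttype|prioritytype|preempttype|preemptmode'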
What I end up seeing is that it can sometimes, but not always, take a while for lo jobs to be preempted by hi jobs. An example is below, but in case the mailing list eats images, here is a link: https://imgur.com/a/7xVFC8a It's a rather low-resolution view, since it's just a scraper running on 5-minute increments, but the blue filled area is "hi" jobs running and the yellow is "lo" jobs running. The second image (dashed lines) shows pending jobs for the same partitions. Oddly, this specific instance did not show any preemption events in the slurmctld logs, and users/admins were a bit perplexed as to why this dragged on for so long without preemption kicking in.
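For what it's worth, the extra visibility I know how to get is roughly the following; the log path is whatever SlurmctldLogFile points at (the path below is just an example):

# scheduler and backfill cycle statistics (cycle times, queue depth reached, etc.)
sdiag

# temporarily turn up detail for the backfill and gang-scheduling code paths
scontrol setdebugflags +Backfill
scontrol setdebugflags +Gang
scontrol setdebug debug2

# watch for preemption-related messages while reproducing the behavior
grep -i preempt /var/log/slurm/slurmctld.log

# put things back afterwards
scontrol setdebugflags -Backfill
scontrol setdebugflags -Gang
scontrol setdebug info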
To try to better understand and predict preemption decisions, I was considering looking deeper into: default_queue_depth, partition_job_depth, sched_interval, sched_min_interval, and defer.
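If it helps frame the question, the kind of change I have in mind is along these lines; the added values are placeholders I would still need to test, not recommendations:

# slurm.conf: current SchedulerParameters plus hypothetical queue-depth knobs
SchedulerParameters=max_rpc_cnt=500,\
sched_min_interval=50000,\
sched_max_job_start=300,\
batch_sched_delay=6,\
default_queue_depth=1000,\
partition_job_depth=500

# then pick up the change without restarting slurmctld:
scontrol reconfigure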
Hopefully someone can point me to some nuggets of information around this. Appreciate any pointers,

Reed