[slurm-users] Oversubscribing even though it's set to No on both partitions
rug262 at psu.edu
Fri Mar 24 15:35:19 UTC 2023
I'm trying to puzzle out using QOS-based preemption instead of partition-based preemption so we can have the juicy prize of PreemptExemptTime. But in the process, I've encountered something that puzzles ME.
I have 2 partitions that, for the purposes of testing, are identical except for the QOS attached to them. Both partitions point to a single node and both have OverSubscribe=NO set. I'll call them the open and sla-prio partitions.
I then start 2 jobs which both ask for a majority of the cores on the node. The only difference between the 2 sbatch submissions is that they use different partitions and QOS. I use the QOS to tell Slurm how to preempt and who has priority.
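For reference, a QOS pair like this can be set up along these lines (a sketch, not a paste of my exact commands; the priority value and the 30-minute PreemptExemptTime are illustrative):

```shell
# Low-priority QOS that can be preempted
sacctmgr add qos open

# Higher-priority QOS that is allowed to preempt "open";
# leaving PreemptMode at "cluster" defers to the cluster-wide PreemptMode
sacctmgr add qos sla set Preempt=open Priority=1000

# Guarantee jobs under the "open" QOS a minimum run time before
# they become eligible for preemption (illustrative value)
sacctmgr modify qos open set PreemptExemptTime=00:30:00

# Inspect the resulting relationship
sacctmgr show qos format=Name,Preempt,PreemptMode
```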
Name  Preempt  PreemptMode
sla   open     cluster
PartitionName=open Nodes=t-sc-1101 Default=YES QOS=open CpuBind=core OverSubscribe=NO
PartitionName=sla-prio Nodes=t-sc-1101 Default=NO QOS=sla CpuBind=core OverSubscribe=NO
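The two submissions look roughly like this (a sketch; the core count and wrapped command are hypothetical, assuming for illustration a 48-core node):

```shell
# Low-priority job: asks for more than half of a 48-core node
sbatch --partition=open --qos=open --ntasks=32 --wrap="sleep 3600"

# Higher-priority job submitted afterwards; I expect it to preempt
# the first job rather than share the node with it
sbatch --partition=sla-prio --qos=sla --ntasks=32 --wrap="sleep 3600"
```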
What I'm finding is that, when I start the "lower priority" open QOS job on the open partition, it starts running on the node, taking more than half the cores. I then start the "higher priority" job on the sla-prio partition with the sla QOS. I would expect:
1. The sla job would preempt the open job (cancel or requeue) because of the QOS settings.
2. That no matter what, the jobs would NOT share resources, as both partitions are set to OverSubscribe=NO.
Yet when I start both jobs, I find them both running happily on the node. Since they both asked for more than half of the cores, they are clearly sharing resources. I have found that if I make each job ask for ALL of the cores on the node, THEN the preemption happens.
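In case it helps, the cluster-side knobs I believe are in play live in slurm.conf; a QOS-preemption setup generally looks something like this (illustrative values, not a paste of my config):

```shell
# slurm.conf fragment (illustrative)
PreemptType=preempt/qos        # decide preemption by QOS rather than partition
PreemptMode=REQUEUE            # cluster-wide default action when preempting
SelectType=select/cons_tres    # track individual cores/TRES on each node
SelectTypeParameters=CR_Core   # allocate by core, so per-core requests are enforced
```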
I'm sure I've wandered into some completely weird Slurm backwaters with settings no sane admin would ever use... but I'm just trying to figure out what combination of settings ends up with oversubscription happening when I thought I REALLY indicated I didn't want it to happen.
Thanks for any help.