<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">
I'm trying to move from partition-based to QOS-based preemption so we can have the juicy prize of PreemptExemptTime. But in the process, I've encountered something that puzzles ME.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">
I have 2 partitions that, for the purposes of testing, are identical except for the QOS attached to them. Both partitions point to the same single node and both have OverSubscribe=NO set. I'll call them the open and sla-prio partitions.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">
I then start 2 jobs which both ask for a majority of the cores on the node. The only difference between the 2 sbatch submissions is that they use different partitions and QOS. I use the QOS to try to tell them how to preempt and who has priority.</div>
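<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">
A minimal sketch of what two such submissions could look like. The core count (20, assuming a 32-core node) and the script name job.sh are placeholders, chosen only so that each request is more than half of the node:
<pre>
# Illustrative only: 20 cores assumes a 32-core node; job.sh is a hypothetical script.
# Lower-priority job: open partition, open QOS
sbatch --partition=open     --qos=open --nodes=1 --ntasks=1 --cpus-per-task=20 job.sh
# Higher-priority job: sla-prio partition, sla QOS
sbatch --partition=sla-prio --qos=sla  --nodes=1 --ntasks=1 --cpus-per-task=20 job.sh
</pre>
</div>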
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">
QOS</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof ContentPasted2">
<table>
<tr><th>Name</th><th>Preempt</th><th>PreemptMode</th></tr>
<tr><td>sla</td><td>open</td><td>cluster</td></tr>
<tr><td>open</td><td></td><td>requeue</td></tr>
</table>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">
slurm.conf</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof ContentPasted0 ContentPasted1">
SelectType=select/cons_tres
<div class="ContentPasted0">SelectTypeParameters=CR_Core_Memory</div>
<div class="ContentPasted0">PreemptMode=SUSPEND,GANG</div>
<div class="ContentPasted0">PreemptType=preempt/qos</div>
PartitionName=open Nodes=t-sc-1101 default=YES QOS=open CpuBind=core OverSubscribe=No <br>
PartitionName=sla-prio Nodes=t-sc-1101 default=NO QOS=sla CpuBind=core OverSubscribe=No <br>
</div>
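<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">
The values the controller has actually loaded can be confirmed with commands along these lines. This is a sketch, and the selection of format fields is an assumption about what is relevant here:
<pre>
# QOS definitions, including the preemption-related fields
sacctmgr show qos format=Name,Priority,Preempt,PreemptMode,PreemptExemptTime
# Partition settings as the controller sees them
scontrol show partition open
scontrol show partition sla-prio
# Cluster-wide preemption and select plugin settings
scontrol show config | grep -i -E 'preempt|selecttype'
</pre>
</div>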
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">
What I'm finding is that, when I start the "lower priority" open QOS job on the open partition, it starts running on the node, taking more than half the cores. I then start the "higher priority" job on the sla-prio partition with the sla QOS. I would expect:</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">
<ol>
<li><span>The sla job would preempt the open job <span style="color: rgb(0, 0, 0); background-color: rgb(255, 255, 255); display: inline !important;" class="ContentPasted3">(cancel or requeue)</span> because of the QOS settings.</span></li><li><span>No matter what, the jobs would NOT share resources, as both partitions are set to OverSubscribe=NO.</span></li></ol>
<div><span>Yet when I start both jobs, I find them both running happily on the node. Since each asked for more than half of the cores, they are clearly sharing resources. I have found that if I make each job ask for ALL of the cores on the node,
THEN the preemption happens.</span></div>
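<div><span><br>
</span></div>
<div><span>One way to see the overlap directly is to list the jobs on the node and then print the core IDs assigned to each. A sketch, where 12345 is a placeholder job ID to be replaced with each real one:</span>
<pre>
# Jobs on the test node: job id, partition, QOS, CPU count, state
squeue -w t-sc-1101 -o "%.10i %.10P %.8q %.5C %.10T"
# Per-job core placement; 12345 is a placeholder job id
JOBID=12345
scontrol -d show job "$JOBID" | grep -i cpu_ids
</pre>
<span>If the CPU_IDs ranges of the two running jobs overlap, the cores really are being shared rather than just miscounted.</span></div>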
<div><span><br>
</span></div>
<div><span>I'm sure I've wandered into some completely weird Slurm backwaters with settings that no sane idiot would ever use...but I'm just trying to figure out what combination of settings ends up with oversubscription happening when I thought I REALLY indicated
I didn't want oversubscription to be happening.</span></div>
<div><span><br>
</span></div>
<div><span>Thanks for any help.</span></div>
</div>
</body>
</html>