[slurm-users] Preemption not working as expected when using SelectType = select/cons_res

Joshua Sonstroem jsonstro at ucsc.edu
Thu Mar 9 19:12:01 UTC 2023


Hi Folks,

I've got a medium-size cluster running Slurm 20.11.8 with a number of QOSs,
accounts, and partitions configured, and 3 tiers of priority that "float"
over the same nodes used by the lower tiers rather than having specific
nodes assigned to them.

The cluster wide preempt settings are:

PreemptMode             = CANCEL
PreemptType             = preempt/qos

I've also configured the QOS weight to be the dominant factor in the
priority calculation. Here are the priority settings:

PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = CALCULATE_RUNNING,NO_NORMAL_ALL
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 10000
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 0
PriorityWeightJobSize   = 1000
PriorityWeightPartition = 10000
PriorityWeightQOS       = 1000000
PriorityWeightTRES      = (null)

These 3 QOS tiers (each with an associated account and partition) are:
1. windfall (no limits on nodes or jobs), which is preempted by
2. cpuq and gpuq (limit of 16 jobs / 16 nodes per user per queue), which are
preempted by
3. 16 group-specific queues with custom limits (between 4 and 12 nodes per
group), which float over either the cpuq or the gpuq and should instantly
preempt jobs on both windfall and cpuq/gpuq (a rough sketch of this chain
follows below)
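
For reference, the preemption chain is expressed at the QOS level roughly
like the following (a simplified, from-memory sketch that uses comp-astro as
a stand-in for the 16 group queues; limits and priority values omitted):

sacctmgr modify qos cpuq set preempt=windfall
sacctmgr modify qos gpuq set preempt=windfall
sacctmgr modify qos comp-astro set preempt=windfall,cpuq,gpuq

# and to review what is actually configured:
sacctmgr show qos format=Name,Priority,Preempt,PreemptMode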

Before we moved off select/linear, preemption behaved as expected: when a
job was submitted to a higher-priority QOS tier, it would cancel the
lower-priority jobs as required and start running right away. However, since
I moved the cluster to select/cons_res about a year ago (to let users run
array jobs more effectively and to make cluster utilization more efficient),
we have been seeing different behavior, where higher-priority jobs get
"stranded" while waiting for resources.
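
The relevant slurm.conf change was essentially the following (the exact
SelectTypeParameters value here is from memory, so treat it as approximate):

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory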

As an example, we have recently seen full-node jobs come in on the
highest-priority partitions and not immediately preempt the jobs on the cpuq
or gpuq despite having a much higher priority. Instead, these high-priority
jobs sit in the queue with a reason of "Resources" (or "Priority"). They are
within the resource limits of both the partition and the QOS, so by design
they should not be held up.

Can anyone explain what is going on here? One thing I have noticed is that
the jobs running on the lower-priority partition are usually NOT using all
the resources of the nodes to which they are assigned. In other words, those
nodes show up as "mixed" rather than "allocated" in sinfo.
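
For what it's worth, this is roughly how I've been spot-checking the node
states (node name, state, and allocated/idle/other/total CPU counts):

sinfo -N -p gpuq -o "%N %t %C"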

For instance, the last time this occurred there were 16 preemptable gpuq
jobs, each running one to a node and using 24 cores and all of the RAM. The
system owner had requested that this specific queue, comp-astro, be set to
user-exclusive (since their jobs typically require the full throughput of
the filesystem); I'm not sure how that factors in here. In any case, the
higher-priority jobs did not kick the lower-priority gpuq jobs out. Here is
the sprio output for one of the lower-priority gpuq jobs:

  JOBID PARTITION     PRIORITY   SITE   AGE   JOBSIZE   PARTITION         QOS
 171411 gpuq          20010017      0    10         7       10000    20000000

and here it is for the higher-priority job:

  JOBID PARTITION     PRIORITY   SITE   AGE   JOBSIZE   PARTITION         QOS
 171441 comp-astr   1000010011      0     2         9       10000  1000000000
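
(The PRIORITY column is just the sum of the factor columns: 20000000 +
10000 + 10 + 7 = 20010017 for the gpuq job versus 1000000000 + 10000 + 2 +
9 = 1000010011 for the comp-astro job, so the QOS weight dominates exactly
as intended.)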

What I am trying to understand is why this leads to a state like the
following in squeue:
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES
NODELIST(REASON)
            171442 comp-astr  jid1    user PD       0:00      1 (Priority)
            171443 comp-astr  jid2    user PD       0:00      1 (Priority)
            171444 comp-astr  jid3    user PD       0:00      1 (Priority)
            171445 comp-astr  jid4    user PD       0:00      1 (Priority)
            171446 comp-astr  jid5    user PD       0:00      1 (Priority)
            171447 comp-astr  jid6    user PD       0:00      1 (Priority)
            171441 comp-astr  jid7    user PD       0:00      1 (Resources)
            171418      gpuq     jig1   user2  R       7:23      1 gpu010
            171417      gpuq     jig2   juser2  R       7:33      1 gpu002
            171416      gpuq     jig3   user2  R       7:46      1 gpu021
            171415      gpuq     jig4   user2  R       7:47      1 gpu012
            171414      gpuq     jig5   user2  R       7:50      1 gpu019
            171413      gpuq     jig6   user2  R       7:54      1 gpu025
            171412      gpuq     jig7   user2  R       8:27      1 gpu027
            171411      gpuq     jig8   user2  R       8:28      1 gpu009

If it is indeed the free resources on those nodes that prevent the
preemption from taking place, is there a way we can force the jobs on the
gpuq to use all the resources so that it will work as expected? Do we need
to have gpuq use --exclusive so that its jobs will always be cancelled?
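
To make the question concrete, the two approaches I can think of are roughly
the following (the partition line is abbreviated and <gpu nodelist> is a
placeholder, with the other existing PartitionName options left unchanged):

# option 1: whole-node allocations for the whole partition, in slurm.conf
PartitionName=gpuq Nodes=<gpu nodelist> OverSubscribe=EXCLUSIVE

# option 2: have each gpuq job request its whole node in the batch script
#SBATCH --exclusive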

Thanks for your time,
Josh


-- 
*Josh Sonstroem*
*Sr. Platform Engineer*
Comm 33 / Information Technology Services
cell (831) 332-0096
desk (831) 459-1526