[slurm-users] Need help configuring 3 tier priority/multifactor preemption in cluster

Wed Apr 15 19:30:17 UTC 2020

Hi Slurm-Users,

Hope this post finds all of you healthy and safe amidst the ongoing COVID-19
craziness. We've got a strange error state that occurs when we enable
preemption, and we need help diagnosing what is wrong. I'm not sure whether
we are missing a default value or some other necessary configuration, but
when we enable Slurm preemption on a cluster with multiple queues, Slurm
stops reporting all 40+ CPUs on each node and reports only a single CPU per
node (after some random amount of time). This is problematic on multiple
levels and has caused problems for users submitting jobs that request more
than one CPU.

For some quick background on our setup: we have a 100+ node Linux cluster
that uses Lustre for storage, is managed with Bright View, and uses Slurm as
its scheduler. The slurm.conf file lives on a shared volume, on one of the
Lustre file systems, that is mounted across all the nodes. We have defined a
number of queues for Slurm and have three distinct tiers of workloads.
Before setting out we looked around but were unable to find a succinct
how-to on the web describing how to configure the 3-tier design we wanted,
so I'll outline the steps we took below. We've tried a number of variations
of the examples from https://slurm.schedmd.com/preempt.html but none exactly
matches the model we want, so we may still be missing key configuration
options.

The desired high-level design is for all compute and GPU nodes to exist in
a lowest-priority "windfall" queue (PriorityTier value of 100), with a
medium-priority pair of default queues above it (PriorityTier value of
200; these are called "defq" and "gpuq" for ease of use), and finally
20 or so high-priority queues for particular research groups above
that (PriorityTier value of 300), which are limited to just a few nodes per
queue and should take final precedence.

As for how to handle preemption on each tier, we don't plan to SUSPEND
jobs, but rather to CANCEL a windfall job or REQUEUE a defq/gpuq job when a
higher-priority job from one of the researcher-specific queues requests a
resource that is already in use by a lower-priority job. The final layout
looks something like this:

PriorityTier   PreemptMode   QueueType   NodeType
100            CANCEL        windfall    all
200            REQUEUE       defq        cpu
200            REQUEUE       gpuq        gpu
300            REQUEUE       lab1        cpu
300            REQUEUE       lab2        cpu
300            ...           ...         ...
(etc.)
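
For reference, the layout above could be turned into partition lines with a
small script like the one below. This is only a sketch: the node lists
(node[001-100], gpu[01-08], and so on) are made-up placeholders rather than
our real node names, and it assumes PartitionName, Nodes, PriorityTier and
PreemptMode are the only per-partition options that matter here.

```python
# Sketch: render slurm.conf partition lines for the 3-tier layout.
# Node lists are hypothetical placeholders, not our real hostnames.

TIERS = [
    # (partition, nodes, PriorityTier, PreemptMode)
    ("windfall", "node[001-100],gpu[01-08]", 100, "CANCEL"),
    ("defq",     "node[001-100]",            200, "REQUEUE"),
    ("gpuq",     "gpu[01-08]",               200, "REQUEUE"),
    ("lab1",     "node[001-004]",            300, "REQUEUE"),
    ("lab2",     "node[005-008]",            300, "REQUEUE"),
]


def partition_line(name, nodes, tier, preempt_mode):
    """Format one PartitionName line in slurm.conf syntax."""
    return (
        f"PartitionName={name} Nodes={nodes} "
        f"PriorityTier={tier} PreemptMode={preempt_mode}"
    )


def render(tiers=TIERS):
    """Return the partition lines, lowest tier first."""
    ordered = sorted(tiers, key=lambda t: t[2])
    return [partition_line(*t) for t in ordered]


for line in render():
    print(line)
```

Generating the lines this way mainly helps keep the 20 or so lab queues
consistent, so no partition is accidentally left without a PreemptMode.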

Once this was laid out, the next step was to ensure that each queue we
created had its PreemptMode set to "CANCEL" or "REQUEUE" rather than "OFF"
before enabling the 'preempt/partition_prio' plugin; otherwise we'd get an
error. Since the initial cluster design didn't use preemption, we added the
PriorityType line first:

> PriorityType=priority/multifactor

Then we added the following two lines to slurm.conf, which seemed to enable
preemption:

> PreemptType=preempt/partition_prio
> PreemptMode=REQUEUE

As far as I understand those 2 lines should enable the plugin (and set the
global default preemption mode for good measure).

For testing the changes we created a smaller queue with only 3 nodes so
that we could start some interactive jobs and watch them be canceled or
requeued as we requested higher-priority workloads. Our issue occurs once
PreemptType is enabled. At first everything seems to be working fine;
however, after some random amount of time all the nodes stop reporting 40+
CPUs and report only a single CPU. This is visible to the admin via
`sinfo --Node --long` and to the users by the fact that only single-CPU jobs
can be requested.

It makes no sense. It's as if the nodes suddenly have only one CPU each.
Even more frustrating, the misbehavior doesn't stop right away when we
revert to the previous configuration.

Big question: Is this an issue anyone has seen before? Any clue what we are
doing wrong or how to further diagnose the problem when it occurs?

At the moment my thoughts for next steps are to turn up Slurm debugging and
to purposefully let the error happen again, but testing on a production
cluster always scares me a little. Any thoughts about which logs to check
and what kind of events to watch for would be greatly appreciated. We are
open to any thoughts or suggestions!
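
To catch the moment the CPU count drops without watching sinfo by hand, I'm
considering polling with something like the sketch below. It assumes that
`sinfo -N -h -o "%N %c"` prints one "<node> <cpus>" pair per line; the
function names and the 40-CPU threshold are my own choices for illustration.

```python
# Sketch: detect nodes whose reported CPU count has dropped.
# Assumes `sinfo -N -h -o "%N %c"` prints "<node> <cpus>" per line.
import subprocess

EXPECTED_MIN_CPUS = 40  # our nodes have 40+ CPUs each


def parse_cpu_counts(sinfo_output):
    """Map node name -> CPU count from '<node> <cpus>' lines."""
    counts = {}
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) != 2 or not fields[1].isdigit():
            continue  # skip blank or unexpected lines
        counts[fields[0]] = int(fields[1])
    return counts


def degraded_nodes(counts, expected_min=EXPECTED_MIN_CPUS):
    """Return sorted names of nodes reporting fewer CPUs than expected."""
    return sorted(n for n, c in counts.items() if c < expected_min)


def read_sinfo():
    """Run sinfo and return its stdout (requires Slurm on PATH)."""
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %c"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Running `degraded_nodes(parse_cpu_counts(read_sinfo()))` from cron and
logging the time of the first non-empty result should at least narrow down
when the change happens, so it can be matched against the slurmctld log.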

I'm also a bit unclear about how the priority calculation is made. I looked
at the values generated and they didn't seem to track the changes to the
queues' priority values. I tried limiting the priority calculation to use
ONLY the partition priority with the additional config options below, but
still didn't get the nice clean calculation I had hoped for.

> PriorityWeightFairshare=0
> PriorityWeightAge=0
> PriorityWeightTRES=0
> PriorityWeightPartition=100000
> PriorityWeightJobSize=0
> PriorityWeightQOS=0
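
For what it's worth, my understanding of the formula in the multifactor docs
is a weighted sum of factors that each lie between 0.0 and 1.0. The sketch
below applies that reading to the weights above. The function name and the
example factor values are made up, it ignores the TRES, nice and site terms,
and I have not confirmed how the partition factor itself is derived from the
partition settings.

```python
# Sketch of the multifactor weighted sum as I read the docs.
# Factor values passed in are illustrative, not measured.

WEIGHTS = {
    "fairshare": 0,
    "age": 0,
    "partition": 100000,
    "jobsize": 0,
    "qos": 0,
}


def job_priority(factors, weights=WEIGHTS):
    """Sum weight * factor over all terms; each factor is in [0.0, 1.0]."""
    total = 0.0
    for name, weight in weights.items():
        factor = factors.get(name, 0.0)
        if not 0.0 <= factor <= 1.0:
            raise ValueError(f"{name} factor out of range: {factor}")
        total += weight * factor
    return int(total)


# With only the partition weight non-zero, other factors drop out.
print(job_priority({"partition": 1.0, "age": 0.7}))
print(job_priority({"partition": 0.5, "age": 0.7}))
```

With these weights I'd expect a job's priority to be simply 100000 times its
partition factor, which is the clean result I was hoping for but didn't see.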

Thanks in advance,
Josh