[slurm-users] [EXT] Is it possible to set a default QOS per partition?

Sean Crosby scrosby at unimelb.edu.au
Tue Mar 2 07:54:12 UTC 2021


I would have thought a partition QoS is the way to do this. We add a
partition QoS to our partition definitions, and implement usage quotas as well.

PartitionName=physical Nodes=... Default=YES MaxTime=30-0
DefaultTime=0:10:0 State=DOWN QoS=physical
TRESBillingWeights=CPU=1.0,Mem=4.0G

We then define the QoS "physical"

# sacctmgr show qos physical -p
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
physical|0|00:00:00|||cluster|||1.000000|||||||||||cpu=750,mem=9585888M|||cpu=750,mem=9585888M||||
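A QoS like that could be created with sacctmgr along these lines (a
sketch, not our exact provisioning script; the TRES values are the ones
shown in the output above, so adjust them to your cluster's size):

```shell
# Create the "physical" QoS, then cap what any one user or any one
# account can hold at once (60%-of-partition style limits).
sacctmgr add qos physical
sacctmgr modify qos physical set \
    MaxTRESPerUser=cpu=750,mem=9585888M \
    MaxTRESPerAccount=cpu=750,mem=9585888M
```

Once the partition definition carries QoS=physical, those limits apply
to every job in the partition with no --qos needed on the sbatch line.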

We implement quotas using MaxTRESPerUser and MaxTRESPerAccount.

It works really well for us. If you need to override it for a particular
group, you can create another QoS, set the OverPartQOS flag on it, and
have those users specify that QoS.
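A minimal sketch of that override (the QoS name "physical-override", the
account name "theirgroup", and the cpu=1500 cap are all made up here for
illustration):

```shell
# A QoS with Flags=OverPartQOS takes precedence over the partition QoS,
# so this group gets a higher per-user cap than everyone else.
sacctmgr add qos physical-override set Flags=OverPartQOS \
    MaxTRESPerUser=cpu=1500
# Grant it to the group's account; their users then submit with e.g.
#   sbatch -p physical --qos=physical-override script.sh
sacctmgr modify account theirgroup set qos+=physical-override
```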

Sean

--
Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Tue, 2 Mar 2021 at 08:24, Stack Korora <stackkorora at disroot.org> wrote:

>
> Greetings,
>
> We have different node classes that we've set up in different
> partitions. For example, our standard compute nodes are in compute;
> our GPUs are in a gpu partition; and jobs that need to run for months
> go into a long partition with a different set of machines.
>
> For each partition, we have QOS to prevent any single user from
> dominating the resources (set at a max of 60% of resources; not my call
> - it's politics - I'm not going down that rabbit hole...).
>
> Thus, I've got something like this in my slurm.conf (abbreviating to
> save space; sorry if I trim too much).
>
> PartitionName=compute [snip] AllowQOS=compute Default=YES
> PartitionName=gpu [snip] AllowQOS=gpu Default=NO
> PartitionName=long [snip] AllowQOS=long Default=NO
>
> Then I have my QOS configured. And in my `sacctmgr dump cluster | grep
> DefaultQOS` I have "DefaultQOS=compute".
>
> All of that works exactly as expected.
>
> This makes it easy/nice for my users to just do something like:
> $ sbatch -n1 -N1 -p compute script.sh
>
> They don't have to specify the QOS for compute and they like this.
>
> However, for the other partitions they have to do something like this:
> $ sbatch -n1 -N1 -p long --qos=long script.sh
>
> The users don't like this. (though with scripts, I don't see the big
> deal in just adding a new line...but you know... users...)
>
> The request from the users is to make a default QOS for each partition
> thus not needing to specify the QOS for the other partitions.
>
> Because the default is set in the cluster configuration, I'm not sure
> how to do this. And I'm not seeing anything in the documentation for a
> scenario like this.
>
> Question A:
> Anyone know how I can set a default QOS per partition?
>
> Question B:
> Chesterton's fence and all... Is there a better way to accomplish what
> we are attempting to do? I don't want a single QOS to limit across all
> partitions. I need a per partition limit that restricts users to 60% of
> resources in that partition.
>
> Thank you!
> ~Stack~
>
>

