Hi Miriam, 

The definition of "cpu" is fluid: it depends on the hardware and the configuration. If threads are defined, a cpu may refer to a single hardware thread, whereas on hardware configurations without threads it refers to a physical core. See https://slurm.schedmd.com/mc_support.html#defs
Didn't you set MinTRES to cpu=33 on bprod_part? A job asking for 12 tasks (= 12 cpus, since each task gets one cpu by default) therefore has to be rejected, doesn't it?
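
To make the arithmetic explicit, here is a rough sketch in Python. It is only an illustration, not how Slurm computes things internally: the function names are made up, and it assumes one cpu per task unless --cpus-per-task says otherwise. The limits are the MinTRES/MaxTRES cpu values from your sacctmgr output.

```python
def requested_cpus(nodes, ntasks_per_node, cpus_per_task=1):
    """CPUs a job asks for, assuming one cpu per task unless stated otherwise."""
    return nodes * ntasks_per_node * cpus_per_task


def check_qos(cpus, min_cpu, max_cpu):
    """Compare a cpu count against the QOS MinTRES/MaxTRES cpu limits."""
    if cpus < min_cpu:
        return "QOSMinCpuNotSatisfied"
    if cpus > max_cpu:
        return "above MaxTRES cpu"
    return "ok"


# (MinTRES cpu, MaxTRES cpu) per QOS, copied from the sacctmgr output
QOS_LIMITS = {"lprod_part": (1, 32), "bprod_part": (33, 64)}

# --nodes=12 --ntasks-per-node=1
cpus = requested_cpus(nodes=12, ntasks_per_node=1)
for name, (min_cpu, max_cpu) in QOS_LIMITS.items():
    print(name, cpus, check_qos(cpus, min_cpu, max_cpu))
```

With your directives this gives 12 cpus, which fits lprod_part (1 to 32) but is below the minimum of 33 on bprod_part. Under the same assumption, adding --cpus-per-task=3 would bring the request to 36 cpus, which falls inside the bprod_part range. Whether those cpus are counted as threads or cores still depends on your node configuration.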
Best,
Andreas
On 04.12.2024 at 11:18, Miriam Olmi via slurm-users <slurm-users@lists.schedmd.com> wrote:

Hi all,

I cannot work out the exact definitions of "core", "task" and "cpu", and how they differ, in the limits associated with a partition via the TRES settings of a QOS.

More precisely I have 2 partitions defined as follows:


PartitionName=lprod
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=lprod_part
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=r037c01s[01-12]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=432 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=432,mem=12M,node=12,billing=432

PartitionName=bprod
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=bprod_part
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=r037c01s[01-12]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=432 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=432,mem=12M,node=12,billing=432



with the two qos defined as:

                Name   Priority  GraceTime PreemptMode UsageFactor MaxJobsPU MaxSubmitPU              MaxTRES              MinTRES                                              Flags
-------------------- ---------- ---------- ----------- ----------- --------- ----------- -------------------- -------------------- --------------------------------------------------
          lprod_part          0   00:00:00     cluster    1.000000       100         120      cpu=32,mem=366G                cpu=1    DenyOnLimit,PartitionMaxNodes,PartitionMinNodes
          bprod_part          0   00:00:00     cluster    1.000000       100         120      cpu=64,mem=366G               cpu=33                            DenyOnLimit,OverPartQOS



If I try to submit a job on the lprod partition with the directives:

#SBATCH --nodes=12
#SBATCH --ntasks-per-node=1

the job runs correctly, whereas on the bprod partition it is rejected with the error:

sbatch: error: QOSMinCpuNotSatisfied
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)


I understand that this is related to the cpu limits attached to each partition via its QOS: lprod -> [1-32] cpus, bprod -> [33-64] cpus,
but I would like a more precise explanation, since the options I am using refer to "tasks" rather than "cpu", and I could not find
a proper definition of "cpu" in the context of TRES.

Many thanks in advance.

Best,
Miriam

