Hi all,
I am having trouble understanding the difference between, and the precise
definitions of, "core", "task" and "cpu" in the limits attached to a
partition via the TRES settings of a QOS.
More precisely, I have two partitions defined as follows:
PartitionName=lprod
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=lprod_part
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO
   MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=r037c01s[01-12]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=432 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=432,mem=12M,node=12,billing=432
PartitionName=bprod
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=bprod_part
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO
   MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=r037c01s[01-12]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=432 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=432,mem=12M,node=12,billing=432
with the two partition QOS defined as:
Name        Priority  GraceTime  PreemptMode  UsageFactor  MaxJobsPU  MaxSubmitPU  MaxTRES          MinTRES  Flags
----------  --------  ---------  -----------  -----------  ---------  -----------  ---------------  -------  -----------------------------------------------
lprod_part  0         00:00:00   cluster      1.000000     100        120          cpu=32,mem=366G  cpu=1    DenyOnLimit,PartitionMaxNodes,PartitionMinNodes
bprod_part  0         00:00:00   cluster      1.000000     100        120          cpu=64,mem=366G  cpu=33   DenyOnLimit,OverPartQOS
If I submit a job to the lprod partition with the directives:
#SBATCH --nodes=12
#SBATCH --ntasks-per-node=1
the job runs correctly, whereas the same job submitted to the bprod
partition is rejected with the error:
sbatch: error: QOSMinCpuNotSatisfied
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
I understand that this is related to the CPU limits attached to each
partition via its QOS (lprod: 1-32 CPUs, bprod: 33-64 CPUs), but I would
appreciate a more precise explanation: the options I am using refer to
"tasks", not to "cpu", and I could not find a clear definition of "cpu"
in the context of TRES.
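To make my current guess concrete, here is a minimal sketch of the arithmetic I assume is being applied: cpu = nodes x tasks per node x CPUs per task, with --cpus-per-task taken as 1 when not given. The helpers job_cpus and check_qos are hypothetical names of mine, not Slurm commands, and the real count may well differ (for example if whole cores or whole nodes are allocated), which is exactly what I would like confirmed.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Slurm: estimate the "cpu" TRES of a job
# and compare it with the MinTRES/MaxTRES cpu range of a QOS.
# Assumption: one CPU per task unless --cpus-per-task says otherwise.

job_cpus() {
    # $1 = --nodes, $2 = --ntasks-per-node, $3 = --cpus-per-task (default 1)
    echo $(( $1 * $2 * ${3:-1} ))
}

check_qos() {
    # $1 = QOS name, $2 = MinTRES cpu, $3 = MaxTRES cpu, $4 = requested CPUs
    if [ "$4" -lt "$2" ]; then
        echo "$1: $4 cpus < MinTRES cpu=$2 (rejected)"
    elif [ "$4" -gt "$3" ]; then
        echo "$1: $4 cpus > MaxTRES cpu=$3 (rejected)"
    else
        echo "$1: $4 cpus within [$2-$3] (accepted)"
    fi
}

# The job above: --nodes=12 --ntasks-per-node=1
cpus=$(job_cpus 12 1)
check_qos lprod_part 1 32 "$cpus"
check_qos bprod_part 33 64 "$cpus"
```

Under this assumption the job counts as 12 CPUs, which falls inside the lprod range and below the bprod minimum of 33, consistent with the behaviour I observe.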
Many thanks in advance.
Best,
Miriam