Hi all,
I cannot work out the exact definitions of "core", "task" and "cpu",
and how they differ, in the limits attached to a partition through
the TRES settings of a QOS.
More precisely, I have two partitions defined as follows:
PartitionName=lprod
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=lprod_part
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO
GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO
MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=r037c01s[01-12]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=432 TotalNodes=12 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=432,mem=12M,node=12,billing=432
PartitionName=bprod
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=bprod_part
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO
GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO
MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=r037c01s[01-12]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=432 TotalNodes=12 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=432,mem=12M,node=12,billing=432
with the two qos defined as:
Name        Priority  GraceTime  PreemptMode  UsageFactor  MaxJobsPU  MaxSubmitPU  MaxTRES          MinTRES  Flags
----------  --------  ---------  -----------  -----------  ---------  -----------  ---------------  -------  -----------------------------------------------
lprod_part  0         00:00:00   cluster      1.000000     100        120          cpu=32,mem=366G  cpu=1    DenyOnLimit,PartitionMaxNodes,PartitionMinNodes
bprod_part  0         00:00:00   cluster      1.000000     100        120          cpu=64,mem=366G  cpu=33   DenyOnLimit,OverPartQOS
If I submit a job to the lprod partition with the directives:
#SBATCH --nodes=12
#SBATCH --ntasks-per-node=1
the job runs correctly, whereas the same job submitted to the bprod
partition is rejected with the error:
sbatch: error: QOSMinCpuNotSatisfied
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
I understand that this comes from the CPU limits attached to each
partition through its QOS (lprod: 1 to 32 CPUs, bprod: 33 to 64 CPUs),
but I would like a more precise explanation: the options I am using
refer to "tasks", not to "cpu", and I could not find a proper
definition of "cpu" in the context of TRES.
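To make my current reading concrete, here is a small Python sketch of
the check I assume is being applied. The formula (nodes x tasks per
node x CPUs per task, with one CPU per task when --cpus-per-task is not
given) is my assumption about how the "cpu" TRES is counted, and the
function names are made up for illustration:

```python
# ASSUMPTION: the "cpu" TRES of a job is counted as
#   nodes * ntasks_per_node * cpus_per_task
# with cpus_per_task = 1 when --cpus-per-task is not given.
def requested_cpus(nodes, ntasks_per_node, cpus_per_task=1):
    return nodes * ntasks_per_node * cpus_per_task

# cpu bounds copied from the MinTRES / MaxTRES columns of the two QOS
QOS_CPU_LIMITS = {
    "lprod_part": (1, 32),
    "bprod_part": (33, 64),
}

def within_limits(qos, cpus):
    # True when the CPU count lies inside the [MinTRES, MaxTRES] range
    low, high = QOS_CPU_LIMITS[qos]
    return low <= cpus <= high

# The job from my directives: --nodes=12 --ntasks-per-node=1
cpus = requested_cpus(nodes=12, ntasks_per_node=1)
for qos in QOS_CPU_LIMITS:
    print(qos, cpus, within_limits(qos, cpus))
```

With my directives this gives 12 CPUs, which is inside the lprod range
but below the bprod minimum of 33, matching the QOSMinCpuNotSatisfied
error. Under the same assumption, adding --cpus-per-task=3 would request
36 CPUs, which would fall inside the bprod range.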
Many thanks in advance.
Best,
Miriam