Hi all,
I am struggling to understand how "core", "task" and "cpu" are defined, and how they differ, in the limits applied to a partition through the TRES settings of a QOS.
More precisely, I have two partitions defined as follows:
PartitionName=lprod
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=lprod_part
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=r037c01s[01-12]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=432 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=432,mem=12M,node=12,billing=432

PartitionName=bprod
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=bprod_part
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=r037c01s[01-12]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=432 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=432,mem=12M,node=12,billing=432
with the two QOS defined as:
Name       Priority GraceTime PreemptMode UsageFactor MaxJobsPU MaxSubmitPU MaxTRES         MinTRES Flags
---------- -------- --------- ----------- ----------- --------- ----------- --------------- ------- -----------------------------------------------
lprod_part 0        00:00:00  cluster     1.000000    100       120         cpu=32,mem=366G cpu=1   DenyOnLimit,PartitionMaxNodes,PartitionMinNodes
bprod_part 0        00:00:00  cluster     1.000000    100       120         cpu=64,mem=366G cpu=33  DenyOnLimit,OverPartQOS
If I try to submit a job on the lprod partition with the directives:
#SBATCH --nodes=12
#SBATCH --ntasks-per-node=1
the job runs correctly, but the same job is rejected when submitted to the bprod partition, with the error:
sbatch: error: QOSMinCpuNotSatisfied
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
I understand that this is related to the CPU limits attached to each partition through its QOS (lprod: 1-32 CPUs, bprod: 33-64 CPUs), but I would like a more precise explanation: the options I am using refer to "tasks", not to "cpu", and I could not find a proper definition of "cpu" in the context of TRES.
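To make my current reading concrete, here is a small sketch of the arithmetic I think is being applied. It is only my guess and may well be wrong: I am assuming that each task is counted as one CPU when --cpus-per-task is not given, and that the "cpu" TRES is simply nodes x tasks per node x CPUs per task. The variable and function names are mine, purely for illustration, and are not Slurm commands.

```shell
#!/bin/bash
# Hypothetical sketch of how I guess the job size is compared with the QOS limits.
# Assumption: one CPU per task when --cpus-per-task is not specified.
nodes=12             # from --nodes=12
ntasks_per_node=1    # from --ntasks-per-node=1
cpus_per_task=1      # assumed default

total_cpus=$(( nodes * ntasks_per_node * cpus_per_task ))
echo "requested cpus: ${total_cpus}"

# Compare the CPU count with a QOS MinTRES/MaxTRES cpu range (illustrative helper).
check_qos() {
    local name=$1 min=$2 max=$3
    if (( total_cpus < min )); then
        echo "${name}: below MinTRES cpu=${min}"
    elif (( total_cpus > max )); then
        echo "${name}: above MaxTRES cpu=${max}"
    else
        echo "${name}: within cpu=[${min},${max}]"
    fi
}

lprod_result=$(check_qos lprod_part 1 32)
bprod_result=$(check_qos bprod_part 33 64)
echo "${lprod_result}"
echo "${bprod_result}"
```

If this reading is right, my job counts as 12 CPUs, which fits lprod_part but falls below the cpu=33 minimum of bprod_part. What I cannot tell is whether the count could differ when CPUs are allocated per core or per hardware thread.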
Many thanks in advance.
Best, Miriam