Hi Andreas,
many thanks for your reply and for the link to the definitions page
of the Slurm documentation!
Regarding my example, I still have a question: why do you assume
that tasks=cpus?
From the definition of cpu in the documentation I understand that
cpus might refer to threads, since I defined threads in the config
file of my cluster, but I cannot see the reason for the
association cpus=tasks.
Can you help me with this?
Many thanks in advance!
Best,
Miriam
On 04/12/24 18:28, Henkel, Andreas via slurm-users wrote:
Hi Miriam,
The definition of "cpu" is fluid: it depends on hardware and
configuration. If threads are defined, a cpu may correspond to one
thread, whereas on hardware configurations without threads it
refers to a physical core.
https://slurm.schedmd.com/mc_support.html#defs
Didn't you set MinTRES to cpu=33 for bprod_part? A job asking for
12 tasks (= cpus) therefore has to be rejected, doesn't it?
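A rough sketch of the arithmetic behind this rejection, in Python. It assumes the sbatch default of one CPU per task when --cpus-per-task is not given, which is where tasks = cpus comes from. The helper name requested_cpus is made up for illustration; this is not how Slurm itself evaluates the limit, and the CPU count finally allocated can differ depending on the node configuration.

```python
def requested_cpus(nodes, ntasks_per_node, cpus_per_task=1):
    """CPU count implied by a request (hypothetical helper).

    Assumption: without --cpus-per-task, each task counts as one CPU.
    """
    return nodes * ntasks_per_node * cpus_per_task


# The job from the original post: --nodes=12 --ntasks-per-node=1
cpus = requested_cpus(nodes=12, ntasks_per_node=1)

# MinTRES cpu=33 is the value shown for the bprod_part QOS.
min_cpu_bprod = 33

print(cpus)                   # 12
print(cpus >= min_cpu_bprod)  # False: below the minimum, so the job is denied
```

Under the same assumption, a request such as --nodes=12 --ntasks-per-node=3 would count as 36 cpus and fall inside the bprod range.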
Best,
Andreas
Hi all,
I cannot work out the exact definitions of, and the differences
between, "core", "task" and "cpu" as they apply to the limits
attached to a partition through the TRES settings of a QOS.
More precisely I have 2 partitions defined as follows:
PartitionName=lprod
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=lprod_part
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO
   MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=r037c01s[01-12]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=432 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=432,mem=12M,node=12,billing=432

PartitionName=bprod
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=bprod_part
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO
   MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=r037c01s[01-12]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=432 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=432,mem=12M,node=12,billing=432
with the two QOS defined as:

Name        Priority  GraceTime  PreemptMode  UsageFactor  MaxJobsPU  MaxSubmitPU  MaxTRES          MinTRES  Flags
----------  --------  ---------  -----------  -----------  ---------  -----------  ---------------  -------  ------------------------------------------------
lprod_part  0         00:00:00   cluster      1.000000     100        120          cpu=32,mem=366G  cpu=1    DenyOnLimit,PartitionMaxNodes,PartitionMinNodes
bprod_part  0         00:00:00   cluster      1.000000     100        120          cpu=64,mem=366G  cpu=33   DenyOnLimit,OverPartQOS

If I try to submit a job on the lprod partition with the
directives:
#SBATCH --nodes=12
#SBATCH --ntasks-per-node=1
the job runs correctly, whereas the same job submitted to the
bprod partition is rejected with the error:
sbatch: error: QOSMinCpuNotSatisfied
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
I understand that this is related to the cpu limits attached to
each partition through its QOS:
lprod -> [1-32] cpus, bprod -> [33-64] cpus
However, I would like a more precise explanation, since the
options I am using refer to "tasks" rather than "cpu", and I
could not find a proper definition of "cpu" in the context of
TRES.
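The two ranges can be written down as a small Python sketch built from the MaxTRES and MinTRES strings shown above. The names parse_tres, QOS_LIMITS and cpu_in_range are hypothetical, and the check models only the cpu entry; it is an illustration of the ranges, not Slurm's enforcement logic.

```python
def parse_tres(spec):
    """Turn a TRES string such as 'cpu=32,mem=366G' into a dict of strings."""
    return dict(item.split("=", 1) for item in spec.split(",") if item)


# Limits copied from the QOS listing above (MinTRES / MaxTRES columns).
QOS_LIMITS = {
    "lprod_part": {"min": parse_tres("cpu=1"), "max": parse_tres("cpu=32,mem=366G")},
    "bprod_part": {"min": parse_tres("cpu=33"), "max": parse_tres("cpu=64,mem=366G")},
}


def cpu_in_range(qos, cpus):
    """True if a job counted as `cpus` CPUs lies within the QOS cpu limits."""
    limits = QOS_LIMITS[qos]
    return int(limits["min"]["cpu"]) <= cpus <= int(limits["max"]["cpu"])


# A job counted as 12 cpus (12 nodes x 1 task, one cpu per task assumed).
for qos in QOS_LIMITS:
    print(qos, cpu_in_range(qos, 12))
# lprod_part True
# bprod_part False
```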
Many thanks in advance.
Best,
Miriam
--
*******************************************************
Miriam Olmi
Computing & Network Service
Laboratori Nazionali del Gran Sasso - INFN
Via G. Acitelli, 22
67100 Assergi (AQ) Italy
https://www.lngs.infn.it
✉ email: miriam.olmi@lngs.infn.it
☎ office: +39 0862 437222
*******************************************************