[slurm-users] Strange error, submission denied
Marcus Wagner
wagner at itc.rwth-aachen.de
Thu Feb 21 06:17:08 UTC 2019
Hi Andreas,
I'll try to sum this up ;)
First of all, I have now used a Broadwell node, so there is no interference
with Skylake and Sub-NUMA Clustering.
We are using Slurm 18.08.5-2.
I have configured the node as slurmd -C tells me:
NodeName=lnm596 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2
RealMemory=120000 Feature=bwx2650,hostok,hpcwork
Weight=10430 State=UNKNOWN
This is what slurmctld knows about the node:
NodeName=lnm596 Arch=x86_64 CoresPerSocket=12
CPUAlloc=0 CPUTot=48 CPULoad=0.03
AvailableFeatures=bwx2650,hostok,hpcwork
ActiveFeatures=bwx2650,hostok,hpcwork
Gres=(null)
GresDrain=N/A
GresUsed=gpu:0
NodeAddr=lnm596 NodeHostName=lnm596 Version=18.08
OS=Linux 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019
RealMemory=120000 AllocMem=0 FreeMem=125507 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=10430 Owner=N/A
MCS_label=N/A
Partitions=future
BootTime=2019-02-19T07:43:33 SlurmdStartTime=2019-02-20T12:08:54
CfgTRES=cpu=48,mem=120000M,billing=48
AllocTRES=
CapWatts=n/a
CurrentWatts=120 LowestJoules=714879 ConsumedJoules=8059263
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
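(For reference, the two views above can be compared directly on the node and
from the controller; the node name is the one from above:)

slurmd -C                    # what slurmd detects on the node itself
scontrol show node lnm596    # what slurmctld has registered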
Let's first begin with half of the node:
--ntasks=12 -> 12 CPUs requested. I implicitly get the hyperthreads for free
(apart from the accounting).
NumNodes=1 NumCPUs=24 NumTasks=12 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=24,mem=120000M,energy=46,node=1,billing=24
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryNode=120000M MinTmpDiskNode=0
--ntasks=12 --cpus-per-task=2 -> 24 CPUs requested. I have now explicitly
asked for 24 CPUs.
NumNodes=1 NumCPUs=24 NumTasks=12 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
TRES=cpu=24,mem=120000M,energy=55,node=1,billing=24
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=2 MinMemoryNode=120000M MinTmpDiskNode=0
--ntasks=12 --ntasks-per-node=12 --cpus-per-task=2 -> 24 CPUs requested.
Additional constraint: all 12 tasks should be on one node. Here, too, I
asked for 24 CPUs.
NumNodes=1 NumCPUs=24 NumTasks=12 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
TRES=cpu=24,mem=120000M,energy=55,node=1,billing=24
Socks/Node=* NtasksPerN:B:S:C=12:0:*:1 CoreSpec=*
MinCPUsNode=24 MinMemoryNode=120000M MinTmpDiskNode=0
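For completeness, the three half-node submissions correspond to command lines
roughly like these (with --wrap hostname standing in for the real job script,
as in Chris's example below); the job layouts above come from scontrol:

sbatch --ntasks=12 --wrap hostname
sbatch --ntasks=12 --cpus-per-task=2 --wrap hostname
sbatch --ntasks=12 --ntasks-per-node=12 --cpus-per-task=2 --wrap hostname
scontrol show job <jobid>    # prints the NumCPUs/TRES lines quoted above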
Everything is fine up to now. Now I'll try to use the full node:
--ntasks=24 -> 24 CPUs requested, implicitly got 48.
NumNodes=1 NumCPUs=48 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,mem=120000M,energy=62,node=1,billing=48
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryNode=120000M MinTmpDiskNode=0
--ntasks=24 --cpus-per-task=2 -> 48 CPUs explicitly requested.
NumNodes=1 NumCPUs=48 NumTasks=24 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,mem=120000M,energy=62,node=1,billing=48
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=2 MinMemoryNode=120000M MinTmpDiskNode=0
And now the funny thing that I don't understand:
--ntasks=24 --ntasks-per-node=24 --cpus-per-task=2 -> 48 CPUs requested,
all 24 tasks on one node. Slurm tells me:
sbatch: error: Batch job submission failed: Requested node configuration
is not available
I would have expected the following job, which would have fit onto the node:
NumNodes=1 NumCPUs=48 NumTasks=24 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,mem=120000M,energy=62,node=1,billing=48
Socks/Node=* NtasksPerN:B:S:C=24:0:*:1 CoreSpec=*
MinCPUsNode=48 MinMemoryNode=120000M MinTmpDiskNode=0
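Spelled out, the failing submission looks roughly like this (again with
--wrap hostname as a stand-in for the real script):

sbatch -vvv --ntasks=24 --ntasks-per-node=24 --cpus-per-task=2 --wrap hostname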
Part of the sbatch -vvv output:
sbatch: ntasks : 24 (set)
sbatch: cpus_per_task : 2
sbatch: nodes : 1 (set)
sbatch: sockets-per-node : -2
sbatch: cores-per-socket : -2
sbatch: threads-per-core : -2
sbatch: ntasks-per-node : 24
sbatch: ntasks-per-socket : -2
sbatch: ntasks-per-core : -2
So, again, I see 24 tasks per node, 2 CPUs per task and 1 node. That is
altogether 48 CPUs on one node, which fits perfectly, as one can see from
the last two examples. In other words: 24 tasks per node, 2 CPUs per task,
1 node still makes 48 CPUs by my count.
I am just asking explicitly for what Slurm already gives me implicitly. Or
have I misunderstood something?
We will have to look into this further internally. It might be that we have
to give up CR_ONE_TASK_PER_CORE.
Best
Marcus
P.S.:
Sorry for the lengthy post
On 2/20/19 11:59 AM, Henkel wrote:
> Hi Chris,
> Hi Marcus,
>
> Just want to understand the cause, too. I'll try to sum it up.
>
> Chris you have
>
> CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2
>
> and
>
> srun -C gpu -N 1 --ntasks-per-node=80 hostname
>
> works.
>
> Marcus has configured
>
> CPUs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=2
> (slurmd -C says CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12
> ThreadsPerCore=2)
>
> and
>
> CR_ONE_TASK_PER_CORE
>
> and
>
> srun -n 48 WORKS
>
> srun -N 1 --ntasks-per-node=48 DOESN'T WORK.
>
> I'm not sure if it's caused by CR_ONE_TASK_PER_CORE, but at least that's
> one of the major differences. I'm wondering if the effort to force using
> only physical cores is doubled up by removing the 48 threads AND setting
> CR_ONE_TASK_PER_CORE. My impression is that with CR_ONE_TASK_PER_CORE,
> ntasks-per-node accounts for threads (you have set ThreadsPerCore=2),
> hence only 24 may work, but CR_ONE_TASK_PER_CORE doesn't affect the
> selection of 'cores only' with ntasks.
>
> We don't use CR_ONE_TASK_PER_CORE, but our users either set -c 2 or
> --hint=nomultithread, which results in core-only allocation.
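> For example (a sketch; ./app is just a placeholder for the actual binary):
>
> srun -n 24 -c 2 ./app
> srun -n 24 --hint=nomultithread ./app
>
> Both end up with one task per physical core on a node with two hardware
> threads per core.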
>
> You could also enforce this with a job-submit-plugin or lua-plugin.
>
> The fact that CR_ONE_TASK_PER_CORE is described as "under change" in
> the public bugs, and that there is a non-accessible bug about this,
> probably suggests it is better not to use it unless you have to.
>
> Best,
>
> Andreas
>
> On 2/20/19 7:49 AM, Chris Samuel wrote:
>> On Tuesday, 19 February 2019 10:14:21 PM PST Marcus Wagner wrote:
>>
>>> sbatch -N 1 --ntasks-per-node=48 --wrap hostname
>>> submission denied, got jobid 199805
>> On one of our 40 core nodes with 2 hyperthreads:
>>
>> $ srun -C gpu -N 1 --ntasks-per-node=80 hostname | uniq -c
>> 80 nodename02
>>
>> The spec is:
>>
>> CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2
>>
>> Hope this helps!
>>
>> All the best,
>> Chris
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de