[slurm-users] Strange error, submission denied
Marcus Wagner
wagner at itc.rwth-aachen.de
Wed Feb 20 08:54:45 UTC 2019
Hi Chris,
I assume you have not set CR_ONE_TASK_PER_CORE.

From the slurm.conf man page:

CR_ONE_TASK_PER_CORE
    Allocate one task per core by default. Without this option, by default
    one task will be allocated per thread on nodes with more than one
    ThreadsPerCore configured. NOTE: This option cannot be used with CR_CPU*.
$> scontrol show config | grep CR_ONE_TASK_PER_CORE
SelectTypeParameters = CR_CORE_MEMORY,CR_ONE_TASK_PER_CORE
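For reference, in slurm.conf this corresponds roughly to the following two
lines (a sketch; the SelectType line is an assumption on my part, the CR_*
parameters belong to the cons_res family of select plugins):

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE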
$> srun --export=all -N 1 --ntasks-per-node=24 hostname | uniq -c
srun: error: CPU count per node can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available
I even reconfigured one node so that there is no difference between the
slurmd -C output and the config.
nodeconfig lnm596:
NodeName=lnm596 CPUs=48 Sockets=2 CoresPerSocket=12
ThreadsPerCore=2 RealMemory=120000
Feature=bwx2650,hostok,hpcwork Weight=10430
State=UNKNOWN
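(For anyone who wants to reproduce the comparison: it is simply the hardware
as slurmd on the node detects it versus what the controller has in its
config, roughly

$> slurmd -C                    # hardware layout as detected on the node
$> scontrol show node lnm596    # configuration as seen by slurmctld

the field order may differ, but the CPU/socket/core/thread counts should
match.)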
The result is still the same.
Seems to be related to the parameter CR_ONE_TASK_PER_CORE
... short testing ...
OK, it IS related to this parameter.
But now Slurm distributes the tasks rather unfortunately across the hosts.
The background is that we wanted to have only one task per core, which is
exactly what CR_ONE_TASK_PER_CORE promises to do.
So, normally, I would let the user request at most half of the number of
CPUs, so a typical job would look like this:
sbatch -p test -n 24 -w lnm596 --wrap "srun --cpu-bind=verbose ./mpitest.sh"
resulting in a job which uses both sockets (good!) but only half of the
cores of each socket, as it uses only the first six cores per socket plus
their hyperthreads:
cpuinfo of the host:
===== Placement on packages =====
Package Id. Core Id. Processors
0 0,1,2,3,4,5,8,9,10,11,12,13
(0,24)(1,25)(2,26)(3,27)(4,28)(5,29)(6,30)(7,31)(8,32)(9,33)(10,34)(11,35)
1 0,1,2,3,4,5,8,9,10,11,12,13
(12,36)(13,37)(14,38)(15,39)(16,40)(17,41)(18,42)(19,43)(20,44)(21,45)(22,46)(23,47)
Output from job:
cpu-bind=MASK - lnm596, task 20 20 [8572]: mask 0x20 set
cpu-bind=MASK - lnm596, task 4 4 [8556]: mask 0x2 set
cpu-bind=MASK - lnm596, task 3 3 [8555]: mask 0x1000000000 set
cpu-bind=MASK - lnm596, task 12 12 [8564]: mask 0x8 set
cpu-bind=MASK - lnm596, task 2 2 [8554]: mask 0x1000000 set
cpu-bind=MASK - lnm596, task 9 9 [8561]: mask 0x4000 set
cpu-bind=MASK - lnm596, task 10 10 [8562]: mask 0x4000000 set
cpu-bind=MASK - lnm596, task 15 15 [8567]: mask 0x8000000000 set
cpu-bind=MASK - lnm596, task 18 18 [8570]: mask 0x10000000 set
cpu-bind=MASK - lnm596, task 7 7 [8559]: mask 0x2000000000 set
cpu-bind=MASK - lnm596, task 1 1 [8553]: mask 0x1000 set
cpu-bind=MASK - lnm596, task 6 6 [8558]: mask 0x2000000 set
cpu-bind=MASK - lnm596, task 8 8 [8560]: mask 0x4 set
cpu-bind=MASK - lnm596, task 14 14 [8566]: mask 0x8000000 set
cpu-bind=MASK - lnm596, task 21 21 [8573]: mask 0x20000 set
cpu-bind=MASK - lnm596, task 5 5 [8557]: mask 0x2000 set
cpu-bind=MASK - lnm596, task 0 0 [8552]: mask 0x1 set
cpu-bind=MASK - lnm596, task 11 11 [8563]: mask 0x4000000000 set
cpu-bind=MASK - lnm596, task 13 13 [8565]: mask 0x8000 set
cpu-bind=MASK - lnm596, task 16 16 [8568]: mask 0x10 set
cpu-bind=MASK - lnm596, task 17 17 [8569]: mask 0x10000 set
cpu-bind=MASK - lnm596, task 19 19 [8571]: mask 0x10000000000 set
cpu-bind=MASK - lnm596, task 22 22 [8574]: mask 0x20000000 set
cpu-bind=MASK - lnm596, task 23 23 [8575]: mask 0x20000000000 set
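(For decoding the masks: each mask is a bit mask over the 48 logical CPUs,
so task 0's mask 0x1 is CPU 0, task 1's mask 0x1000 is bit 12, i.e. CPU 12,
the first core of the second socket, and task 2's mask 0x1000000 is bit 24,
which on this node is the hyperthread sibling of CPU 0.)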
lnm596.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE: <#> unlimited+p1 +pemap 1
lnm596.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE: <#> unlimited+p1 +pemap 14
lnm596.hpc.itc.rwth-aachen.de <12> OMP_STACKSIZE: <#> unlimited+p1 +pemap 3
lnm596.hpc.itc.rwth-aachen.de <14> OMP_STACKSIZE: <#> unlimited+p1 +pemap 27
lnm596.hpc.itc.rwth-aachen.de <15> OMP_STACKSIZE: <#> unlimited+p1 +pemap 39
lnm596.hpc.itc.rwth-aachen.de <16> OMP_STACKSIZE: <#> unlimited+p1 +pemap 4
lnm596.hpc.itc.rwth-aachen.de <18> OMP_STACKSIZE: <#> unlimited+p1 +pemap 28
lnm596.hpc.itc.rwth-aachen.de <21> OMP_STACKSIZE: <#> unlimited+p1 +pemap 17
lnm596.hpc.itc.rwth-aachen.de <22> OMP_STACKSIZE: <#> unlimited+p1 +pemap 29
lnm596.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE: <#> unlimited+p1 +pemap 38
lnm596.hpc.itc.rwth-aachen.de <20> OMP_STACKSIZE: <#> unlimited+p1 +pemap 5
lnm596.hpc.itc.rwth-aachen.de <13> OMP_STACKSIZE: <#> unlimited+p1 +pemap 15
lnm596.hpc.itc.rwth-aachen.de <23> OMP_STACKSIZE: <#> unlimited+p1 +pemap 41
lnm596.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE: <#> unlimited+p1 +pemap 2
lnm596.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#> unlimited+p1 +pemap 26
lnm596.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#> unlimited+p1 +pemap 24
lnm596.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE: <#> unlimited+p1 +pemap 25
lnm596.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#> unlimited+p1 +pemap 0
lnm596.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE: <#> unlimited+p1 +pemap 36
lnm596.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#> unlimited+p1 +pemap 12
lnm596.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE: <#> unlimited+p1 +pemap 13
lnm596.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE: <#> unlimited+p1 +pemap 37
lnm596.hpc.itc.rwth-aachen.de <19> OMP_STACKSIZE: <#> unlimited+p1 +pemap 40
lnm596.hpc.itc.rwth-aachen.de <17> OMP_STACKSIZE: <#> unlimited+p1 +pemap 16
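In other words, the pemaps above are exactly CPUs 0-5 and 12-17 plus their
hyperthread siblings 24-29 and 36-41: only the first six cores of each
socket, with both hardware threads of each of those cores in use, instead of
one task per core across all twelve cores per socket.

Just as a side note, one might try to steer this explicitly by giving each
task a full core, e.g. (untested on our side, both are standard sbatch/srun
options):

sbatch -p test -n 24 -w lnm596 --ntasks-per-core=1 --wrap "srun --cpu-bind=verbose ./mpitest.sh"
sbatch -p test -n 24 -c 2 -w lnm596 --wrap "srun --cpu-bind=verbose ./mpitest.sh"

but that of course should not be necessary with CR_ONE_TASK_PER_CORE set.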
What we wanted to achieve, and what worked very well apart from the
--ntasks-per-node problem, was to schedule by core, putting only one task
onto each core.
The cgroups contain the cores and their hyperthreads, and the task affinity
plugin gives each task one core together with its hyperthread. So we schedule
by core and the user gets the corresponding hyperthread for free. Perfect!
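(For context, a setup like ours typically boils down to roughly the following
lines; a sketch, the exact plugin combination may differ elsewhere:

slurm.conf:   TaskPlugin=task/affinity,task/cgroup
cgroup.conf:  ConstrainCores=yes
)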
Exactly what we wanted (submitted again with the unmodified nodes):
$> sbatch -p test -n 48 --wrap "srun --cpu-bind=verbose ./mpitest.sh"
ncm0400.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE: <#> unlimited+p2
+pemap 27,75
ncm0400.hpc.itc.rwth-aachen.de <27> OMP_STACKSIZE: <#> unlimited+p2
+pemap 39,87
ncm0400.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#> unlimited+p2
+pemap 26,74
ncm0400.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE: <#> unlimited+p2
+pemap 2,50
ncm0400.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE: <#> unlimited+p2
+pemap 29,77
ncm0400.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#> unlimited+p2
+pemap 24,72
ncm0400.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE: <#> unlimited+p2
+pemap 4,52
ncm0400.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#> unlimited+p2
+pemap 0,48
ncm0400.hpc.itc.rwth-aachen.de <44> OMP_STACKSIZE: <#> unlimited+p2
+pemap 20,68
ncm0400.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE: <#> unlimited+p2
+pemap 1,49
ncm0400.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE: <#> unlimited+p2
+pemap 25,73
ncm0400.hpc.itc.rwth-aachen.de <26> OMP_STACKSIZE: <#> unlimited+p2
+pemap 36,84
ncm0400.hpc.itc.rwth-aachen.de <45> OMP_STACKSIZE: <#> unlimited+p2
+pemap 23,71
ncm0400.hpc.itc.rwth-aachen.de <47> OMP_STACKSIZE: <#> unlimited+p2
+pemap 47,95
ncm0400.hpc.itc.rwth-aachen.de <13> OMP_STACKSIZE: <#> unlimited+p2
+pemap 9,57
ncm0400.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#> unlimited+p2
+pemap 3,51
ncm0400.hpc.itc.rwth-aachen.de <24> OMP_STACKSIZE: <#> unlimited+p2
+pemap 12,60
ncm0400.hpc.itc.rwth-aachen.de <32> OMP_STACKSIZE: <#> unlimited+p2
+pemap 14,62
ncm0400.hpc.itc.rwth-aachen.de <12> OMP_STACKSIZE: <#> unlimited+p2
+pemap 6,54
ncm0400.hpc.itc.rwth-aachen.de <25> OMP_STACKSIZE: <#> unlimited+p2
+pemap 15,63
ncm0400.hpc.itc.rwth-aachen.de <22> OMP_STACKSIZE: <#> unlimited+p2
+pemap 32,80
ncm0400.hpc.itc.rwth-aachen.de <34> OMP_STACKSIZE: <#> unlimited+p2
+pemap 38,86
ncm0400.hpc.itc.rwth-aachen.de <46> OMP_STACKSIZE: <#> unlimited+p2
+pemap 44,92
ncm0400.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE: <#> unlimited+p2
+pemap 28,76
ncm0400.hpc.itc.rwth-aachen.de <35> OMP_STACKSIZE: <#> unlimited+p2
+pemap 41,89
ncm0400.hpc.itc.rwth-aachen.de <31> OMP_STACKSIZE: <#> unlimited+p2
+pemap 40,88
ncm0400.hpc.itc.rwth-aachen.de <19> OMP_STACKSIZE: <#> unlimited+p2
+pemap 34,82
ncm0400.hpc.itc.rwth-aachen.de <14> OMP_STACKSIZE: <#> unlimited+p2
+pemap 30,78
ncm0400.hpc.itc.rwth-aachen.de <33> OMP_STACKSIZE: <#> unlimited+p2
+pemap 17,65
ncm0400.hpc.itc.rwth-aachen.de <28> OMP_STACKSIZE: <#> unlimited+p2
+pemap 13,61
ncm0400.hpc.itc.rwth-aachen.de <20> OMP_STACKSIZE: <#> unlimited+p2
+pemap 8,56
ncm0400.hpc.itc.rwth-aachen.de <37> OMP_STACKSIZE: <#> unlimited+p2
+pemap 21,69
ncm0400.hpc.itc.rwth-aachen.de <21> OMP_STACKSIZE: <#> unlimited+p2
+pemap 11,59
ncm0400.hpc.itc.rwth-aachen.de <36> OMP_STACKSIZE: <#> unlimited+p2
+pemap 18,66
ncm0400.hpc.itc.rwth-aachen.de <15> OMP_STACKSIZE: <#> unlimited+p2
+pemap 33,81
ncm0400.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE: <#> unlimited+p2
+pemap 5,53
ncm0400.hpc.itc.rwth-aachen.de <30> OMP_STACKSIZE: <#> unlimited+p2
+pemap 37,85
ncm0400.hpc.itc.rwth-aachen.de <16> OMP_STACKSIZE: <#> unlimited+p2
+pemap 7,55
ncm0400.hpc.itc.rwth-aachen.de <43> OMP_STACKSIZE: <#> unlimited+p2
+pemap 46,94
ncm0400.hpc.itc.rwth-aachen.de <23> OMP_STACKSIZE: <#> unlimited+p2
+pemap 35,83
ncm0400.hpc.itc.rwth-aachen.de <40> OMP_STACKSIZE: <#> unlimited+p2
+pemap 19,67
ncm0400.hpc.itc.rwth-aachen.de <17> OMP_STACKSIZE: <#> unlimited+p2
+pemap 10,58
ncm0400.hpc.itc.rwth-aachen.de <18> OMP_STACKSIZE: <#> unlimited+p2
+pemap 31,79
ncm0400.hpc.itc.rwth-aachen.de <41> OMP_STACKSIZE: <#> unlimited+p2
+pemap 22,70
ncm0400.hpc.itc.rwth-aachen.de <29> OMP_STACKSIZE: <#> unlimited+p2
+pemap 16,64
ncm0400.hpc.itc.rwth-aachen.de <38> OMP_STACKSIZE: <#> unlimited+p2
+pemap 42,90
ncm0400.hpc.itc.rwth-aachen.de <39> OMP_STACKSIZE: <#> unlimited+p2
+pemap 45,93
ncm0400.hpc.itc.rwth-aachen.de <42> OMP_STACKSIZE: <#> unlimited+p2
+pemap 43,91
cpuinfo of this node:
Package Id. Core Id. Processors
0 0,1,2,3,4,5,8,9,10,11,12,13,16,17,18,19,20,21,24,25,26,27,28,29
(0,48)(1,49)(2,50)(3,51)(4,52)(5,53)(6,54)(7,55)(8,56)(9,57)(10,58)(11,59)(12,60)(13,61)(14,62)(15,63)(16,64)(17,65)(18,66)(19,67)(20,68)(21,69)(22,70)(23,71)
1 0,1,2,3,4,5,8,9,10,11,12,13,16,17,18,19,20,21,24,25,26,27,28,29
(24,72)(25,73)(26,74)(27,75)(28,76)(29,77)(30,78)(31,79)(32,80)(33,81)(34,82)(35,83)(36,84)(37,85)(38,86)(39,87)(40,88)(41,89)(42,90)(43,91)(44,92)(45,93)(46,94)(47,95)
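(Note: on this node the hyperthread sibling of CPU k is CPU k+48, so a pemap
like 27,75 is exactly one core plus its own hyperthread: 75 = 27 + 48.)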
Best
Marcus
On 2/20/19 7:49 AM, Chris Samuel wrote:
> On Tuesday, 19 February 2019 10:14:21 PM PST Marcus Wagner wrote:
>
>> sbatch -N 1 --ntasks-per-node=48 --wrap hostname
>> submission denied, got jobid 199805
> On one of our 40 core nodes with 2 hyperthreads:
>
> $ srun -C gpu -N 1 --ntasks-per-node=80 hostname | uniq -c
> 80 nodename02
>
> The spec is:
>
> CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2
>
> Hope this helps!
>
> All the best,
> Chris
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de