<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hi Marcus, <br>
</p>
<p>For us, slurmd -C as well as numactl -H looked fine, too. But
we use task/cgroup only, and every job starting on a Skylake
node gave us this error: <br>
</p>
<pre class="code highlight"><code><span id="LC770" class="line" lang="c"><span class="n">error</span><span class="p">(</span><span class="s">"task/cgroup: task[%u] infinite loop broken while trying "</span></span>
<span id="LC771" class="line" lang="c"> <span class="s">"to provision compute elements using %s (bitmap:%s)"</span><span class="p">,
</span></span></code></pre>
<p>This comes from src/plugins/task/cgroup/task_cgroup_cpuset.c, and the process
placement was wrong. <br>
</p>
<p>Once we deactivated subnuma, everything ran fine. <br>
</p>
<p>For completeness: I tested that on Slurm 17 (and the core may
have been partly 16 at that time). We're using Slurm 17.11.13 and
I'll check the behavior there in the next few days. <br>
I'm hesitant to switch to 18 because of the bugs that have
appeared with every minor release. <br>
</p>
<p>Best, <br>
</p>
<p>Andreas</p>
<p><br>
</p>
<div class="moz-cite-prefix">On 14.02.19 12:54, Marcus Wagner wrote:<br>
</div>
<blockquote type="cite"
cite="mid:f67db9cc-d248-9aa3-22bf-a0bf69401bcc@itc.rwth-aachen.de">Hi
Andreas,
<br>
<br>
<br>
as slurmd -C shows, it detects 4 NUMA nodes and treats them as
sockets. This is also how we configured Slurm.
<br>
<br>
numactl -H clearly shows the four domains and which belongs to
which socket:
<br>
<br>
node distances:
<br>
node 0 1 2 3
<br>
0: 10 11 21 21
<br>
1: 11 10 21 21
<br>
2: 21 21 10 11
<br>
3: 21 21 11 10
<br>
<br>
<br>
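A minimal sketch, assuming that a distance of at most 11 in the matrix above
means "same socket" (that threshold is read off this one matrix and is not a
general rule): the Python below groups the four NUMA domains into sockets.
The names group_by_socket and SAME_SOCKET_MAX are made up for illustration.
<br>
<br>

```python
# Distance matrix as printed by `numactl -H` in this message.
DISTANCES = [
    [10, 11, 21, 21],
    [11, 10, 21, 21],
    [21, 21, 10, 11],
    [21, 21, 11, 10],
]

# Assumption: distances up to 11 mean "same socket" on this machine only.
SAME_SOCKET_MAX = 11


def group_by_socket(distances, same_socket_max=SAME_SOCKET_MAX):
    """Group NUMA node indexes whose mutual distances stay within the threshold."""
    groups = []
    for node, row in enumerate(distances):
        for group in groups:
            # Join the first group in which every member is close enough.
            if not any(row[other] > same_socket_max for other in group):
                group.append(node)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append([node])
    return groups


if __name__ == "__main__":
    print(group_by_socket(DISTANCES))
```

<br>
For this matrix the sketch yields the groups [0, 1] and [2, 3], i.e. two NUMA
domains per physical socket.
<br>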
hwloc shows much the same:
<br>
<br>
$> hwloc-distances
<br>
Relative latency matrix between 4 NUMANodes (depth 3) by logical
indexes (below Machine L#0):
<br>
index 0 1 2 3
<br>
0 1.000 1.100 2.100 2.100
<br>
1 1.100 1.000 2.100 2.100
<br>
2 2.100 2.100 1.000 1.100
<br>
3 2.100 2.100 1.100 1.000
<br>
<br>
We use the task/affinity plugin together with task/cgroup, but
set affinity to off in cgroup.conf, so that the task/affinity
plugin does the magic.
<br>
We also see that Slurm, configured that way, does a round robin over the
NUMA nodes by default (12 tasks on a 48-core machine):
<br>
<br>
ncm0071.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 0,48
<br>
ncm0071.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 3,51
<br>
ncm0071.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 24,72
<br>
ncm0071.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 27,75
<br>
ncm0071.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 1,49
<br>
ncm0071.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 4,52
<br>
ncm0071.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 25,73
<br>
ncm0071.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 28,76
<br>
ncm0071.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 2,50
<br>
ncm0071.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 5,53
<br>
ncm0071.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 26,74
<br>
ncm0071.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 29,77
<br>
<br>
<br>
Using #SBATCH -m block:block places all tasks on one NUMA node:
<br>
<br>
ncm0071.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 0,48
<br>
ncm0071.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 1,49
<br>
ncm0071.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 2,50
<br>
ncm0071.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 6,54
<br>
ncm0071.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 7,55
<br>
ncm0071.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 8,56
<br>
ncm0071.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 12,60
<br>
ncm0071.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 13,61
<br>
ncm0071.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 14,62
<br>
ncm0071.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 18,66
<br>
ncm0071.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 19,67
<br>
ncm0071.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE: <#>
unlimited+p2 +pemap 20,68
<br>
<br>
<br>
Isn't that what is needed, or am I missing something? What
would be "better" with hwloc2?
<br>
<br>
<br>
Besides my original problem, we are fairly happy with slurm so
far, but that one gives me grey hair :/
<br>
<br>
<br>
Best
<br>
Marcus
<br>
<br>
<br>
On 2/14/19 11:27 AM, Henkel, Andreas wrote:
<br>
<blockquote type="cite">Hi Marcus,
<br>
<br>
We have Skylake too, and it didn’t work for us. We used cgroups
only, and process binding went completely haywire with subnuma
enabled.
<br>
While searching for solutions I found that hwloc supports
subnuma only from version 2 on (when searching for Skylake in
hwloc you get hits in the version 2 branches only). At least
hwloc 2.x makes NUMA blocks child objects, whereas hwloc 1.x
has NUMA blocks as parents only. I think that was the reason why
there was a special branch in hwloc for handling subnuma layouts
of Xeon Phi.
<br>
But I’ll be happy if you prove me wrong.
<br>
<br>
Best,
<br>
Andreas
<br>
<br>
<blockquote type="cite">On 14.02.2019 at 09:32, Marcus
Wagner <a class="moz-txt-link-rfc2396E" href="mailto:wagner@itc.rwth-aachen.de"><wagner@itc.rwth-aachen.de></a> wrote:
<br>
<br>
Hi Andreas,
<br>
<br>
<br>
<br>
<blockquote type="cite">On 2/14/19 8:56 AM, Henkel, Andreas
wrote:
<br>
Hi Marcus,
<br>
<br>
More ideas:
<br>
A CPU doesn’t always count as a core but may mean
a single thread, which can make a difference.
<br>
Maybe the behavior of CR_ONE_TASK is still neither solid nor
properly documented, and ntasks and ntasks-per-node are
honored differently internally. If so, using ntasks alone can
mean that Slurm uses all threads, even if the binding itself is
correct.
<br>
Obviously in your results Slurm handles the options
differently.
<br>
<br>
Have you tried configuring the node with cpus=96? What
output do you get from slurmd -C?
<br>
</blockquote>
Not yet, as this is not the desired behaviour. We want to
schedule by cores. But I will try that. slurmd -C output is
the following:
<br>
<br>
NodeName=ncm0708 slurmd: Considering each NUMA node as a
socket
<br>
CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12
ThreadsPerCore=2 RealMemory=191905
<br>
UpTime=6-21:30:02
<br>
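As a cross-check of the line above, here is a minimal sketch in Python that
multiplies out the topology fields. It only does the arithmetic on the values
reported by slurmd -C and says nothing about how Slurm itself counts CPUs; the
helper names parse_fields and core_and_thread_counts are hypothetical.
<br>
<br>

```python
import re

# The line reported by `slurmd -C` in this message.
SLURMD_LINE = (
    "CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 "
    "ThreadsPerCore=2 RealMemory=191905"
)


def parse_fields(line):
    """Return the numeric KEY=VALUE pairs of the line as a dict of integers."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


def core_and_thread_counts(fields):
    """Derive physical cores and hardware threads from the topology fields."""
    cores = fields["Boards"] * fields["SocketsPerBoard"] * fields["CoresPerSocket"]
    return cores, cores * fields["ThreadsPerCore"]


if __name__ == "__main__":
    fields = parse_fields(SLURMD_LINE)
    cores, threads = core_and_thread_counts(fields)
    # The thread count should equal the CPUs= value on the same line.
    print(cores, threads, threads == fields["CPUs"])
```

<br>
With the values above this gives 48 cores and 96 hardware threads, and the
latter matches CPUs=96.
<br>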
<br>
<blockquote type="cite">Is this a new architecture like
skylake? In case of subnuma-Layouts Slurm can not handle it
without hwloc2.
<br>
</blockquote>
Yes, we have Skylake and as you can see in the above output,
we have subnuma-clustering enabled. Still, we only use hwloc
coming with CentOS 7: hwloc-1.11.8-4.el7.x86_64
<br>
Where did you get the information, that hwloc2 is needed?
<br>
<blockquote type="cite">Have you tried to use srun -v(vv)
instead of sbatch? Maybe you can get a glimpse of what Slurm
actually does with your options.
<br>
</blockquote>
The only strange thing I can observe is the following:
<br>
srun: threads : 60
<br>
<br>
What threads is srun talking about there?
<br>
Nonetheless, here is the full output:
<br>
<br>
$> srun --ntasks=48 --ntasks-per-node=48 -vvv hostname
<br>
srun: defined options for program `srun'
<br>
srun: --------------- ---------------------
<br>
srun: user : `mw445520'
<br>
srun: uid : 40574
<br>
srun: gid : 40574
<br>
srun: cwd :
/rwthfs/rz/cluster/home/mw445520/tests/slurm/cgroup
<br>
srun: ntasks : 48 (set)
<br>
srun: nodes : 1 (default)
<br>
srun: jobid : 4294967294 (default)
<br>
srun: partition : default
<br>
srun: profile : `NotSet'
<br>
srun: job name : `hostname'
<br>
srun: reservation : `(null)'
<br>
srun: burst_buffer : `(null)'
<br>
srun: wckey : `(null)'
<br>
srun: cpu_freq_min : 4294967294
<br>
srun: cpu_freq_max : 4294967294
<br>
srun: cpu_freq_gov : 4294967294
<br>
srun: switches : -1
<br>
srun: wait-for-switches : -1
<br>
srun: distribution : unknown
<br>
srun: cpu-bind : default (0)
<br>
srun: mem-bind : default (0)
<br>
srun: verbose : 3
<br>
srun: slurmd_debug : 0
<br>
srun: immediate : false
<br>
srun: label output : false
<br>
srun: unbuffered IO : false
<br>
srun: overcommit : false
<br>
srun: threads : 60
<br>
srun: checkpoint_dir : /w0/slurm/checkpoint
<br>
srun: wait : 0
<br>
srun: nice : -2
<br>
srun: account : (null)
<br>
srun: comment : (null)
<br>
srun: dependency : (null)
<br>
srun: exclusive : false
<br>
srun: bcast : false
<br>
srun: qos : (null)
<br>
srun: constraints :
<br>
srun: reboot : yes
<br>
srun: preserve_env : false
<br>
srun: network : (null)
<br>
srun: propagate : NONE
<br>
srun: prolog : (null)
<br>
srun: epilog : (null)
<br>
srun: mail_type : NONE
<br>
srun: mail_user : (null)
<br>
srun: task_prolog : (null)
<br>
srun: task_epilog : (null)
<br>
srun: multi_prog : no
<br>
srun: sockets-per-node : -2
<br>
srun: cores-per-socket : -2
<br>
srun: threads-per-core : -2
<br>
srun: ntasks-per-node : 48
<br>
srun: ntasks-per-socket : -2
<br>
srun: ntasks-per-core : -2
<br>
srun: plane_size : 4294967294
<br>
srun: core-spec : NA
<br>
srun: power :
<br>
srun: cpus-per-gpu : 0
<br>
srun: gpus : (null)
<br>
srun: gpu-bind : (null)
<br>
srun: gpu-freq : (null)
<br>
srun: gpus-per-node : (null)
<br>
srun: gpus-per-socket : (null)
<br>
srun: gpus-per-task : (null)
<br>
srun: mem-per-gpu : 0
<br>
srun: remote command : `hostname'
<br>
srun: debug: propagating SLURM_PRIO_PROCESS=0
<br>
srun: debug: propagating UMASK=0007
<br>
srun: debug2: srun PMI messages to port=34521
<br>
srun: debug: Entering slurm_allocation_msg_thr_create()
<br>
srun: debug: port from net_stream_listen is 35465
<br>
srun: debug: Entering _msg_thr_internal
<br>
srun: debug: Munge authentication plugin loaded
<br>
srun: error: CPU count per node can not be satisfied
<br>
srun: error: Unable to allocate resources: Requested node
configuration is not available
<br>
<br>
<br>
<br>
Best
<br>
Marcus
<br>
<br>
<br>
<blockquote type="cite">Best,
<br>
Andreas
<br>
<br>
<br>
<blockquote type="cite">On 14.02.2019 at 08:34, Marcus
Wagner <a class="moz-txt-link-rfc2396E" href="mailto:wagner@itc.rwth-aachen.de"><wagner@itc.rwth-aachen.de></a> wrote:
<br>
<br>
Hi Chris,
<br>
<br>
<br>
these are 96-thread nodes with 48 cores. You are right
that if we set it to 24, the job gets scheduled. But
then only half of the node is used. On the other hand, if
I only use --ntasks=48, Slurm schedules all tasks onto the
same node. The hyperthread of each core is included in the
cgroup, and the task/affinity plugin also correctly binds
the hyperthread together with the core (output of a small, ugly
test script of ours; the last two numbers are the core and
its hyperthread):
<br>
<br>
ncm0728.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 0,48
<br>
ncm0728.hpc.itc.rwth-aachen.de <10> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 26,74
<br>
ncm0728.hpc.itc.rwth-aachen.de <11> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 29,77
<br>
ncm0728.hpc.itc.rwth-aachen.de <12> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 6,54
<br>
ncm0728.hpc.itc.rwth-aachen.de <13> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 9,57
<br>
ncm0728.hpc.itc.rwth-aachen.de <14> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 30,78
<br>
ncm0728.hpc.itc.rwth-aachen.de <15> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 33,81
<br>
ncm0728.hpc.itc.rwth-aachen.de <16> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 7,55
<br>
ncm0728.hpc.itc.rwth-aachen.de <17> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 10,58
<br>
ncm0728.hpc.itc.rwth-aachen.de <18> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 31,79
<br>
ncm0728.hpc.itc.rwth-aachen.de <19> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 34,82
<br>
ncm0728.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 3,51
<br>
ncm0728.hpc.itc.rwth-aachen.de <20> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 8,56
<br>
ncm0728.hpc.itc.rwth-aachen.de <21> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 11,59
<br>
ncm0728.hpc.itc.rwth-aachen.de <22> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 32,80
<br>
ncm0728.hpc.itc.rwth-aachen.de <23> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 35,83
<br>
ncm0728.hpc.itc.rwth-aachen.de <24> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 12,60
<br>
ncm0728.hpc.itc.rwth-aachen.de <25> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 15,63
<br>
ncm0728.hpc.itc.rwth-aachen.de <26> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 36,84
<br>
ncm0728.hpc.itc.rwth-aachen.de <27> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 39,87
<br>
ncm0728.hpc.itc.rwth-aachen.de <28> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 13,61
<br>
ncm0728.hpc.itc.rwth-aachen.de <29> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 16,64
<br>
ncm0728.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 24,72
<br>
ncm0728.hpc.itc.rwth-aachen.de <30> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 37,85
<br>
ncm0728.hpc.itc.rwth-aachen.de <31> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 40,88
<br>
ncm0728.hpc.itc.rwth-aachen.de <32> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 14,62
<br>
ncm0728.hpc.itc.rwth-aachen.de <33> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 17,65
<br>
ncm0728.hpc.itc.rwth-aachen.de <34> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 38,86
<br>
ncm0728.hpc.itc.rwth-aachen.de <35> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 41,89
<br>
ncm0728.hpc.itc.rwth-aachen.de <36> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 18,66
<br>
ncm0728.hpc.itc.rwth-aachen.de <37> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 21,69
<br>
ncm0728.hpc.itc.rwth-aachen.de <38> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 42,90
<br>
ncm0728.hpc.itc.rwth-aachen.de <39> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 45,93
<br>
ncm0728.hpc.itc.rwth-aachen.de <3> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 27,75
<br>
ncm0728.hpc.itc.rwth-aachen.de <40> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 19,67
<br>
ncm0728.hpc.itc.rwth-aachen.de <41> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 22,70
<br>
ncm0728.hpc.itc.rwth-aachen.de <42> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 43,91
<br>
ncm0728.hpc.itc.rwth-aachen.de <43> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 46,94
<br>
ncm0728.hpc.itc.rwth-aachen.de <44> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 20,68
<br>
ncm0728.hpc.itc.rwth-aachen.de <45> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 23,71
<br>
ncm0728.hpc.itc.rwth-aachen.de <46> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 44,92
<br>
ncm0728.hpc.itc.rwth-aachen.de <47> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 47,95
<br>
ncm0728.hpc.itc.rwth-aachen.de <4> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 1,49
<br>
ncm0728.hpc.itc.rwth-aachen.de <5> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 4,52
<br>
ncm0728.hpc.itc.rwth-aachen.de <6> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 25,73
<br>
ncm0728.hpc.itc.rwth-aachen.de <7> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 28,76
<br>
ncm0728.hpc.itc.rwth-aachen.de <8> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 2,50
<br>
ncm0728.hpc.itc.rwth-aachen.de <9> OMP_STACKSIZE:
<#> unlimited+p2 +pemap 5,53
<br>
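The pairing in the listing above can be checked mechanically. This is a
minimal sketch in Python, assuming that the sibling hyperthread of a core is
always the core id plus 48 (an assumption read off this listing, not a general
rule); parse_bindings, unpaired_ranks and SIBLING_OFFSET are hypothetical names,
and only four of the 48 lines are copied in as sample input.
<br>
<br>

```python
import re

# Four lines copied from the listing above (ranks 0, 1, 2 and 47),
# with the wrapped mail lines joined back together.
SAMPLE = """\
ncm0728.hpc.itc.rwth-aachen.de <0> OMP_STACKSIZE: <#> unlimited+p2 +pemap 0,48
ncm0728.hpc.itc.rwth-aachen.de <1> OMP_STACKSIZE: <#> unlimited+p2 +pemap 3,51
ncm0728.hpc.itc.rwth-aachen.de <2> OMP_STACKSIZE: <#> unlimited+p2 +pemap 24,72
ncm0728.hpc.itc.rwth-aachen.de <47> OMP_STACKSIZE: <#> unlimited+p2 +pemap 47,95
"""

# Assumption read off the listing: sibling thread id = core id + 48.
SIBLING_OFFSET = 48

# Captures the task rank and the two CPU ids printed after +pemap.
LINE_RE = re.compile(r"<(\d+)>.*\+pemap (\d+),(\d+)")


def parse_bindings(text):
    """Map each task rank to the (core, sibling) pair printed after +pemap."""
    bindings = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            rank, core, sibling = (int(part) for part in match.groups())
            bindings[rank] = (core, sibling)
    return bindings


def unpaired_ranks(bindings, offset=SIBLING_OFFSET):
    """Return the ranks whose second CPU id is not core + offset."""
    return sorted(
        rank for rank, (core, sibling) in bindings.items()
        if sibling != core + offset
    )


if __name__ == "__main__":
    print(unpaired_ranks(parse_bindings(SAMPLE)))
```

<br>
An empty result means that every sampled task is bound to a core together with
its assumed sibling thread.
<br>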
<br>
<br>
--ntasks=48:
<br>
<br>
NodeList=ncm0728
<br>
BatchHost=ncm0728
<br>
NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1
ReqB:S:C:T=0:0:*:*
<br>
TRES=cpu=48,mem=182400M,node=1,billing=48
<br>
<br>
<br>
--ntasks=48
<br>
--ntasks-per-node=24:
<br>
<br>
NodeList=ncm[0438-0439]
<br>
BatchHost=ncm0438
<br>
NumNodes=2 NumCPUs=48 NumTasks=48 CPUs/Task=1
ReqB:S:C:T=0:0:*:*
<br>
TRES=cpu=48,mem=182400M,node=2,billing=48
<br>
<br>
<br>
--ntasks=48
<br>
--ntasks-per-node=48:
<br>
<br>
sbatch: error: CPU count per node can not be satisfied
<br>
sbatch: error: Batch job submission failed: Requested node
configuration is not available
<br>
<br>
<br>
Isn't the first essentially the same as the last, with the only
difference that I want to force Slurm to put all tasks
onto one node?
<br>
<br>
<br>
<br>
Best
<br>
Marcus
<br>
<br>
<br>
<blockquote type="cite">
<blockquote type="cite">On 2/14/19 7:15 AM, Chris Samuel
wrote:
<br>
On Wednesday, 13 February 2019 4:48:05 AM PST Marcus
Wagner wrote:
<br>
<br>
#SBATCH --ntasks-per-node=48
<br>
</blockquote>
I wouldn't mind betting that if you set that to 24 it
will work, and each
<br>
thread will be assigned a single core with the 2 thread
units on it.
<br>
<br>
All the best,
<br>
Chris
<br>
</blockquote>
-- <br>
Marcus Wagner, Dipl.-Inf.
<br>
<br>
IT Center
<br>
Abteilung: Systeme und Betrieb
<br>
RWTH Aachen University
<br>
Seffenter Weg 23
<br>
52074 Aachen
<br>
Tel: +49 241 80-24383
<br>
Fax: +49 241 80-624383
<br>
<a class="moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de">wagner@itc.rwth-aachen.de</a>
<br>
<a class="moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de">www.itc.rwth-aachen.de</a>
<br>
<br>
<br>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<br>
</blockquote>
</body>
</html>