[slurm-users] Slurm on POWER9
Keith Ball
bipcuds at gmail.com
Wed Sep 12 17:17:03 MDT 2018
Chris,
> > 1.) Slurm seems to be incapable of recognizing sockets/cores/threads on
> > these systems.
> [...]
> > Anyone know if there is a way to get Slurm to recognize the true topology
> > for POWER nodes?
>
> IIRC Slurm uses hwloc for discovering topology, so "lstopo-no-graphics" might
> give you some insights into whether it's showing you the right config.
>
> I'd be curious to see what "lscpu" and "slurmd -C" say as well.
The biggest problem, as I see it, is that with two 20-core sockets and SMT2
enabled, the node looks to Slurm like 80 single-core, single-thread sockets
(see slurmd -C output below). With SMT4 enabled, Slurm thinks there are 160
sockets.
NodeName=enki13 CPUs=80 Boards=1 SocketsPerBoard=80 CoresPerSocket=1
ThreadsPerCore=1 RealMemory=583992
UpTime=0-23:20:16
How do you set up your Slurm configuration to get meaningful CPU affinity,
for example to place tasks on 2 cores per socket instead of scheduling all 4
cores on one socket?
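(For reference, the direction I was thinking of going is to stop relying on
autodetection and spell the topology out in slurm.conf, roughly like the sketch
below. RealMemory is taken from the slurmd -C output above; the select/task
plugin lines are just the combination I would expect to need, not something we
have verified on POWER9 yet.)

  NodeName=enki13 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=583992 State=UNKNOWN
  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory
  TaskPlugin=task/affinity,task/cgroup

With that in place, something like "srun -N1 -n4 --ntasks-per-socket=2
--cpu_bind=cores ..." ought to place two tasks on each socket, assuming Slurm
then sees the real 2 x 20 layout.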
For SMT2, lscpu output looks like this:
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list:
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109,112,113,116,117,120,121,124,125,128,129,132,133,136,137,140,141,144,145,148,149,152,153,156,157
Off-line CPU(s) list:
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111,114,115,118,119,122,123,126,127,130,131,134,135,138,139,142,143,146,147,150,151,154,155,158,159
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 6
Model: 2.2 (pvr 004e 1202)
Model name: POWER9, altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2300.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s):
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77
NUMA node8 CPU(s):
80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109,112,113,116,117,120,121,124,125,128,129,132,133,136,137,140,141,144,145,148,149,152,153,156,157
...
For SMT4, it looks like this:
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Thread(s) per core: 4
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 6
Model: 2.2 (pvr 004e 1202)
Model name: POWER9, altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2300.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-79
NUMA node8 CPU(s): 80-159
>
> > 2.) Another concern is the gres.conf. Slurm seems to have trouble taking
> > processor ID's that are > "#Sockets". The true processor ID as given by
> > nvidia-smi topo -m output will range up to 159, and slurm doesn't like
> > this. Are we to use "Cores=" entries in gres.conf, and use the number of
> > the physical cores, instead of what nvidia-smi outputs?
>
> Again I *think* Slurm is using hwloc's logical CPU numbering for this, so
> lstopo should help - using a quick snippet on my local PC (HT enabled) here:
>
> Package L#0 + L3 L#0 (8192KB)
> L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
> PU L#0 (P#0)
> PU L#1 (P#4)
> L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
> PU L#2 (P#1)
> PU L#3 (P#5)
> L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
> PU L#4 (P#2)
> PU L#5 (P#6)
> L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
> PU L#6 (P#3)
> PU L#7 (P#7)
>
> you can see that the logical numbering (L#0 and L#1) is done to be contiguous
> compared to how the firmware has enumerated the CPUs.
>
> > 3.) A related gres.conf question: there seems to be no documentation of
> > using "CPUs=" instead of "Cores=", yet I have seen several online examples
> > using "CPUs=" (and I myself have used it on an x86 system without issue).
> > Should one use "Cores" instead of "CPUs", when specifying binding to
> > specific GPUs?
>
> I think CPUs= was the older syntax which has been replaced with Cores=.
>
> The gres.conf we use on our HPC cluster uses Cores= quite happily.
>
> Name=gpu Type=p100 File=/dev/nvidia0 Cores=0-17
> Name=gpu Type=p100 File=/dev/nvidia1 Cores=18-35
We will try setting Cores= as numbered by "Core L#n" and see how that works
for us. We are using cgroup enforcement, so a particular user's job will only
see the GPUs it allocates, and I expect the output of "nvidia-smi topo -m"
will be similarly affected: the cores/threads listed will just be sequential
IDs for the cores/threads the job requested, not the P# IDs reported when
"nvidia-smi topo -m" is run by root outside of a Slurm-controlled job.
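(Once that is set up, a quick sanity check from inside an allocation, something
like the line below, should show whether the cgroup cpuset and the GPU binding
line up; Cpus_allowed_list comes straight from /proc, so it is not subject to
the renumbering question. The exact flags are just an illustration.)

  srun --gres=gpu:1 -n1 bash -c 'grep Cpus_allowed_list /proc/self/status; nvidia-smi topo -m'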
For SMT2, lstopo output looks like this:
Machine (570GB total)
Group0 L#0
NUMANode L#0 (P#0 252GB)
Package L#0
L3 L#0 (10MB) + L2 L#0 (512KB)
L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#1)
L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#4)
PU L#3 (P#5)
L3 L#1 (10MB) + L2 L#1 (512KB)
L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#8)
PU L#5 (P#9)
L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#12)
PU L#7 (P#13)
...
L3 L#9 (10MB) + L2 L#9 (512KB)
L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
PU L#36 (P#72)
PU L#37 (P#73)
L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
PU L#38 (P#76)
PU L#39 (P#77)
...
NUMANode L#1 (P#8 256GB)
Package L#1
L3 L#10 (10MB) + L2 L#10 (512KB)
L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
PU L#40 (P#80)
PU L#41 (P#81)
L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
PU L#42 (P#84)
PU L#43 (P#85)
...
L3 L#19 (10MB) + L2 L#19 (512KB)
L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
PU L#76 (P#152)
PU L#77 (P#153)
L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
PU L#78 (P#156)
PU L#79 (P#157)
So my guess here is that GPU0 and GPU1 would get Cores=0-19, and GPU2 and GPU3
would get Cores=20-39, as numbered by lstopo?
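In concrete terms, that would be a gres.conf roughly like the one below
(the /dev/nvidia0-3 device file names and the GPU-to-socket split are
assumptions based on the nvidia-smi topo -m layout, not something we have
confirmed yet):

  Name=gpu File=/dev/nvidia0 Cores=0-19
  Name=gpu File=/dev/nvidia1 Cores=0-19
  Name=gpu File=/dev/nvidia2 Cores=20-39
  Name=gpu File=/dev/nvidia3 Cores=20-39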
- Keith Ball