[slurm-users] Slurm on POWER9

Keith Ball bipcuds at gmail.com
Wed Sep 12 17:17:03 MDT 2018


Chris,

> > 1.) Slurm seems to be incapable of recognizing sockets/cores/threads on
> > these systems.
> [...]
> > Anyone know if there is a way to get Slurm to recognize the true
> > topology for POWER nodes?
>
> IIRC Slurm uses hwloc for discovering topology, so "lstopo-no-graphics"
> might give you some insights into whether it's showing you the right config.
>
> I'd be curious to see what "lscpu" and "slurmd -C" say as well.

The biggest problem, as I see it: with two 20-core sockets and SMT2 set,
the node looks to Slurm like 80 single-core, single-thread sockets (see
slurmd -C output below). With SMT4 set, it thinks there are 160 sockets.

NodeName=enki13 CPUs=80 Boards=1 SocketsPerBoard=80 CoresPerSocket=1
ThreadsPerCore=1 RealMemory=583992

UpTime=0-23:20:16


How do you set your configuration for Slurm to get meaningful CPU affinity
for, say, placing tasks on 2 cores per socket (instead of scheduling 4
cores on one socket)?
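For reference, here is a minimal sketch of what I imagine we would put in
slurm.conf to override the autodetected topology (numbers taken from the
slurmd -C and lscpu output below; this is an untested assumption, not a
working config):

```
# Sketch: describe the real SMT2 topology instead of the autodetected
# 80-socket layout; CPUs = 2 sockets x 20 cores x 2 threads = 80
NodeName=enki13 CPUs=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=583992
```

With an explicit topology like this, placement options such as
--ntasks-per-socket would at least have real sockets to work with.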


For SMT2, lscpu output looks like this:


Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109,112,113,116,117,120,121,124,125,128,129,132,133,136,137,140,141,144,145,148,149,152,153,156,157
Off-line CPU(s) list:
2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111,114,115,118,119,122,123,126,127,130,131,134,135,138,139,142,143,146,147,150,151,154,155,158,159
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          6
Model:                 2.2 (pvr 004e 1202)
Model name:            POWER9, altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2300.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              10240K
NUMA node0 CPU(s):
0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77
NUMA node8 CPU(s):
80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109,112,113,116,117,120,121,124,125,128,129,132,133,136,137,140,141,144,145,148,149,152,153,156,157
...
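Incidentally, the on-line/off-line split above follows a regular pattern:
physical PU IDs stride by 4 per core (the SMT4 slots), and only the first
SMT-many threads of each core are on-line. A quick sketch of that pattern
(my own illustration, not anything Slurm or hwloc does):

```python
# POWER9 numbering: each physical core owns 4 consecutive PU IDs (its SMT4
# slots); with SMT2, only the first 2 threads of each core are on-line.
def online_pus(cores, smt):
    return [4 * core + thread for core in range(cores) for thread in range(smt)]

# 40 cores (2 sockets x 20), SMT2 -> matches the "On-line CPU(s) list" above
print(online_pus(40, 2)[:8])  # [0, 1, 4, 5, 8, 9, 12, 13]
```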

For SMT4, it looks like this:

Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:   0-159
Thread(s) per core:    4
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          6
Model:                 2.2 (pvr 004e 1202)
Model name:            POWER9, altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2300.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              10240K
NUMA node0 CPU(s):     0-79
NUMA node8 CPU(s):     80-159


>
> > 2.) Another concern is the gres.conf. Slurm seems to have trouble taking
> > processor ID's that are > "#Sockets". The true processor ID as given by
> > nvidia-smi topo -m output will range up to 159, and slurm doesn't like
> > this. Are we to use "Cores=" entries in gres.conf, and use the number of
> > the physical cores, instead of what nvidia-smi outputs?
>
> Again I *think* Slurm is using hwloc's logical CPU numbering for this, so
> lstopo should help - using a quick snippet on my local PC (HT enabled) here:
>
>   Package L#0 + L3 L#0 (8192KB)
>     L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>       PU L#0 (P#0)
>       PU L#1 (P#4)
>     L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>       PU L#2 (P#1)
>       PU L#3 (P#5)
>     L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>       PU L#4 (P#2)
>       PU L#5 (P#6)
>     L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>       PU L#6 (P#3)
>       PU L#7 (P#7)
>
> you can see that the logical numbering (L#0 and L#1) is done to be
> contiguous compared to how the firmware has enumerated the CPUs.
>
> > 3.) A related gres.conf question: there seems to be no documentation of
> > using "CPUs=" instead of "Cores=", yet I have seen several online examples
> > using "CPUs=" (and I myself have used it on an x86 system without issue).
> > Should one use "Cores" instead of "CPUs", when specifying binding to
> > specific GPUs?
>
> I think CPUs= was the older syntax which has been replaced with Cores=.
>
> The gres.conf we use on our HPC cluster uses Cores= quite happily.
>
> Name=gpu Type=p100 File=/dev/nvidia0 Cores=0-17
> Name=gpu Type=p100 File=/dev/nvidia1 Cores=18-35

We will try setting Cores as numbered by "Core L#n" and see how that works
for us. Since we use cgroup enforcement, a user's job will only see the
GPUs it allocates, and I expect the output of "nvidia-smi topo -m" to be
similarly affected: the cores/threads listed will just be sequential IDs
for the cores/threads the job requested, not the P# IDs reported when
"nvidia-smi topo -m" is run by root outside a Slurm-controlled job.

For SMT2, lstopo output looks like this:

Machine (570GB total)
  Group0 L#0
    NUMANode L#0 (P#0 252GB)
      Package L#0
        L3 L#0 (10MB) + L2 L#0 (512KB)
          L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
            PU L#0 (P#0)
            PU L#1 (P#1)
          L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
            PU L#2 (P#4)
            PU L#3 (P#5)
        L3 L#1 (10MB) + L2 L#1 (512KB)
          L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
            PU L#4 (P#8)
            PU L#5 (P#9)
          L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
            PU L#6 (P#12)
            PU L#7 (P#13)
...
        L3 L#9 (10MB) + L2 L#9 (512KB)
          L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
            PU L#36 (P#72)
            PU L#37 (P#73)
          L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
            PU L#38 (P#76)
            PU L#39 (P#77)
...
    NUMANode L#1 (P#8 256GB)
      Package L#1
        L3 L#10 (10MB) + L2 L#10 (512KB)
          L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
            PU L#40 (P#80)
            PU L#41 (P#81)
          L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
            PU L#42 (P#84)
            PU L#43 (P#85)
...
        L3 L#19 (10MB) + L2 L#19 (512KB)
          L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
            PU L#76 (P#152)
            PU L#77 (P#153)
          L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
            PU L#78 (P#156)
            PU L#79 (P#157)

So my guess here is that GPU0,GPU1 would get Cores=0-19, and GPU2,GPU3 get
Cores=20-39 as numbered by lstopo?
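If that guess holds, the gres.conf would look something like this (the
device-file-to-socket pairing is an assumption on my part, and the sketch
is untested):

```
# Sketch only: Cores= uses lstopo's logical core numbering (Core L#n),
# assuming GPU0/GPU1 sit on socket 0 and GPU2/GPU3 on socket 1
Name=gpu File=/dev/nvidia0 Cores=0-19
Name=gpu File=/dev/nvidia1 Cores=0-19
Name=gpu File=/dev/nvidia2 Cores=20-39
Name=gpu File=/dev/nvidia3 Cores=20-39
```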

- Keith Ball