<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Chris,</div><div><br></div><div dir="ltr">> > 1.) Slurm seems to be incapable of recognizing sockets/cores/threads on<br>> > these systems.<br>> [...]<br>> > Anyone know if there is a way to get Slurm to recognize the true topology<br>> > for POWER nodes?<br>> <br>> IIIRC Slurm uses hwloc for discovering topology, so "lstopo-no-graphics" might<br>> give you some insights into whether it's showing you the right config.<br>> <br>> I'd be curious to see what "lscpu" and "slurmd -C" say as well.</div><div dir="ltr"><br></div><div>The biggest problem as I see it, is that if I have 2 20-core sockets, if I have SMT2 set this looks like 80 single-core, single-thread sockets to Slurm (see slurmd -C output below). If I have SMT4 set, it thinks there are 160 sockets. <br></div><div><br></div><div><p class="MsoNormal">NodeName=enki13 CPUs=80 Boards=1 SocketsPerBoard=80 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=583992</p>
<p class="MsoNormal">UpTime=<span class="gmail-aBn" tabindex="0"><span class="gmail-aQJ">0-23:20:16</span></span></p><p class="MsoNormal"><span class="gmail-aBn" tabindex="0"><span class="gmail-aQJ"><br></span></span></p><p class="MsoNormal"><span class="gmail-aBn" tabindex="0"><span class="gmail-aQJ">How do you set your configuration for Slurm to get meaningful CPU affinity for, say, placing tasks on 2 cores per socket (instead of scheduling 4 cores on one socket)? <br></span></span></p><p class="MsoNormal"><span class="gmail-aBn" tabindex="0"><span class="gmail-aQJ"><br></span></span></p><p class="MsoNormal"><span class="gmail-aBn" tabindex="0"><span class="gmail-aQJ">For SMT2, lscpu output looks like this:</span></span></p><p class="MsoNormal"><span class="gmail-aBn" tabindex="0"><span class="gmail-aQJ"><br></span></span></p><p class="MsoNormal"><span class="gmail-aBn" tabindex="0"><span class="gmail-aQJ">Architecture: ppc64le<br>Byte Order: Little Endian<br>CPU(s): 160<br>On-line CPU(s) list: 0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109,112,113,116,117,120,121,124,125,128,129,132,133,136,137,140,141,144,145,148,149,152,153,156,157<br>Off-line CPU(s) list: 2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111,114,115,118,119,122,123,126,127,130,131,134,135,138,139,142,143,146,147,150,151,154,155,158,159<br>Thread(s) per core: 2<br>Core(s) per socket: 20<br>Socket(s): 2<br>NUMA node(s): 6<br>Model: 2.2 (pvr 004e 1202)<br>Model name: POWER9, altivec supported<br>CPU max MHz: 3800.0000<br>CPU min MHz: 2300.0000<br>L1d cache: 32K<br>L1i cache: 32K<br>L2 cache: 512K<br>L3 cache: 10240K<br>NUMA node0 CPU(s): 0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77<br>NUMA node8 CPU(s): 80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109,112,113,116,117,120,121,124,125,128,129,132,133,136,137,140,141,144,145,148,149,152,153,156,157<br></span></span></p></div><div dir="ltr">...</div><div dir="ltr"><br></div><div>For SMT4, it looks like this:</div><div><br></div><div>Architecture: ppc64le<br>Byte Order: Little Endian<br>CPU(s): 160<br>On-line CPU(s) list: 0-159<br>Thread(s) per core: 4<br>Core(s) per socket: 20<br>Socket(s): 2<br>NUMA node(s): 6<br>Model: 2.2 (pvr 004e 1202)<br>Model name: POWER9, altivec supported<br>CPU max MHz: 3800.0000<br>CPU min MHz: 2300.0000<br>L1d cache: 32K<br>L1i cache: 32K<br>L2 cache: 512K<br>L3 cache: 10240K<br>NUMA node0 CPU(s): 0-79<br>NUMA node8 CPU(s): 80-159</div><div><br></div><div><br></div><div dir="ltr">> <br>> > 2.) Another concern is the gres.conf. Slurm seems to have trouble taking<br>> > processor ID's that are > "#Sockets". The true processor ID as given by<br>> > nvidia-smi topo -m output will range up to 159, and slurm doesn't like<br>> > this. 
Are we to use "Cores=" entries in gres.conf, and use the number of<br>> > the physical cores, instead of what nvidia-smi outputs?<br>> <br>> Again I *think* Slurm is using hwloc's logical CPU numbering for this, so<br>> lstopo should help - using a quick snippet on my local PC (HT enabled) here:<br>> <br>> Package L#0 + L3 L#0 (8192KB)<br>> L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 <br>> PU L#0 (P#0)<br>> PU L#1 (P#4)<br>> L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 <br>> PU L#2 (P#1)<br>> PU L#3 (P#5)<br>> L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 <br>> PU L#4 (P#2)<br>> PU L#5 (P#6)<br>> L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 <br>> PU L#6 (P#3)<br>> PU L#7 (P#7)<br>> <br>> you can see that the logical numbering (L#0 and L#1) is done to be contiguous<br>> compared to how the firmware has enumerated the CPUs.<br>> <br>> > 3.) A related gres.conf question: there seems to be no documentation of<br>> > using "CPUs=" instead of "Cores=", yet I have seen several online examples<br>> > using "CPUs=" (and I myself have used it on an x86 system without issue).<br>> > Should one use "Cores" instead of "CPUs", when specifying binding to<br>> > specific GPUs?<br>> <br>> I think CPUs= was the older syntax which has been replaced with Cores=.<br>> <br>> The gres.conf we use on our HPC cluster uses Cores= quite happily.<br>> <br>> Name=gpu Type=p100 File=/dev/nvidia0 Cores=0-17<br>> Name=gpu Type=p100 File=/dev/nvidia1 Cores=18-35<br>

We will try setting Cores= as numbered by "Core L#n" and see how that works for us. We are using cgroup enforcement, so for a particular job a user will only see the GPUs they allocate, and I expect the output of "nvidia-smi topo -m" to be similarly affected: the cores/threads listed will just be sequential IDs for the cores/threads requested, not the P# IDs reported when "nvidia-smi topo -m" is run by root outside of a Slurm-controlled job.

For SMT2, lstopo output looks like this:

Machine (570GB total)
  Group0 L#0
    NUMANode L#0 (P#0 252GB)
    Package L#0
      L3 L#0 (10MB) + L2 L#0 (512KB)
        L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
          PU L#0 (P#0)
          PU L#1 (P#1)
        L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
          PU L#2 (P#4)
          PU L#3 (P#5)
      L3 L#1 (10MB) + L2 L#1 (512KB)
        L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
          PU L#4 (P#8)
          PU L#5 (P#9)
        L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
          PU L#6 (P#12)
          PU L#7 (P#13)
...
      L3 L#9 (10MB) + L2 L#9 (512KB)
        L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
          PU L#36 (P#72)
          PU L#37 (P#73)
        L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
          PU L#38 (P#76)
          PU L#39 (P#77)
...
    NUMANode L#1 (P#8 256GB)
    Package L#1
      L3 L#10 (10MB) + L2 L#10 (512KB)
        L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
          PU L#40 (P#80)
          PU L#41 (P#81)
        L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
          PU L#42 (P#84)
          PU L#43 (P#85)
...
      L3 L#19 (10MB) + L2 L#19 (512KB)
        L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
          PU L#76 (P#152)
          PU L#77 (P#153)
        L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
          PU L#78 (P#156)
          PU L#79 (P#157)

So my guess here is that GPU0,GPU1 would get Cores=0-19, and GPU2,GPU3 would get Cores=20-39, as numbered by lstopo (i.e. the gres.conf sketch above)?

- Keith Ball