[slurm-users] 4 sockets but "

Diego Zuccato diego.zuccato at unibo.it
Tue Jul 20 14:02:17 UTC 2021


On 20/07/2021 13:23, Ole Holm Nielsen wrote:

Hello Ole.

> The Xeon Platinum 8268 is a 24-core CPU:
> https://ark.intel.com/content/www/us/en/ark/products/192481/intel-xeon-platinum-8268-processor-35-75m-cache-2-90-ghz.html 
Yup.

> 1. So you have 4 physical sockets in each node?
Correct.

> 2. Did you define a Sub NUMA Cluster (SNC) BIOS setting?  Then each 
> physical socket would show up as two sockets (memory controllers), for a 
> total of 8 "sockets" in your 4-socket system.
I don't think so. Unless that's the default, I didn't change anything in 
the BIOS. I just checked the second of the three nodes (still no OS 
installed) and found it under Chipset Configuration -> North bridge -> 
UPI configuration -> SNC: it's set to Disable.

> 3. Which Slurm version are you running, and which OS version?
Slurm 18.08.5, standard packages from Debian stable.

> 4. What is the output of "slurmd -C" (Print actual hardware 
> configuration) on the node?
root@str957-mtx-20:~# slurmd -C
NodeName=str957-mtx-20 CPUs=192 Boards=1 SocketsPerBoard=4 
CoresPerSocket=24 ThreadsPerCore=2 RealMemory=1160347
UpTime=1-01:50:48
So the node seems to correctly recognize the underlying HW.
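
As a sanity check on those numbers, here is a minimal Python sketch. The helper name parse_kv is made up for illustration, and the input string is just the "slurmd -C" line above joined onto one line. It only verifies that Boards * SocketsPerBoard * CoresPerSocket * ThreadsPerCore equals the reported CPUs value; it says nothing about what slurmctld does with it.

```python
# Sanity check: does the topology printed by "slurmd -C" multiply out to CPUs?
# The string below is copied from the output above (joined onto one line).
SLURMD_C = (
    "NodeName=str957-mtx-20 CPUs=192 Boards=1 SocketsPerBoard=4 "
    "CoresPerSocket=24 ThreadsPerCore=2 RealMemory=1160347"
)


def parse_kv(line):
    """Split 'Key=Value Key=Value ...' into a dict of strings (hypothetical helper)."""
    return dict(token.split("=", 1) for token in line.split())


hw = parse_kv(SLURMD_C)

# Logical CPUs implied by the reported topology.
expected_cpus = (
    int(hw["Boards"])
    * int(hw["SocketsPerBoard"])
    * int(hw["CoresPerSocket"])
    * int(hw["ThreadsPerCore"])
)

print(expected_cpus, int(hw["CPUs"]))  # 192 192
```

So 1 * 4 * 24 * 2 = 192, which agrees with CPUs=192 and with the last "processor : 191" entry in /proc/cpuinfo.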

I tried to copy&paste "SocketsPerBoard=4 CoresPerSocket=24 
ThreadsPerCore=2 RealMemory=1160347" in the nodes definition line (the 
only difference being SocketsPerBoard instead of Sockets), but the 
result is always the same.
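
To make the mismatch concrete, here is a small Python sketch that compares the layout detected by "slurmd -C" with the one "scontrol show node" reports. The values are copied by hand from the two outputs in this thread; the dictionary and function names are only illustrative.

```python
# Topology as detected on the node ("slurmd -C") versus what the controller
# reports ("scontrol show node"). Values copied by hand from this thread.
detected = {"Sockets": 4, "CoresPerSocket": 24, "ThreadsPerCore": 2}
reported = {"Sockets": 2, "CoresPerSocket": 1, "ThreadsPerCore": 2}


def cpu_count(topology):
    """Logical CPUs implied by a sockets/cores/threads layout."""
    return (
        topology["Sockets"]
        * topology["CoresPerSocket"]
        * topology["ThreadsPerCore"]
    )


# Keys whose values differ, as (detected, reported) pairs.
mismatches = {
    key: (detected[key], reported[key])
    for key in detected
    if detected[key] != reported[key]
}

print(cpu_count(detected))  # 192
print(cpu_count(reported))  # 4
print(mismatches)
```

The reported layout works out to 2 * 1 * 2 = 4 CPUs, the same figure as CfgTRES=cpu=4, so the controller appears to be working from a 2x1x2 layout rather than the one in my NodeName line. Only ThreadsPerCore matches.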

> Maybe my Wiki notes could be helpful?
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#compute-node-configuration 
Thanks. Interesting, but I don't use pam_slurm_adopt. Other than that, it 
seems very much like what I'm doing.

BYtE,
  Diego

> On 7/20/21 12:49 PM, Diego Zuccato wrote:
>> Hello all.
>>
>> I've been facing this issue since yesterday.
>> I'm configuring 3 new quad-socket nodes defined as:
>> NodeName=str957-mtx-[20-22] Sockets=4 CoresPerSocket=24 \
>>     RealMemory=1160347 Weight=8 Feature=ib,matrix,intel,avx
>>
>> But scontrol show node str957-mtx-20 reports:
>> NodeName=str957-mtx-20 CoresPerSocket=1
>> [...]
>>     RealMemory=1160347 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
>>     MemSpecLimit=2048
>>     State=DOWN*+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=8 Owner=N/A MCS_label=N/A
>>     Partitions=m3
>>     BootTime=None SlurmdStartTime=None
>>     CfgTRES=cpu=4,mem=1160347M,billing=2783
>>     AllocTRES=
>>     CapWatts=n/a
>>     CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>     Reason=Low socket*core*thread count, Low CPUs
>>
>> The only thing that's right is that ThreadsPerCore=2 ...
>> The last block from "cat /proc/cpuinfo" on the node reports:
>> processor       : 191
>> vendor_id       : GenuineIntel
>> cpu family      : 6
>> model           : 85
>> model name      : Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
>> stepping        : 7
>> microcode       : 0x5003003
>> cpu MHz         : 1200.331
>> cache size      : 36608 KB
>> physical id     : 3
>> siblings        : 48
>> core id         : 29
>> cpu cores       : 24
>> apicid          : 251
>> initial apicid  : 251
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 22
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
>> syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts 
>> rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq 
>> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm 
>> pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes 
>> xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 
>> cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp 
>> ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase 
>> tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f 
>> avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw 
>> avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc 
>> cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke 
>> avx512_vnni md_clear flush_l1d arch_capabilities
>> bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs taa 
>> itlb_multihit
>> bogomips        : 5804.13
>> clflush size    : 64
>> cache_alignment : 64
>> address sizes   : 46 bits physical, 48 bits virtual
>> power management:
>>
>> I already tried changing the line to cheat (telling it there are only 
>> 2 sockets with 24 cores, thus using only half of the node) but nothing 
>> changed. I restarted slurmctld after every change in slurm.conf just 
>> to be sure.
>> Any idea?
> 

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
