[slurm-users] 4 sockets but "

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Tue Jul 20 11:23:46 UTC 2021


Hi Diego,

The Xeon Platinum 8268 is a 24-core CPU:
https://ark.intel.com/content/www/us/en/ark/products/192481/intel-xeon-platinum-8268-processor-35-75m-cache-2-90-ghz.html

Questions:

1. So you have 4 physical sockets in each node?

2. Did you enable the Sub-NUMA Clustering (SNC) BIOS setting?  With SNC 
enabled, each physical socket shows up as two sockets (one per memory 
controller), for a total of 8 "sockets" in your 4-socket system.

3. Which Slurm version are you running, and which OS version?

4. What is the output of "slurmd -C" (Print actual hardware configuration) 
on the node?
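
For comparison, this is roughly the shape of "slurmd -C" output one 
would expect on a healthy 4-socket 8268 node (values inferred from the 
hardware described below, not actual output from Diego's node; the 
exact set of fields varies by Slurm version):

```
NodeName=str957-mtx-20 CPUs=192 Boards=1 SocketsPerBoard=4 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=1160347
```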

FYI: We have some dual-socket Xeon Cascade Lake (20-core) servers with 
SNC enabled, and in slurm.conf I have:

NodeName=s001 Sockets=4 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=191000

That is a total of 80 logical CPUs including Hyper-Threading.
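
The CPU totals in this thread can be cross-checked with a quick 
calculation (values taken from the slurm.conf line above and from 
Diego's /proc/cpuinfo below):

```python
# Sockets x CoresPerSocket x ThreadsPerCore = logical CPUs seen by Slurm.

# Ole's SNC node: Sockets=4 CoresPerSocket=10 ThreadsPerCore=2
ole_cpus = 4 * 10 * 2
print(ole_cpus)      # 80 logical CPUs

# Diego's nodes: 4 physical sockets of 24-core Xeon 8268 with Hyper-Threading
diego_cpus = 4 * 24 * 2
print(diego_cpus)    # 192 logical CPUs, so the highest 0-based
                     # "processor" index in /proc/cpuinfo is 191
```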

Maybe my Wiki notes could be helpful?
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#compute-node-configuration

Best regards,
Ole


On 7/20/21 12:49 PM, Diego Zuccato wrote:
> Hello all.
> 
> I've been facing this issue since yesterday.
> I'm configuring 3 new quad-socket nodes defined as:
> NodeName=str957-mtx-[20-22] Sockets=4 CoresPerSocket=24 \
>     RealMemory=1160347 Weight=8 Feature=ib,matrix,intel,avx
> 
> But scontrol show node str957-mtx-20 reports:
> NodeName=str957-mtx-20 CoresPerSocket=1
> [...]
>     RealMemory=1160347 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
>     MemSpecLimit=2048
>     State=DOWN*+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=8 Owner=N/A 
> MCS_label=N/A
>     Partitions=m3
>     BootTime=None SlurmdStartTime=None
>     CfgTRES=cpu=4,mem=1160347M,billing=2783
>     AllocTRES=
>     CapWatts=n/a
>     CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>     Reason=Low socket*core*thread count, Low CPUs
> 
> The only thing that's right is that ThreadsPerCore=2 ...
> The last block from "cat /proc/cpuinfo" on the node reports:
> processor       : 191
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 85
> model name      : Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
> stepping        : 7
> microcode       : 0x5003003
> cpu MHz         : 1200.331
> cache size      : 36608 KB
> physical id     : 3
> siblings        : 48
> core id         : 29
> cpu cores       : 24
> apicid          : 251
> initial apicid  : 251
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 22
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
> pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl 
> xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl 
> vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 
> x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm 
> abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin 
> ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept 
> vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx 
> rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd 
> avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc 
> cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni 
> md_clear flush_l1d arch_capabilities
> bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs taa 
> itlb_multihit
> bogomips        : 5804.13
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 46 bits physical, 48 bits virtual
> power management:
> 
> I already tried cheating in the node definition (telling Slurm there 
> are only 2 sockets with 24 cores each, i.e. half the node), but 
> nothing changed. I restarted slurmctld after every change to 
> slurm.conf, just to be sure.
> Any idea?
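
A quick way to cross-check what Slurm should detect is to count the 
distinct "physical id" and "core id" values in /proc/cpuinfo. A minimal 
sketch in Python (the helper name is ours, not from the thread):

```python
def topology(cpuinfo_text):
    """Derive (sockets, cores_per_socket, threads_per_core) from
    /proc/cpuinfo-style text by counting distinct ids."""
    sockets = set()   # distinct "physical id" values
    cores = set()     # distinct (physical id, core id) pairs
    nproc = 0         # number of "processor" stanzas = logical CPUs
    phys = None
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue
        key, _, val = line.partition(":")
        key, val = key.strip(), val.strip()
        if key == "processor":
            nproc += 1
        elif key == "physical id":   # appears before "core id" per stanza
            phys = val
            sockets.add(val)
        elif key == "core id":
            cores.add((phys, val))
    cores_per_socket = len(cores) // len(sockets)
    threads_per_core = nproc // len(cores)
    return len(sockets), cores_per_socket, threads_per_core

# Usage on a real node:
#   topology(open("/proc/cpuinfo").read())
```

On a node matching the /proc/cpuinfo above, this should return 
(4, 24, 2), which is what the slurm.conf definition expects and what 
"slurmd -C" should report.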


