[slurm-users] 4 sockets but "
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Jul 20 18:30:47 UTC 2021
Hi Diego,
>> 2. Did you define a Sub NUMA Cluster (SNC) BIOS setting? Then each
>> physical socket would show up as two sockets (memory controllers), for
>> a total of 8 "sockets" in your 4-socket system.
> I don't think so. Unless that's the default, I didn't change anything in
> the BIOS. Just checked the second of the three nodes (still no OS
> installed) and found it under Chipset Configuration -> North bridge ->
> UPI configuration -> SNC : it's set to Disable.
OK. Performance may be a bit higher with SNC enabled.
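By the way, you can verify the SNC state from the OS side with lscpu:
with SNC disabled a 4-socket node should report one NUMA node per
socket, while SNC-2 would double that. A sketch (the exact count
depends on the SNC mode):

   # SNC disabled: one NUMA node per socket on a 4-socket box
   $ lscpu | grep -i 'numa node(s)'
   NUMA node(s):        4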
>> 3. Which Slurm version are you running, and which OS version?
> Slurm 18.08.5, standard packages from Debian stable.
Uh, that's an old Slurm version with many bugs that have been fixed in
later releases. It seems that a number of sites use the very old Debian
distribution packages rather than modern Slurm versions :-(
>> 4. What is the output of "slurmd -C" (Print actual hardware
>> configuration) on the node?
> root at str957-mtx-20:~# slurmd -C
> NodeName=str957-mtx-20 CPUs=192 Boards=1 SocketsPerBoard=4
> CoresPerSocket=24 ThreadsPerCore=2 RealMemory=1160347
> UpTime=1-01:50:48
> So the node seems to correctly recognize the underlying HW.
>
> I tried to copy&paste "SocketsPerBoard=4 CoresPerSocket=24
> ThreadsPerCore=2 RealMemory=1160347" in the nodes definition line (the
> only difference being SocketsPerBoard instead of Sockets), but the
> result is always the same.
In 20.02 there was a bug with Boards and SocketsPerBoard; I don't
remember if that was a problem in 18.08 as well. See the link below for
references.
Maybe delete Boards=1 SocketsPerBoard=4 and try Sockets=4 instead?
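For example, matching the "slurmd -C" output above, the node definition
might become (an untested sketch, keeping your original Weight and
Feature values):

   NodeName=str957-mtx-[20-22] Sockets=4 CoresPerSocket=24 \
      ThreadsPerCore=2 RealMemory=1160347 Weight=8 \
      Feature=ib,matrix,intel,avx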
>> Maybe my Wiki notes could be helpful?
>> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#compute-node-configuration
>
> Tks. Interesting, but I don't see pam_slurm_adopt. Other than that, it
> seems very much like what I'm doing.
The pam_slurm_adopt module is very useful :-)
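If you want to try it, the setup is typically a single line at the end
of the account section of the sshd PAM stack (a sketch for Debian; the
exact file and module path can vary by distribution):

   # /etc/pam.d/sshd -- adopt incoming SSH sessions into the user's job
   account    required     pam_slurm_adopt.so

With that in place, users can SSH only into nodes where they have a
running job, and their login shell is adopted into the job's cgroup.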
>> On 7/20/21 12:49 PM, Diego Zuccato wrote:
>>> Hello all.
>>>
>>> I've been facing this issue since yesterday.
>>> I'm configuring 3 new quad-socket nodes defined as:
>>> NodeName=str957-mtx-[20-22] Sockets=4 CoresPerSocket=24 \
>>> RealMemory=1160347 Weight=8 Feature=ib,matrix,intel,avx
>>>
>>> But scontrol show node str957-mtx-20 reports:
>>> NodeName=str957-mtx-20 CoresPerSocket=1
>>> [...]
>>> RealMemory=1160347 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
>>> MemSpecLimit=2048
>>> State=DOWN*+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=8 Owner=N/A
>>> MCS_label=N/A
>>> Partitions=m3
>>> BootTime=None SlurmdStartTime=None
>>> CfgTRES=cpu=4,mem=1160347M,billing=2783
>>> AllocTRES=
>>> CapWatts=n/a
>>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>> Reason=Low socket*core*thread count, Low CPUs
>>>
>>> The only thing that's right is that ThreadsPerCore=2 ...
>>> The last block from "cat /proc/cpuinfo" on the node reports:
>>> processor : 191
>>> vendor_id : GenuineIntel
>>> cpu family : 6
>>> model : 85
>>> model name : Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
>>> stepping : 7
>>> microcode : 0x5003003
>>> cpu MHz : 1200.331
>>> cache size : 36608 KB
>>> physical id : 3
>>> siblings : 48
>>> core id : 29
>>> cpu cores : 24
>>> apicid : 251
>>> initial apicid : 251
>>> fpu : yes
>>> fpu_exception : yes
>>> cpuid level : 22
>>> wp : yes
>>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
>>> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
>>> pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs
>>> bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni
>>> pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16
>>> xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
>>> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm
>>> 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin
>>> ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority
>>> ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
>>> cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb
>>> intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves
>>> cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln
>>> pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
>>> bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs taa
>>> itlb_multihit
>>> bogomips : 5804.13
>>> clflush size : 64
>>> cache_alignment : 64
>>> address sizes : 46 bits physical, 48 bits virtual
>>> power management:
>>>
>>> I already tried changing the line to cheat (telling it there are only
>>> 2 sockets with 24 cores, thus reducing it to half a node), but nothing
>>> changed. I restarted slurmctld after every change in slurm.conf just
>>> to be sure.
>>> Any idea?
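One more note: after changing a node's definition in slurm.conf you
need to restart slurmd on the nodes as well, not only slurmctld, and
the node will stay drained until you resume it. Something like this (a
sketch, assuming systemd-managed daemons):

   # on each node
   systemctl restart slurmd
   # on the controller
   systemctl restart slurmctld
   scontrol update NodeName=str957-mtx-[20-22] State=RESUME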