[slurm-users] 4 sockets but "
mercan
ahmet.mercan at uhem.itu.edu.tr
Tue Jul 20 16:02:03 UTC 2021
Hi;
Did you check the slurmctld log for a complaint about the host line? If
slurmctld cannot recognize a parameter, it may give up processing the
whole host line.
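A minimal sketch of that log check, in case it helps. The keyword list is only my guess at what a complaint might contain, not slurmctld's actual wording, and the log location is whatever SlurmctldLogFile points to in your slurm.conf (the path in the comment is an assumption):

```python
# Hypothetical helper: filter log lines that look like complaints.
# The keywords are assumptions, not the exact messages slurmctld emits.
def suspicious_lines(log_lines, node=None,
                     keywords=("error", "invalid", "unrecognized",
                               "unknown", "parse")):
    """Return lines containing any keyword (case-insensitive).

    If node is given, keep only the lines that also mention it.
    """
    hits = []
    for line in log_lines:
        lowered = line.lower()
        if not any(word in lowered for word in keywords):
            continue
        if node is not None and node not in line:
            continue
        hits.append(line.rstrip("\n"))
    return hits

# Usage (path is an assumption; check SlurmctldLogFile in slurm.conf):
# with open("/var/log/slurm/slurmctld.log") as fh:
#     for hit in suspicious_lines(fh):
#         print(hit)
```

Running it once without the node filter is probably safer, since a complaint about a bad parameter may not mention the node name at all.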
Ahmet M.
On 20.07.2021 at 13:49, Diego Zuccato wrote:
> Hello all.
>
> I've been facing this issue since yesterday.
> I'm configuring 3 new quad-socket nodes defined as:
> NodeName=str957-mtx-[20-22] Sockets=4 CoresPerSocket=24 \
> RealMemory=1160347 Weight=8 Feature=ib,matrix,intel,avx
>
> But scontrol show node str957-mtx-20 reports:
> NodeName=str957-mtx-20 CoresPerSocket=1
> [...]
> RealMemory=1160347 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
> MemSpecLimit=2048
> State=DOWN*+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=8 Owner=N/A
> MCS_label=N/A
> Partitions=m3
> BootTime=None SlurmdStartTime=None
> CfgTRES=cpu=4,mem=1160347M,billing=2783
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Low socket*core*thread count, Low CPUs
>
> The only thing that's right is that ThreadsPerCore=2 ...
> The last block from "cat /proc/cpuinfo" on the node reports:
> processor : 191
> vendor_id : GenuineIntel
> cpu family : 6
> model : 85
> model name : Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
> stepping : 7
> microcode : 0x5003003
> cpu MHz : 1200.331
> cache size : 36608 KB
> physical id : 3
> siblings : 48
> core id : 29
> cpu cores : 24
> apicid : 251
> initial apicid : 251
> fpu : yes
> fpu_exception : yes
> cpuid level : 22
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
> rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq
> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm
> pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes
> xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
> cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp
> ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase
> tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f
> avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw
> avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc
> cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke
> avx512_vnni md_clear flush_l1d arch_capabilities
> bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs taa
> itlb_multihit
> bogomips : 5804.13
> clflush size : 64
> cache_alignment : 64
> address sizes : 46 bits physical, 48 bits virtual
> power management:
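
For what it's worth, the cpuinfo block above is consistent with the node line: "physical id : 3" suggests 4 sockets (assuming the ids run 0 to 3), "cpu cores : 24" gives CoresPerSocket=24, and siblings / cpu cores = 48 / 24 = 2 threads per core, so 4 * 24 * 2 = 192 CPUs, which matches "processor : 191" being the last entry. Here is a rough sketch that derives the same numbers from the full file. The function names are mine, and it assumes the usual x86 field names shown above:

```python
# Sketch: derive the socket/core/thread layout from /proc/cpuinfo text.
# Assumes the x86 fields "processor", "physical id", "cpu cores" and
# "siblings" are present, as in the block quoted above.
def topology_from_cpuinfo(text):
    sockets = set()
    cpus = 0
    cores_per_socket = None
    siblings = None
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:  # wrapped continuation lines have no colon; skip them
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            cpus += 1
        if "physical id" in fields:
            sockets.add(fields["physical id"])
        if "cpu cores" in fields:
            cores_per_socket = int(fields["cpu cores"])
        if "siblings" in fields:
            siblings = int(fields["siblings"])
    if not sockets or not cores_per_socket or not siblings:
        raise ValueError("cpuinfo lacks the expected topology fields")
    return {
        "Sockets": len(sockets),
        "CoresPerSocket": cores_per_socket,
        "ThreadsPerCore": siblings // cores_per_socket,
        "CPUs": cpus,
    }

def node_line_fragment(topo):
    """Format the topology as slurm.conf node parameters."""
    return ("Sockets={Sockets} CoresPerSocket={CoresPerSocket} "
            "ThreadsPerCore={ThreadsPerCore}".format(**topo))

# Usage on the node itself:
# with open("/proc/cpuinfo") as fh:
#     print(node_line_fragment(topology_from_cpuinfo(fh.read())))
```

Running "slurmd -C" on the node should print the configuration slurmd itself detects, which is worth comparing against both this result and the line in slurm.conf.
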
>
> I already tried changing the line to cheat (telling it there are only
> 2 sockets with 24 cores, thus reducing to half node) but nothing
> changed. I restarted slurmctld after every change in slurm.conf just
> to be sure.
> Any idea?
>
> Tks.
>