[slurm-users] 'slurmd -C' not returning correct information

Prentice Bisbal pbisbal@pppl.gov
Thu Jan 17 20:09:37 UTC 2019


It appears that 'slurmd -C' is not returning the correct information for 
some of the systems in my very heterogeneous cluster.

For example, take the node dawson081:

[root@dawson081 ~]# slurmd -C
NodeName=dawson081 slurmd: Considering each NUMA node as a socket
CPUs=32 Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 
RealMemory=64554
UpTime=2-09:30:47

Since Boards and CPUs are mutually exclusive, I omitted CPUs and added 
this line to my slurm.conf:

NodeName=dawson[064,066,068-069,071-072,074-079,081,083,085-086,088-099,101-102,105,108-117] 
Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 
RealMemory=64554

When I restart Slurm, however, I get the following messages in 
slurmctld.log:

[2019-01-17T14:54:47.788] error: Node dawson081 has high 
socket,core,thread count (4,8,1 > 2,16,1), extra resources ignored
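(For what it's worth, both triplets describe the same 32 logical CPUs: 
4 x 8 x 1 = 32 on one side and 2 x 16 x 1 = 32 on the other, so the two 
views disagree on the factoring, not the total.)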

lscpu on that same node shows a different hardware layout:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 1
Model name:            AMD Opteron(TM) Processor 6274
Stepping:              2
CPU MHz:               2200.000
BogoMIPS:              4399.39
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
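
lscpu's view is yet another factoring of the same total: 2 sockets x 8 
cores x 2 threads = 32 logical CPUs. If lscpu is the layout to trust, 
I'd guess the node definition should look more like this instead 
(untested sketch, same node list as above):

NodeName=dawson[064,066,068-069,071-072,074-079,081,083,085-086,088-099,101-102,105,108-117] 
Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 
RealMemory=64554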

Both slurmd and slurmctld are version 18.08.4. I built the Slurm RPMs 
for both at the same time on the same system, so they were linked to the 
same hwloc. Any ideas why there's a discrepancy? How should I deal with 
this?

Both the compute node and the Slurm controller are using CentOS 6.10 and 
have hwloc-1.5-3 installed.

Thanks for the help.

-- 
Prentice



