[slurm-users] 'slurmd -c' not returning correct information
Prentice Bisbal
pbisbal at pppl.gov
Thu Jan 17 20:09:37 UTC 2019
It appears that 'slurmd -C' is not returning the correct information for
some of the systems in my very heterogeneous cluster.
For example, take the node dawson081:
[root at dawson081 ~]# slurmd -C
slurmd: Considering each NUMA node as a socket
NodeName=dawson081
CPUs=32 Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1
RealMemory=64554
UpTime=2-09:30:47
Since Boards and CPUs are mutually exclusive, I omitted CPUs and added
this line to my slurm.conf:
NodeName=dawson[064,066,068-069,071-072,074-079,081,083,085-086,088-099,101-102,105,108-117]
Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1
RealMemory=64554
When I restart Slurm, however, I get the following message in
slurmctld.log:
[2019-01-17T14:54:47.788] error: Node dawson081 has high
socket,core,thread count (4,8,1 > 2,16,1), extra resources ignored
lscpu on that same node shows a different hardware layout:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 21
Model: 1
Model name: AMD Opteron(TM) Processor 6274
Stepping: 2
CPU MHz: 2200.000
BogoMIPS: 4399.39
Virtualization: AMD-V
L1d cache: 16K
L1i cache: 64K
L2 cache: 2048K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
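For what it's worth, the two tools agree on the total CPU count and only disagree on how it is factored into sockets, cores, and threads (slurmd -C is apparently counting each of the four NUMA nodes as a socket and ignoring SMT, while lscpu reports the two physical sockets with two threads per core). A quick sanity check with the numbers quoted above:

```python
# Topology as reported by each tool on dawson081 (values taken from the
# outputs quoted above).
slurmd_view = {"sockets": 4, "cores_per_socket": 8, "threads_per_core": 1}
lscpu_view = {"sockets": 2, "cores_per_socket": 8, "threads_per_core": 2}

def total_cpus(view):
    """Total logical CPUs implied by a sockets/cores/threads breakdown."""
    return view["sockets"] * view["cores_per_socket"] * view["threads_per_core"]

print(total_cpus(slurmd_view))  # 32
print(total_cpus(lscpu_view))   # 32
```

So both factorizations multiply out to the same 32 CPUs; the discrepancy is purely in the socket/thread accounting.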
Both slurmd and slurmctld are version 18.08.4. I built the Slurm RPMs
for both at the same time on the same system, so they were linked against
the same hwloc. Any ideas why there's a discrepancy? How should I deal
with this?
Both the compute node and the Slurm controller are using CentOS 6.10 and
have hwloc-1.5-3 installed.
Thanks for the help.
--
Prentice