[slurm-users] 'slurmd -c' not returning correct information

Prentice Bisbal pbisbal at pppl.gov
Thu Jan 17 20:38:54 UTC 2019


Never mind. This was a layer 8 problem: I was editing the wrong 
slurm.conf. We recently switched to using RPMs, and I was accidentally 
editing the file in the location used before we made that switch. It 
turns out those errors had always been in slurmctld.log, and no one 
ever noticed. Now that I am using the output of 'slurmd -C' in the 
correct file, those errors have gone away.
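
In case it saves someone else the same headache: rather than trusting 
the path you think you edited, you can ask the running daemons which 
config file they actually loaded. A minimal sanity check (a sketch, not 
from my actual session; the path shown is just the usual RPM default):

scontrol show config | grep SLURM_CONF
# On an RPM install this typically reports /etc/slurm/slurm.conf;
# our pre-RPM install kept its copy elsewhere, which is how the two
# files diverged without anyone noticing.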

What is interesting is that the configuration produced by 'slurmd -C' 
treats each NUMA node as a separate socket (4 sockets), while the old 
configuration in slurm.conf matched the physical layout (2 sockets). In 
other words, it was the 'correct' physical configuration that had been 
triggering those errors all along.
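
For completeness, the entry that now lives in the correct slurm.conf 
simply mirrors the 'slurmd -C' output (the NUMA-node-as-socket view) 
rather than the lscpu view:

NodeName=dawson[064,066,068-069,071-072,074-079,081,083,085-086,088-099,101-102,105,108-117] Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64554

followed by a restart of slurmctld to pick up the change.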

Prentice

On 1/17/19 3:09 PM, Prentice Bisbal wrote:
> It appears that 'slurmd -C' is not returning the correct information 
> for some of the systems in my very heterogeneous cluster.
>
> For example, take the node dawson081:
>
> [root@dawson081 ~]# slurmd -C
> slurmd: Considering each NUMA node as a socket
> NodeName=dawson081 CPUs=32 Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64554
> UpTime=2-09:30:47
>
> Since Boards and CPUs are mutually exclusive, I omitted CPUs and added 
> this line to my slurm.conf:
>
> NodeName=dawson[064,066,068-069,071-072,074-079,081,083,085-086,088-099,101-102,105,108-117] 
> Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 
> RealMemory=64554
>
> When I restart Slurm, however, I get the following messages in 
> slurmctld.log:
>
> [2019-01-17T14:54:47.788] error: Node dawson081 has high 
> socket,core,thread count (4,8,1 > 2,16,1), extra resources ignored
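>
> As a side-by-side check (a sketch; output trimmed to the relevant 
> fields), you can compare what the controller has registered for the 
> node against what slurmd itself reports:
>
> scontrol show node dawson081 | grep -E 'Sockets|CoresPerSocket|ThreadsPerCore'
> # shows the Sockets/CoresPerSocket/ThreadsPerCore values slurmctld
> # is enforcing, which should match the 'slurmd -C' line above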
>
> lscpu on that same node shows a different hardware layout:
>
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                32
> On-line CPU(s) list:   0-31
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             2
> NUMA node(s):          4
> Vendor ID:             AuthenticAMD
> CPU family:            21
> Model:                 1
> Model name:            AMD Opteron(TM) Processor 6274
> Stepping:              2
> CPU MHz:               2200.000
> BogoMIPS:              4399.39
> Virtualization:        AMD-V
> L1d cache:             16K
> L1i cache:             64K
> L2 cache:              2048K
> L3 cache:              6144K
> NUMA node0 CPU(s):     0-7
> NUMA node1 CPU(s):     8-15
> NUMA node2 CPU(s):     16-23
> NUMA node3 CPU(s):     24-31
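>
> Since slurmd discovers the topology through hwloc rather than lscpu, 
> it can also help to look at hwloc's own view of the machine (a 
> sketch, assuming the hwloc utilities are installed alongside the 
> library):
>
> lstopo-no-graphics
> # prints the topology tree hwloc sees: sockets, NUMA nodes, caches,
> # and cores, which is the view 'slurmd -C' is built from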
>
> Both slurmd and slurmctld are version 18.08.4. I built the Slurm RPMs 
> for both at the same time on the same system, so they were linked 
> against the same hwloc. Any ideas why there's a discrepancy? How 
> should I deal with this?
>
> Both the compute node and the Slurm controller are using CentOS 6.10 
> and have hwloc-1.5-3 installed.
>
> Thanks for the help.
>


