[slurm-users] slurmd -C showing incorrect core count
Kirill 'kkm' Katsnelson
kkm at pobox.com
Mon Mar 9 00:32:04 UTC 2020
To answer your direct question, the ground truth of 'slurmd -C' is what
the kernel thinks the hardware is (what you see in lscpu, except that slurmd
probably employs some tricks for VMs with an odd topology). And slurmd got
severely confused by what the kernel reported to it. I know from experience
that certain odd cloud VM shapes throw it off balance.
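To see exactly where the two views disagree, the comparison can be
sketched in Python. This is only a minimal illustration, not part of
Slurm: the function names are made up for the example, the sample strings
are copied from the output quoted further down, and the lscpu field labels
are assumed to match the ones shown there (they can differ between lscpu
versions).

```python
import re

# Sample text copied from the output quoted later in this thread.
SLURMD_C_SAMPLE = (
    "NodeName=liverpool CPUs=10 Boards=1 SocketsPerBoard=10 CoresPerSocket=1 "
    "ThreadsPerCore=1 RealMemory=64263\nUpTime=1-21:55:36"
)
LSCPU_SAMPLE = """\
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 4
"""


def parse_slurmd_c(text):
    """Collect the KEY=VALUE tokens printed by `slurmd -C` into a dict."""
    return dict(re.findall(r"(\w+)=(\S+)", text))


def parse_lscpu(text):
    """Collect the `Label: value` lines printed by lscpu into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def topology_mismatches(slurmd_text, lscpu_text):
    """Return {name: (slurmd_value, lscpu_value)} for every field that differs."""
    s = parse_slurmd_c(slurmd_text)
    k = parse_lscpu(lscpu_text)
    seen_by_slurmd = {
        "CPUs": int(s["CPUs"]),
        "Sockets": int(s["Boards"]) * int(s["SocketsPerBoard"]),
        "CoresPerSocket": int(s["CoresPerSocket"]),
        "ThreadsPerCore": int(s["ThreadsPerCore"]),
    }
    seen_by_kernel = {
        "CPUs": int(k["CPU(s)"]),
        "Sockets": int(k["Socket(s)"]),
        "CoresPerSocket": int(k["Core(s) per socket"]),
        "ThreadsPerCore": int(k["Thread(s) per core"]),
    }
    return {
        name: (seen_by_slurmd[name], seen_by_kernel[name])
        for name in seen_by_slurmd
        if seen_by_slurmd[name] != seen_by_kernel[name]
    }


for name, (slurmd_value, lscpu_value) in topology_mismatches(
    SLURMD_C_SAMPLE, LSCPU_SAMPLE
).items():
    print(f"{name}: slurmd -C says {slurmd_value}, lscpu says {lscpu_value}")
```

With the sample data, the CPU count, the socket count and the cores per
socket all differ, while the threads per core agree.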
I do not really like the output of lscpu. I have never seen such a strange
shape of a VM. CPU family 15 is in the Pentium 4 line <
https://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers>,
and model 6 was the last breath of this unsuccessful NetBurst
architecture--such a rarity that the Linux kernel does not even have it in
its database: "Common KVM processor" is a slug for "everything else that one
of these soul-sapping KVMs may return". Flags show that the processor
supports SSE2 and SSE3, but not SSE4.1, SSE4.2 or AVX, which is consistent
with a Pentium 4, but 16M of L3 cache is about the average total RAM of a
desktop at the time the P4 was a thing. And the CPU is reported as NUMA (no
real Pentium 4 had NUMA, only SMP)¹.
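The flag check above is easy to repeat on any node. The following Python
sketch is an illustration only: the function names are invented for the
example, the sample flags are copied from the lscpu output quoted further
down, and it assumes the usual Linux flag spellings (pni for SSE3, sse4_1,
sse4_2, avx, avx2, and avx512f as the baseline AVX-512 flag; AVX-512 has
many more sub-feature flags that are not checked here).

```python
# Flags copied from the lscpu output quoted later in this thread.
FLAGS_SAMPLE = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology "
    "eagerfpu pni cx16 x2apic hypervisor lahf_lm"
)

# SIMD extensions a modern compute node would normally expose to its guest.
WANTED = ("sse4_1", "sse4_2", "avx", "avx2", "avx512f")


def missing_simd_features(flags_line, wanted=WANTED):
    """Return the wanted flags that do not appear in a space-separated flag list."""
    present = set(flags_line.split())
    return [flag for flag in wanted if flag not in present]


def read_cpuinfo_flags(path="/proc/cpuinfo"):
    """Return the flag list of the first CPU in /proc/cpuinfo (Linux only)."""
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return line.partition(":")[2].strip()
    return ""


print("missing:", ", ".join(missing_simd_features(FLAGS_SAMPLE)))
```

On a real node, missing_simd_features(read_cpuinfo_flags()) gives the same
report for the CPU that the guest actually sees.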
> Any advice?
My best advice would be to either use a different hypervisor or correctly
tune the one you have. Sometimes a hypervisor is tuned for live VM
migration, where a VM is frozen on one hardware type and thawed on another,
and may tweak the CPUID in advance to hide features from the guest OS so
that the guest can keep running if migrated to less capable hardware; but
still, using the P4 as the least common denominator is way too extreme.
Something is seriously wrong on the KVM host.
The VM itself is braindead. Even if you get it up and running, the absence
of SSE4.1 and 4.2, AVX, AVX2, and AVX512² would make it about as efficient
a computing node as a brick. Unless the host CPU is really a Presler
Pentium 4, in which case you are long overdue for a hardware upgrade :)))
-kkm
____
¹ It's not impossible that lscpu shows an SMP machine as if containing a
single NUMA node, but I have a recollection that this is not the case. I
haven't seen a non-NUMA CPU in quite a while.
² Intel had gone beside-itself-creative this time. It was an even bigger
naming leap than switching from Roman to decimal numerals between Pentium
III and Pentium *drum roll* 4 *cymbal crash*.
On Sun, Mar 8, 2020 at 1:20 PM mike tie <mtie at carleton.edu> wrote:
>
> I am running a slurm client on a virtual machine. The virtual machine
> originally had a core count of 10. I have now increased the cores to
> 16, but "slurmd -C" continues to show 10. I have increased the core count
> in the slurm.conf file, and that is being seen correctly. The state of the
> node is stuck in a Drain state because of this conflict. How do I get
> slurmd -C to see the new number of cores?
>
> I'm running slurm 18.08. I have tried running "scontrol reconfigure" on
> the head node. I have restarted slurmd on all the client nodes, and I have
> restarted slurmctld on the master node.
>
> Where is the data about compute node CPUs stored? I can't seem to find a
> config or setting file on the compute node.
>
> The compute node that I am working on is "liverpool"
>
> mtie at liverpool ~ $ slurmd -C
> NodeName=liverpool CPUs=10 Boards=1 SocketsPerBoard=10 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=64263
> UpTime=1-21:55:36
>
>
> mtie at liverpool ~ $ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                16
> On-line CPU(s) list:   0-15
> Thread(s) per core:    1
> Core(s) per socket:    4
> Socket(s):             4
> NUMA node(s):          1
> Vendor ID:             GenuineIntel
> CPU family:            15
> Model:                 6
> Model name:            Common KVM processor
> Stepping:              1
> CPU MHz:               2600.028
> BogoMIPS:              5200.05
> Hypervisor vendor:     KVM
> Virtualization type:   full
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              4096K
> L3 cache:              16384K
> NUMA node0 CPU(s):     0-15
> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology eagerfpu pni cx16 x2apic hypervisor lahf_lm
>
>
> mtie at liverpool ~ $ more /etc/slurm/slurm.conf | grep liverpool
> NodeName=liverpool NodeAddr=137.22.10.202 CPUs=16 State=UNKNOWN
> PartitionName=BioSlurm Nodes=liverpool Default=YES MaxTime=INFINITE State=UP
>
>
> mtie at liverpool ~ $ sinfo -n liverpool -o %c
> CPUS
> 16
>
> mtie at liverpool ~ $ sinfo -n liverpool -o %E
> REASON
> Low socket*core*thread count, Low CPUs
>
>
>
> Any advice?
>
>
>