[slurm-users] slurmd -C showing incorrect core count

mike tie mtie at carleton.edu
Mon Mar 9 14:44:39 UTC 2020


Interesting.  I'm still confused about where slurmd -C is getting its
data.  When I think of where the kernel exposes info about the processor, I
normally think of /proc/cpuinfo (by the way, I am running CentOS 7 in the
VM; the hypervisor is VMware), and /proc/cpuinfo does show 16 cores.

I understand your concern about the processor.  So I tried a different
VM, where I see the following specs:

vendor_id  : GenuineIntel
cpu family : 6
model      : 85
model name : Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz


When I increase the core count on that VM, reboot, and run slurmd -C, it
too continues to show the original, lower core count.

Specifically, how is slurmd -C getting that info?  Maybe this is a kernel
issue, but other than lscpu and /proc/cpuinfo, I don't know where to look.
Maybe I should be looking at the slurmd source?
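
For reference, below is a minimal sketch, in plain C, of querying the
topology the kernel reports via sysconf and hwloc, which is roughly the kind
of data slurmd -C has to collect. This is only an illustration, not slurmd's
actual source code, and it assumes the hwloc development headers are
installed (build with: gcc cpu_probe.c -lhwloc -o cpu_probe):

/*
 * cpu_probe.c -- illustrative only, not slurmd source code.
 * Prints the CPU topology as seen by the kernel (via sysconf) and by
 * hwloc, one of the libraries Slurm can be built against.
 */
#include <stdio.h>
#include <unistd.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;

    /* Online CPU count straight from the kernel. */
    printf("sysconf online CPUs: %ld\n", sysconf(_SC_NPROCESSORS_ONLN));

    /* Discover the full topology with hwloc. */
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int packages = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE);
    int cores    = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    int pus      = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);

    printf("hwloc: sockets=%d cores=%d threads/core=%d\n",
           packages, cores, cores > 0 ? pus / cores : 0);

    hwloc_topology_destroy(topo);
    return 0;
}

If something like this reports 16 cores while slurmd -C still reports 10,
the problem is presumably on the Slurm side rather than in the kernel's view
of the hardware.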

-Mike



Michael Tie, Technical Director
Mathematics, Statistics, and Computer Science

 One North College Street      phn: 507-222-4067
 Northfield, MN 55057          cel: 952-212-8933
 mtie at carleton.edu           fax: 507-222-4312


On Sun, Mar 8, 2020 at 7:32 PM Kirill 'kkm' Katsnelson <kkm at pobox.com>
wrote:

> To answer your direct question, the ground truth of 'slurmd -C' is what
> the kernel thinks the hardware is (what you see in lscpu, except it
> probably employs some tricks for VMs with an odd topology). And it got
> severely confused by what the kernel reported to it. I know from experience
> that certain odd cloud VM shapes throw it off balance.
>
> I do not really like the output of lscpu; I have never seen such a strange
> shape of a VM. CPU family 15 is in the Pentium 4 line <
> https://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers>,
> and model 6 was the last breath of that unsuccessful NetBurst
> architecture--such a rarity that the Linux kernel does not even have it in
> its database: "Common KVM processor" is a catch-all for "everything else
> that one of these soul-sapping KVMs may return". The flags show that the
> processor supports SSE2 and SSE3, but not SSE4.1, SSE4.2, or AVX, which is
> consistent with a Pentium 4, yet 16M of L3 cache is about the average total
> RAM of a desktop at the time the P4 was a thing. And the CPU reports NUMA
> (no real Pentium 4 had NUMA, only SMP)¹.
>
> Any advice?
>>
>
> My best advice would be to either use a different hypervisor or correctly
> tune the one you have. Sometimes a hypervisor is tuned for live VM
> migration, where a VM is frozen on one hardware type and thawed on another,
> and it may tweak the CPUID in advance to hide features from the guest OS so
> that the guest can keep running if migrated to less capable hardware; but
> still, using the P4 as the least common denominator is way too extreme.
> Something is seriously wrong on the KVM host.
>
> The VM itself is braindead. Even if you get it up and running, the absence
> of SSE4.1 and 4.2, AVX, AVX2, and AVX-512² would make it about as efficient
> a compute node as a brick. Unless the host CPU really is a Presler
> Pentium 4, in which case you are way overdue for a hardware upgrade :)))
>
>  -kkm
>   ____
> ¹ It's not impossible that lscpu shows an SMP machine as containing a
> single NUMA node, but my recollection is that it does not. I haven't seen
> a non-NUMA CPU in quite a while.
> ² Intel went beside-itself-creative this time. It was an even bigger
> naming leap than switching from Roman to Arabic numerals between Pentium
> III and Pentium *drum roll* 4 *cymbal crash*.
>
>
> On Sun, Mar 8, 2020 at 1:20 PM mike tie <mtie at carleton.edu> wrote:
>
>>
>> I am running a Slurm client on a virtual machine.  The virtual machine
>> originally had a core count of 10.  I have now increased the cores to 16,
>> but "slurmd -C" continues to show 10.  I have increased the core count in
>> the slurm.conf file, and that is being seen correctly.  The node is stuck
>> in a Drain state because of this conflict.  How do I get slurmd -C to see
>> the new number of cores?
>>
>> I'm running Slurm 18.08.  I have tried running "scontrol reconfigure" on
>> the head node.  I have restarted slurmd on all the client nodes, and I have
>> restarted slurmctld on the master node.
>>
>> Where is the data about compute node CPUs stored?  I can't seem to find a
>> config or settings file on the compute node.
>>
>> The compute node that I am working on is "liverpool"
>>
>> mtie at liverpool ~ $ slurmd -C
>> NodeName=liverpool CPUs=10 Boards=1 SocketsPerBoard=10 CoresPerSocket=1
>> ThreadsPerCore=1 RealMemory=64263
>> UpTime=1-21:55:36
>>
>>
>> mtie at liverpool ~ $ lscpu
>> Architecture:          x86_64
>> CPU op-mode(s):        32-bit, 64-bit
>> Byte Order:            Little Endian
>> CPU(s):                16
>> On-line CPU(s) list:   0-15
>> Thread(s) per core:    1
>> Core(s) per socket:    4
>> Socket(s):             4
>> NUMA node(s):          1
>> Vendor ID:             GenuineIntel
>> CPU family:            15
>> Model:                 6
>> Model name:            Common KVM processor
>> Stepping:              1
>> CPU MHz:               2600.028
>> BogoMIPS:              5200.05
>> Hypervisor vendor:     KVM
>> Virtualization type:   full
>> L1d cache:             32K
>> L1i cache:             32K
>> L2 cache:              4096K
>> L3 cache:              16384K
>> NUMA node0 CPU(s):     0-15
>> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
>> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm
>> constant_tsc nopl xtopology eagerfpu pni cx16 x2apic hypervisor lahf_lm
>>
>>
>> mtie at liverpool ~ $ more /etc/slurm/slurm.conf | grep liverpool
>> NodeName=liverpool NodeAddr=137.22.10.202 CPUs=16 State=UNKNOWN
>> PartitionName=BioSlurm Nodes=liverpool  Default=YES MaxTime=INFINITE
>> State=UP
>>
>>
>> mtie at liverpool ~ $ sinfo -n liverpool -o %c
>> CPUS
>> 16
>>
>> mtie at liverpool ~ $ sinfo -n liverpool -o %E
>> REASON
>> Low socket*core*thread count, Low CPUs
>>
>>
>>
>> Any advice?
>>
>>
>>