[slurm-users] slurmd -C showing incorrect core count

Wed Mar 11 06:14:38 UTC 2020

On Tue, Mar 10, 2020 at 1:41 PM mike tie <mtie at carleton.edu> wrote:

> Here is the output of lstopo
>
> $ lstopo -p
> Machine (63GB)
>   Package P#0 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#2
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#3
>   Package P#1 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#4
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#5
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#6
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#7
>   Package P#2 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#8
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#9
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#10
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#11
>   Package P#3 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#12
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#13
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#14
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#15

There is no sane way to derive the number 10 from this topology. Obviously,
it has a prime factor of 5, but everything in the lstopo output is sized in
powers of 2 (4 packages, a.k.a. sockets, with 4 single-threaded CPU cores
each, for 16 cores in total).
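
To double-check the arithmetic, the lstopo text can be tallied mechanically.
This is only a rough sketch: summarize_lstopo is a made-up helper name, and
it assumes the "Package P#", "Core P#" and "PU P#" labels appear exactly as
in the quoted output.

```shell
# Hypothetical helper: tally packages, cores and PUs from `lstopo -p` text
# read on stdin. It only counts label occurrences, nothing smarter.
summarize_lstopo() {
    awk '/Package P#/ { pkgs++ }
         /Core P#/    { cores++ }
         /PU P#/      { pus++ }
         END { printf "packages=%d cores=%d pus=%d\n", pkgs, cores, pus }'
}
```

Run as `lstopo -p | summarize_lstopo`; for the listing quoted above this
should report 4 packages, 16 cores and 16 PUs, none of which gives 10.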

I responded yesterday but somehow managed to plop my signature into the
middle of the message, so maybe you missed the inline replies below it?

It's very, very likely that the number is stored *somewhere*. The first
hypothesis to eliminate is that the number is acquired from the control
daemon. That's the simplest step and the largest landgrab in the
divide-and-conquer analysis plan. Then just look at where it comes from on
the VM. strace(1) will reveal all files slurmd reads.
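
A minimal sketch of that strace step, under assumptions: the function names
trace_slurmd and opened_paths are made up for illustration, the trace is
assumed to be in strace's default text format, and it may need to be run as
root. Whether the stray number really comes from a file is still only a
hypothesis.

```shell
# Hypothetical helper: record every open()/openat() call made by
# `slurmd -C`. -f follows child processes, -o writes the trace to file $1.
trace_slurmd() {
    strace -f -e trace=open,openat -o "$1" slurmd -C
}

# Hypothetical helper: list the unique paths that were opened successfully
# in trace file $1 (failed calls, which return -1, are dropped).
opened_paths() {
    grep 'open' "$1" | grep -v ' = -1 ' \
        | sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' | sort -u
}
```

For example, `trace_slurmd /tmp/slurmd-C.trace` followed by
`opened_paths /tmp/slurmd-C.trace` gives a short list of candidate files to
inspect for a stored CPU count.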

You are not rolling out the VMs from an image, are you? I'm wondering why
you need to tweak an existing VM that is already in a weird state. Is
simply setting its snapshot aside and creating a new VM from an image
hard/impossible? I have not touched VMware in more than 10 years, so I may
be a bit naive; on the platform I work with now (GCE), the create-use-drop
pattern of VM use is much more common and simpler than creating a VM and
maintaining it either *ad infinitum* or *ad nauseam*, whichever is reached
first. But I don't know anything about current VMware; maybe it's not
possible or feasible with it.

 -kkm