[slurm-users] slurmd -C showing incorrect core count

Wed Mar 11 14:26:18 UTC 2020

Yep, slurmd -C is obviously getting the data from somewhere, either a local
file or the master node. Hence my email to the group; I was hoping
that someone would just say: "yeah, modify file xxxx". But oh well. I'll
start playing with strace and gdb later this week; looking through the
source might also be helpful.
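
A rough sketch of the strace step, assuming the trace is captured with
something like `strace -f -e trace=file -o slurmd.trace slurmd -C`. It pulls
out the paths that were opened successfully, so each candidate for a stored
core count can be checked in turn. The function name `opened_files` and the
sample trace lines are made up for illustration; they are not real slurmd
output.

```python
import re

# Matches open()/openat() lines in strace's default output format, e.g.
#   openat(AT_FDCWD, "/proc/cpuinfo", O_RDONLY) = 3
# Group 1 is the path, group 2 the return value (negative on failure).
OPEN_RE = re.compile(r'\bopen(?:at)?\((?:[^,"]+,\s*)?"([^"]+)".*\)\s+=\s+(-?\d+)')


def opened_files(strace_text):
    """Return the paths opened successfully, in first-seen order."""
    seen = []
    for line in strace_text.splitlines():
        m = OPEN_RE.search(line)
        if m and int(m.group(2)) >= 0 and m.group(1) not in seen:
            seen.append(m.group(1))
    return seen


# Made-up sample lines in strace's format (NOT captured from slurmd).
SAMPLE = '''\
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/hypothetical/missing.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/proc/cpuinfo", O_RDONLY) = 3
'''

print(opened_files(SAMPLE))  # ['/etc/ld.so.cache', '/proc/cpuinfo']
```

With a real trace, `opened_files(open("slurmd.trace").read())` would give the
list of files to inspect; failed opens are dropped on purpose, since a file
that could not be opened cannot be the source of the number.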

I'm not cloning existing virtual machines with slurm. I have access to a
vmware system that from time to time isn't running at full capacity; usage
is stable for blocks of a month or two at a time, so my plan was to
spin up a slurm compute node on it and resize it appropriately every few
months (why not put it to work). I started with 10 cores, and it looks
like I can up it to 16 cores for a while, and that's when I ran into the
problem.
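
For checking what the resized VM actually exposes, a small sketch that counts
packages, cores and PUs in `lstopo -p` text output, so the totals can be
compared with what slurmd -C reports. The function name `count_topology` is
hypothetical, and the sample is an abbreviated excerpt of the lstopo output
quoted further down, not the full listing.

```python
import re


def count_topology(lstopo_text):
    """Count Package, Core and PU entries in `lstopo -p` text output."""
    packages = len(re.findall(r'\bPackage P#\d+', lstopo_text))
    cores = len(re.findall(r'\bCore P#\d+', lstopo_text))
    pus = len(re.findall(r'\bPU P#\d+', lstopo_text))
    return packages, cores, pus


# Abbreviated excerpt (2 of the 4 packages, 2 of the 4 cores in each).
SAMPLE = '''\
Machine (63GB)
  Package P#0 + L3 (16MB)
    L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
    L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
  Package P#1 + L3 (16MB)
    L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#4
    L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#5
'''

print(count_topology(SAMPLE))  # (2, 4, 4)
```

Run against the full output, this should give 4 packages, 16 cores and 16 PUs,
which is the number the node is expected to advertise after the resize.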

-mike

Michael Tie, Technical Director
Mathematics, Statistics, and Computer Science

 One North College Street              phn:  507-222-4067
 Northfield, MN 55057                   cel:    952-212-8933
 mtie at carleton.edu                        fax:    507-222-4312

On Wed, Mar 11, 2020 at 1:15 AM Kirill 'kkm' Katsnelson <kkm at pobox.com>
wrote:

> On Tue, Mar 10, 2020 at 1:41 PM mike tie <mtie at carleton.edu> wrote:
>
>> Here is the output of lstopo
>>
>
>> $ lstopo -p
>> Machine (63GB)
>>   Package P#0 + L3 (16MB)
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#2
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#3
>>   Package P#1 + L3 (16MB)
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#4
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#5
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#6
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#7
>>   Package P#2 + L3 (16MB)
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#8
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#9
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#10
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#11
>>   Package P#3 + L3 (16MB)
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#12
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#13
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#14
>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#15
>
> There is obviously no sane way to derive the number 10 from this topology:
> it has a prime factor of 5, but everything in the lstopo output
> is sized in powers of 2 (4 packages, a.k.a. sockets, with 4 single-threaded
> CPU cores each).
>
> I responded yesterday but somehow managed to plop my signature into the
> middle of it, so maybe you missed the inline replies?
>
> It's very, very likely that the number is stored *somewhere*. First to
> eliminate is the hypothesis that the number is acquired from the control
> daemon. That's the simplest step and the largest landgrab in the
> divide-and-conquer analysis plan. Then just look where it comes from on the
> VM. strace(1) will reveal all files slurmd reads.
>
> You are not rolling out the VMs from an image, are you? I'm wondering
> why you need to tweak an existing VM that is already in a weird state.
> Is simply setting its snapshot aside and creating a new one from an image
> hard/impossible? I have not touched VMware for more than 10 years, so I may
> be a bit naive; on the platform I'm working with now (GCE), the
> create-use-drop pattern of VM use is much more common and simpler than
> creating a VM and maintaining it either *ad infinitum* or *ad nauseam*,
> whichever is reached first. But I don't know anything about VMware; maybe
> it's not possible or feasible with it.
>
>  -kkm
>
>