[slurm-users] slurmd -C showing incorrect core count

Kirill 'kkm' Katsnelson kkm at pobox.com
Thu Mar 12 03:50:17 UTC 2020


Yup, I think if you get stuck this badly, the first thing is to make sure the
node does not get the number 10 from the controller, and the second is just to
reimage the VM fresh. It may not be the quickest way, but at least it's
predictable in terms of time spent.
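
Something along these lines should make it obvious which side the 10 is
coming from (the node name "node21" is just a placeholder, and adjust the
slurm.conf path to wherever yours lives):

  # what slurmd detects locally on the compute node
  slurmd -C

  # what the controller currently believes about the node
  scontrol show node node21 | grep -i cpu

  # what the static configuration declares
  grep -i 'nodename=node21' /etc/slurm/slurm.conf

If the slurm.conf entry or scontrol still reports CPUs=10, that is the first
thing to fix before suspecting anything cached on the VM itself.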

Good luck!

 -kkm

On Wed, Mar 11, 2020 at 7:28 AM mike tie <mtie at carleton.edu> wrote:

>
> Yep, slurmd -C is obviously getting the data from somewhere, either a
> local file or from the master node. Hence my email to the group; I was
> hoping that someone would just say: "yeah, modify file xxxx". But oh
> well. I'll start playing with strace and gdb later this week; looking
> through the source might also be helpful.
>
> I'm not cloning existing virtual machines with Slurm. I have access to a
> VMware system that from time to time isn't running at full capacity; usage
> is stable for blocks of a month or two at a time, so my thought/plan was to
> spin up a Slurm compute node on it and resize it appropriately every few
> months (why not put it to work?). I started with 10 cores, and it looks
> like I can up it to 16 cores for a while, and that's when I ran into the
> problem.
>
> -mike
>
>
>
> Michael Tie    Technical Director
> Mathematics, Statistics, and Computer Science
>
>  One North College Street              phn:  507-222-4067
>  Northfield, MN 55057                   cel:    952-212-8933
>  mtie at carleton.edu                        fax:    507-222-4312
>
>
> On Wed, Mar 11, 2020 at 1:15 AM Kirill 'kkm' Katsnelson <kkm at pobox.com>
> wrote:
>
>> On Tue, Mar 10, 2020 at 1:41 PM mike tie <mtie at carleton.edu> wrote:
>>
>>> Here is the output of lstopo
>>>
>>
>>> $ lstopo -p
>>>
>>> Machine (63GB)
>>>
>>>   Package P#0 + L3 (16MB)
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#2
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#3
>>>
>>>   Package P#1 + L3 (16MB)
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#4
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#5
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#6
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#7
>>>
>>>   Package P#2 + L3 (16MB)
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#8
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#9
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#10
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#11
>>>
>>>   Package P#3 + L3 (16MB)
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#12
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#13
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#14
>>>
>>>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#15
>>>
>>
>> There is no sane way to derive the number 10 from this topology,
>> obviously: it has a prime factor of 5, but everything in the lstopo output
>> is sized in powers of 2 (4 packages, a.k.a. sockets, with 4 single-threaded
>> CPU cores each).
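>>
>> To spell the arithmetic out: 4 packages x 4 cores x 1 thread per core =
>> 16 PUs, while any Sockets*Cores*Threads combination that yields 10 needs
>> a factor of 5 somewhere (2 x 5 x 1, say), and there is no 5 anywhere in
>> that topology.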
>>
>> I responded yesterday but somehow managed to plop my signature into the
>> middle of it, so maybe you missed the inline replies?
>>
>> It's very, very likely that the number is stored *somewhere*. The first
>> hypothesis to eliminate is that the number is acquired from the control
>> daemon. That's the simplest step and the largest land grab in the
>> divide-and-conquer analysis plan. If that is ruled out, just look at where
>> it comes from on the VM: strace(1) will reveal all the files slurmd reads.
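>>
>> A minimal sketch of that check (flags from memory, adjust to taste):
>>
>>   # run the hardware probe under strace, follow children, and log
>>   # every file it opens
>>   strace -f -e trace=open,openat -o /tmp/slurmd-C.trace slurmd -C
>>
>>   # then look for anything that smells like config or cached state
>>   grep -Ei 'conf|state|spool' /tmp/slurmd-C.trace
>>
>> If the bogus 10 is read from a local file at all, that file should show
>> up in the trace.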
>>
>> You are not rolling out the VMs from an image, are you? I'm wondering
>> why you need to tweak an existing VM that is already in a weird state.
>> Is simply setting its snapshot aside and creating a new one from an image
>> hard/impossible? I have not touched VMware for more than 10 years, so I
>> may be a bit naive; on the platform I'm working with now (GCE), the
>> create-use-drop pattern of VM use is much more common and simpler than
>> creating a VM and maintaining it *ad infinitum* or *ad nauseam*, whichever
>> is reached first. But I don't know anything about VMware; maybe that's
>> not possible or feasible with it.
>>
>>  -kkm
>>
>>