[slurm-users] slurmd -C showing incorrect core count

Kirill 'kkm' Katsnelson kkm at pobox.com
Tue Mar 10 10:05:44 UTC 2020


Yes, it's odd.


 -kkm

On Mon, Mar 9, 2020 at 7:44 AM mike tie <mtie at carleton.edu> wrote:

>
> Interesting.  I'm still confused about where slurmd -C is getting the
> data.  When I think of where the kernel stores info about the processor, I
> normally think of /proc/cpuinfo. (By the way, I am running CentOS 7 in the
> VM; the hypervisor is VMware.)  /proc/cpuinfo does show 16 cores.
>

AFAIK, the topology can be queried from /sys/devices/system/node/node*/
<https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html> and
/sys/devices/system/cpu/cpu*/topology.
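
For example, both can be read directly from the shell (paths as in the
kernel docs above; the second line uses a bash-style brace expansion):

  # CPUs belonging to each NUMA node
  cat /sys/devices/system/node/node*/cpulist
  # socket and core IDs the kernel assigns to each logical CPU
  grep . /sys/devices/system/cpu/cpu*/topology/{physical_package_id,core_id}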

Whether or not Slurm in fact gets the topology from there, I do not know.
The build has dependencies on both libhwloc and libnuma--that's a clue.
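
If the userspace tools for those two libraries happen to be installed on
the node, a quick sanity check is to compare their view with /proc/cpuinfo:

  # hwloc's view of the machine (hwloc package)
  lstopo-no-graphics
  # libnuma's view (numactl package)
  numactl --hardware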


> I understand your concern over the processor speed.  So I tried a
> different VM, where I see the following specs:
>

It's not so much the speed per se as the way the hypervisor has finely
chopped the 16 virtual CPUs into 4 sockets with no hyperthreads. That makes
no sense at all. I have a hunch that the other VM (the one that reports the
CPU correctly) would rather put them all into a single socket, at least by
default. But yeah, that still does not answer the question of where the
number 10 is popping up from.
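
If you want to see exactly how the hypervisor presents the topology to the
guest, lscpu breaks it down by socket/core/thread; made-up output below for
the 4-socket, no-hyperthread layout you describe:

  $ lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'
  Thread(s) per core:  1
  Core(s) per socket:  4
  Socket(s):           4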


> When I increase the core count on that VM, reboot, and run slurmd -C, it
> too continues to show the original, lower core count.
>

Most likely it's stored somewhere on disk.


> Specifically, how is slurmd -C getting that info?  Maybe this is a kernel
> issue, but other than lscpu and /proc/cpuinfo, I don't know where to look.
>

I would not bet on a kernel bug even at 100-to-1 odds. The number most
likely comes either from some stray config file or from a cache on disk. I
don't know whether slurmd keeps any cache; I've never had to look (all my
nodes are virtual, created and deleted on demand, so they always start
fresh), but if it does, it's somewhere under /var/lib/slurm*.
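
The actual locations are configurable, so the surest way to find them on
your install is to ask for SlurmdSpoolDir and StateSaveLocation (the
/etc/slurm/slurm.conf path below is only the usual default):

  scontrol show config | grep -i -E 'SlurmdSpoolDir|StateSaveLocation'
  grep -i -E 'SlurmdSpoolDir|StateSaveLocation' /etc/slurm/slurm.conf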

I thought (possibly incorrectly) that the -C switch reports the node size
and CPU configuration without even looking at the config files. I would
first check whether it talks to the controller at all (tweak, e.g., the
port number in slurm.conf) and, if it does, what the current slurmctld's
idea of this node is (scontrol show node=<node>, IIRC, or something very
much like that).
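
Something along these lines, with your node name substituted; the CPUTot,
Sockets, CoresPerSocket and ThreadsPerCore fields are the interesting ones:

  scontrol show node <nodename>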


>   Maybe I should be looking at the slurmd source?
>

slurmd should be much simpler than slurmctld, and the -C query must be a
straightforward, entirely synchronous operation. But reading the sources is
quite time-consuming, so I would venture into that only as a last resort.
Since -C does not fork, it should be easy to run under gdb. YMMV, of course.
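
For instance (strace is my addition here, nothing Slurm-specific, but
tracing which files the process opens is often the fastest way to see where
a stale number is read from):

  # watch which files slurmd -C opens; assumes strace is on the node
  strace -f -e trace=open,openat slurmd -C 2>&1 | grep -E '/sys/|/proc/|slurm'
  # or step through it interactively
  gdb --args slurmd -C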

 -kkm