<div dir="ltr"><div dir="ltr"><div>Yes, it's odd. <br></div><div><br></div><div><br></div><div> -kkm</div><div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Mar 9, 2020 at 7:44 AM mike tie <<a href="mailto:mtie@carleton.edu">mtie@carleton.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br></div>Interesting. I'm still confused about where slurmd -C is getting the data. When I think of where the kernel stores info about the processor, I normally think of /proc/cpuinfo. (By the way, I am running CentOS 7 in the VM. The VM hypervisor is VMware.) /proc/cpuinfo does show 16 cores. </div></blockquote><div><br></div><div>
<div>AFAIK, the topology can be queried from /sys/devices/system/node/node*/ <<a href="https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html">https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html</a>> and
/sys/devices/system/cpu/cpu*/topology.<br></div><div><br></div><div>Whether or not Slurm in fact gets the topology from there, I do not know. The build has dependencies on both libhwloc and libnuma--that's a clue.<br></div>
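<div><br></div><div>For a quick cross-check of what the kernel itself exposes there, here is a minimal Python sketch that counts sockets, cores and logical CPUs from the physical_package_id and core_id files under /sys/devices/system/cpu/cpu*/topology. It is only an illustration and not what slurmd actually does (that may well go through libhwloc); the function name read_cpu_topology and its sysfs_root parameter are made up for this example.<br></div>

```python
import glob
import os
import re


def read_cpu_topology(sysfs_root="/sys/devices/system/cpu"):
    """Return (sockets, cores, logical_cpus) as seen in sysfs topology files.

    Hypothetical helper for illustration; sysfs_root is a parameter only so
    the function can be pointed at a test directory.
    """
    packages = set()
    cores = set()
    logical = 0
    for cpu_dir in glob.glob(os.path.join(sysfs_root, "cpu[0-9]*")):
        # Keep only cpuN directories (skips cpufreq, cpuidle and the like).
        if not re.fullmatch(r"cpu\d+", os.path.basename(cpu_dir)):
            continue
        topo = os.path.join(cpu_dir, "topology")
        try:
            with open(os.path.join(topo, "physical_package_id")) as f:
                package_id = int(f.read().strip())
            with open(os.path.join(topo, "core_id")) as f:
                core_id = int(f.read().strip())
        except (OSError, ValueError):
            # Offline CPUs may have no readable topology files; skip them.
            continue
        packages.add(package_id)
        # core_id is only unique within a package, so key on the pair.
        cores.add((package_id, core_id))
        logical += 1
    return len(packages), len(cores), logical


if __name__ == "__main__":
    sockets, cores, cpus = read_cpu_topology()
    print(f"sockets={sockets} cores={cores} logical_cpus={cpus}")
```

<div>If these numbers agree with lscpu but not with slurmd -C, the discrepancy is probably introduced somewhere above the kernel.<br></div>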
</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div></div><div>I understand your concern over the processor speed. So I tried a different vm where I see the following specs:</div></div></blockquote><div><br></div><div>It's not even so much its speed per se, but rather the way the hypervisor has finely chopped the 16 virtual CPUs into 4 sockets without hyperthreads. It makes no sense at all. I have a hunch that the other VM (the one that reports the correct CPU) should rather put them into a single socket, at least by default. But yeah, it does not answer the question of where the number 10 is coming from.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>When I increase the core count on that vm, reboot, and run slurmd -C, it too continues to show the lower original core count.</div></div></blockquote><div><br></div><div>Most likely it's stored somewhere on disk.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div></div><div>Specifically, how is slurmd -C getting that info? Maybe this is a kernel issue, but other than lscpu and /proc/cpuinfo, I don't know where to look.</div></div></blockquote><div><br></div><div>
<div>I would not bet 1 to 100 on a kernel bug. The number is most likely to come from either some stray config file, or a cache on disk. I don't know if slurmd stores any cache, never had to look (all my nodes are virtual and created and deleted on demand, thus always start fresh), but if it does, it's somewhere under /var/lib/slurm*.</div><div><br></div><div>
I thought (possibly incorrectly) that the -C switch reports the node size and CPU configuration without even looking at config files.
I would first check whether it talks to the controller at all (e.g. by tweaking the port number in slurm.conf), and, if it does, what slurmctld currently thinks about this node (scontrol show node=&lt;node&gt;, IIRC, or something very much like that).<br></div><div></div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div> Maybe I should be looking at the slurmd source?</div></div></blockquote><div><br></div><div>slurmd should be much simpler than slurmctld, and the -C query must be a straightforward, very synchronous operation. But reading sources is quite time-consuming, so I would venture into it only as a last resort. Since -C does not fork, it should be easy to run it under gdb. YMMV, of course.</div><div><br></div><div> -kkm</div><br></div></div>