[slurm-users] node showing "Low socket*core count"

Eli V eliventer at gmail.com
Wed Oct 10 09:40:59 MDT 2018


I don't think you need CPUs in slurm.conf for the node definition; just
Sockets=4 CoresPerSocket=4 ThreadsPerCore=1, for example, and slurmctld
does the math for the number of CPUs. Also, running slurmd -C on the
nodes is very useful to see what's being autodetected.
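For example, a definition for the 16-CPU machine in the quoted message
might look something like this (just a sketch built from the numbers
quoted below, with CPUs= left out so slurmctld derives it):

  NodeName=compute-0-0 NodeAddr=10.1.255.253 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 Weight=20495900 Feature=rack-0,16CPUs

And to see what slurmd detects, run this on the node itself:

  slurmd -C

It prints the detected hardware in slurm.conf syntax (roughly
NodeName=... CPUs=... Boards=... SocketsPerBoard=... CoresPerSocket=...
ThreadsPerCore=... RealMemory=...), which you can compare against, or
paste into, your node definition.
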
On Wed, Oct 10, 2018 at 11:34 AM Noam Bernstein
<noam.bernstein at nrl.navy.mil> wrote:
>
> Hi all - I’m new to slurm, and in many ways it’s been very nice to work with, but I’m having an issue trying to properly set up thread/core/socket counts on nodes.  Basically, if I don’t specify anything except CPUs, the node is available, but doesn’t appear to know about cores and hyperthreading.  If I do try to specify that info it claims that the numbers aren’t consistent and sets the node to drain.
>
> This is all on CentOS 7, slurm 18.08, and FastSchedule is set to 0.
>
> First type of node, 2 x 8 core CPUs, hyperthreading on, nothing specified in slurm.conf except CPUs.  /proc/cpuinfo confirms that there are 32 “cpus”, with the expected values for physical id and core id.
>
> From slurm.conf
>
> NodeName=compute-2-0 NodeAddr=10.1.255.250 CPUs=32 Weight=20511700 Feature=rack-2,32CPUs
>
> from scontrol show node
>
> NodeName=compute-2-0 Arch=x86_64 CoresPerSocket=1
>    CPUAlloc=0 CPUTot=32 CPULoad=0.04
>    AvailableFeatures=rack-2,32CPUs
>    ActiveFeatures=rack-2,32CPUs
>    Gres=(null)
>    NodeAddr=10.1.255.250 NodeHostName=compute-2-0 Version=18.08
>    OS=Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
>    RealMemory=257742 AllocMem=0 FreeMem=255703 Sockets=32 Boards=1
>    State=IDLE ThreadsPerCore=1 TmpDisk=913567 Weight=20511700 Owner=N/A MCS_label=N/A
>    Partitions=CLUSTER,WHEEL,n2013f
>    BootTime=2018-10-10T11:06:42 SlurmdStartTime=2018-10-10T11:07:16
>    CfgTRES=cpu=32,mem=257742M,billing=94
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
> Second type of node, 2 x 4 core CPUs, hyperthreading on.  /proc/cpuinfo confirms that there are 16 “cpus”, with the expected values for physical id and core id.
>
> If I set the numbers of sockets/cores/threads as I think is correct (note that this is a different type of machine from the previous one),
>
> NodeName=compute-0-0 NodeAddr=10.1.255.253 CPUs=16 Weight=20495900 Feature=rack-0,16CPUs Sockets=2 CoresPerSocket=4 ThreadsPerCore=2
>
> I get the following
>
> NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=4
>    CPUAlloc=0 CPUTot=16 CPULoad=0.42
>    AvailableFeatures=rack-0,16CPUs
>    ActiveFeatures=rack-0,16CPUs
>    Gres=(null)
>    NodeAddr=10.1.255.253 NodeHostName=compute-0-0 Version=18.08
>    OS=Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
>    RealMemory=11842 AllocMem=0 FreeMem=11335 Sockets=2 Boards=1
>    State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=275125 Weight=20495900 Owner=N/A MCS_label=N/A
>    Partitions=CLUSTER,WHEEL,ib_qdr
>    BootTime=2018-10-10T11:06:55 SlurmdStartTime=2018-10-10T11:07:34
>    CfgTRES=cpu=16,mem=11842M,billing=18
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Low socket*core count [root at 2018-10-10T10:26:14]
>
>
> I feel like there are a couple of things that are suspicious, but I'm not sure:
> 1. I get the impression that slurm is supposed to be able to automatically figure out the architecture of the node, but in the first example there’s no evidence of that in the scontrol output.
> 2. When I set the various architecture-related parameters, it claims that the numbers don't match, even though sockets*cores*threads = 2*4*2 = 16 = CPUs.
>
> Does anyone have any idea as to what’s going on, or what other information would be useful for debugging?
>
>
>
> thanks,
> Noam


