[slurm-users] node showing "Low socket*core count"

Noam Bernstein noam.bernstein at nrl.navy.mil
Wed Oct 10 09:27:31 MDT 2018


Hi all - I’m new to Slurm, and in many ways it’s been very nice to work with, but I’m having an issue trying to properly set up thread/core/socket counts on nodes.  Basically, if I don’t specify anything except CPUs, the node is available, but Slurm doesn’t appear to know about its cores and hyperthreading.  If I do try to specify that info, Slurm claims that the numbers aren’t consistent and sets the node to drain.

This is all on CentOS 7 with Slurm 18.08, and FastSchedule is set to 0.
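For reference, the relevant line in slurm.conf is literally just

FastSchedule=0

and my understanding (please correct me if I’m wrong) is that with FastSchedule=0 the controller should use the hardware configuration that slurmd actually detects on each node, rather than whatever is listed in slurm.conf.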

First type of node, 2 x 8 core CPUs, hyperthreading on, nothing specified in slurm.conf except CPUs.  /proc/cpuinfo confirms that there are 32 “cpus”, with the expected values for physical id and core id.
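In case the exact check matters, this is roughly what I ran against /proc/cpuinfo (a sketch rather than a transcript), and the counts came out as expected for 2 sockets x 8 cores x 2 threads:

grep -c '^processor' /proc/cpuinfo                    # logical CPUs, expect 32
grep '^physical id' /proc/cpuinfo | sort -u | wc -l   # sockets, expect 2
awk '/^physical id/{p=$4} /^core id/{print p,$4}' /proc/cpuinfo | sort -u | wc -l   # physical cores, expect 16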

From slurm.conf
NodeName=compute-2-0 NodeAddr=10.1.255.250 CPUs=32 Weight=20511700 Feature=rack-2,32CPUs

From scontrol show node
NodeName=compute-2-0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=32 CPULoad=0.04
   AvailableFeatures=rack-2,32CPUs
   ActiveFeatures=rack-2,32CPUs
   Gres=(null)
   NodeAddr=10.1.255.250 NodeHostName=compute-2-0 Version=18.08
   OS=Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
   RealMemory=257742 AllocMem=0 FreeMem=255703 Sockets=32 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=913567 Weight=20511700 Owner=N/A MCS_label=N/A
   Partitions=CLUSTER,WHEEL,n2013f
   BootTime=2018-10-10T11:06:42 SlurmdStartTime=2018-10-10T11:07:16
   CfgTRES=cpu=32,mem=257742M,billing=94
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


Second type of node, 2 x 4 core CPUs, hyperthreading on.  /proc/cpuinfo confirms that there are 16 “cpus”, with the expected values for physical id and core id.
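(lscpu should summarize the same thing on this box; the numbers in the comment below are what I’d expect from the hardware, not a paste of actual output:)

lscpu | egrep '^CPU\(s\)|Thread|Core|Socket'
# expected: CPU(s): 16, Thread(s) per core: 2, Core(s) per socket: 4, Socket(s): 2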

If I set the numbers of sockets/cores/threads to what I think is correct (note that this is a different type of machine from the previous one),
NodeName=compute-0-0 NodeAddr=10.1.255.253 CPUs=16 Weight=20495900 Feature=rack-0,16CPUs Sockets=2 CoresPerSocket=4 ThreadsPerCore=2
I get the following
NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=16 CPULoad=0.42
   AvailableFeatures=rack-0,16CPUs
   ActiveFeatures=rack-0,16CPUs
   Gres=(null)
   NodeAddr=10.1.255.253 NodeHostName=compute-0-0 Version=18.08
   OS=Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
   RealMemory=11842 AllocMem=0 FreeMem=11335 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=275125 Weight=20495900 Owner=N/A MCS_label=N/A
   Partitions=CLUSTER,WHEEL,ib_qdr
   BootTime=2018-10-10T11:06:55 SlurmdStartTime=2018-10-10T11:07:34
   CfgTRES=cpu=16,mem=11842M,billing=18
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low socket*core count [root at 2018-10-10T10:26:14]

I feel like there are a couple of things here that are suspicious, but I’m not sure:
1. I get the impression that Slurm is supposed to be able to figure out the architecture of the node automatically, but in the first example there’s no evidence of that in the scontrol output: it reports Sockets=32, CoresPerSocket=1, ThreadsPerCore=1 (see the slurmd -C note below the list).
2. When I do set the various architecture-related parameters, Slurm claims that the numbers don’t match, even though sockets*cores*threads = 2*4*2 = 16 = CPUs.
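My understanding is that slurmd -C prints the node configuration that slurmd itself detects, in slurm.conf format, so that might be the right thing to compare against.  On the second node type I’d expect it to report something along these lines (the values here are my guess from the hardware, not actual output):

slurmd -C
# expected (guess, not a paste):
# NodeName=compute-0-0 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=11842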

Does anyone have any idea as to what’s going on, or what other information would be useful for debugging?



										thanks,
										Noam