[slurm-users] Trouble disabling core specialization

Guertin, David S. guertin at middlebury.edu
Thu Jun 27 13:23:57 UTC 2019


Hello all,


I'm trying to turn off core specialization on my cluster by setting CoreSpecCount=0, but checking with scontrol does not show the change. If I set CoreSpecCount=1, or CoreSpecCount=2, or anything except 0, the change is applied correctly. But when I set it to 0, nothing is applied -- the node stays at whatever the previous value was.
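
For reference, the node definition in slurm.conf looks roughly like the following (retyped from memory, so the exact line may differ slightly); the only thing I change between tests is the CoreSpecCount value:

---------------------------------------
NodeName=node016 Boards=1 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=95306 CoreSpecCount=0
---------------------------------------

After editing the file I push the change out with "scontrol reconfigure", which is what produces the "got reconfigure request" lines in the slurmd output further below.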

with CoreSpecCount=1:

---------------------------------------
# scontrol show node node016
NodeName=node016 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=0 CPUTot=72 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node016 NodeHostName=node016
   OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
   RealMemory=95306 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   CoreSpecCount=1 CPUSpecList=70-71
   State=IDLE ThreadsPerCore=2 TmpDisk=2038 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=test
   BootTime=2019-06-19T08:41:49 SlurmdStartTime=2019-06-27T09:06:26
   CfgTRES=cpu=72,mem=95306M,billing=72
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
---------------------------------------

That is correct.

with CoreSpecCount=0:

---------------------------------------
# scontrol show node node016
NodeName=node016 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=0 CPUTot=72 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node016 NodeHostName=node016
   OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
   RealMemory=95306 AllocMem=0 FreeMem=92773 Sockets=2 Boards=1
   CoreSpecCount=1 CPUSpecList=70-71
   State=IDLE ThreadsPerCore=2 TmpDisk=2038 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=test
   BootTime=2019-06-19T08:41:49 SlurmdStartTime=2019-06-27T09:06:26
   CfgTRES=cpu=72,mem=95306M,billing=72
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
---------------------------------------

That is wrong. The output is exactly the same -- CoreSpecCount still shows 1.

The weird thing is that if I run slurmd in the foreground in verbose mode on the node with "slurmd -cDvvf /etc/slurm/slurm.conf", the change appears to be recognized.

Results with CoreSpecCount=1:

---------------------------------------
slurmd: got reconfigure request
slurmd: all threads complete
slurmd: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
slurmd: debug:  Ignoring obsolete CacheGroups option.
slurmd: debug:  Log file re-opened
slurmd: debug:  CPUs:72 Boards:1 Sockets:2 CoresPerSocket:18 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmd: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmd: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
slurmd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm/system' already exists
slurmd: debug:  system cgroup: system cpuset cgroup initialized
slurmd: Resource spec: Reserved abstract CPU IDs: 70-71
slurmd: Resource spec: Reserved machine CPU IDs: 35,71
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
---------------------------------------

Results with CoreSpecCount=0:

---------------------------------------
slurmd: got reconfigure request
slurmd: all threads complete
slurmd: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
slurmd: debug:  Ignoring obsolete CacheGroups option.
slurmd: debug:  Log file re-opened
slurmd: debug:  CPUs:72 Boards:1 Sockets:2 CoresPerSocket:18 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
---------------------------------------

The reserved CPUs have been removed, as they should be. So why does scontrol still show the stale value, and why do jobs still not run on those cores?

Dave


David Guertin

Information Technology Services
Middlebury College
700 Exchange St.
Middlebury, VT 05753
(802)443-3143