[slurm-users] CPUSpecList confusion
Marcus Wagner
wagner at itc.rwth-aachen.de
Thu Dec 15 06:24:14 UTC 2022
Hi Paul,
Since Slurm uses hwloc, I looked into these tools a bit more deeply.
Using your script, I saw e.g. the following output on one node:
=== 31495434
CPU_IDs=21-23,25
21-23,25
=== 31495433
CPU_IDs=16-18,20
10-11,15,17
=== 31487399
CPU_IDs=15
9
That does not match your schemes and at first sight looks rather random.
It seems Slurm uses hwloc's logical indices, whereas the cgroups use the OS/physical indices.
Compare the example above with this excerpt of the full hwloc-ls output:
NUMANode L#1 (P#1 47GB)
L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#3)
L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#4)
L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#5)
L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#9)
L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#10)
L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#11)
L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#15)
L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#16)
L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#17)
L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
That does seem to match.
In short, to get the mapping, one can use:
$> hwloc-ls --only pu
...
PU L#10 (P#19)
PU L#11 (P#20)
PU L#12 (P#3)
PU L#13 (P#4)
PU L#14 (P#5)
PU L#15 (P#9)
PU L#16 (P#10)
PU L#17 (P#11)
PU L#18 (P#15)
PU L#19 (P#16)
PU L#20 (P#17)
PU L#21 (P#21)
PU L#22 (P#22)
PU L#23 (P#23)
...
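To automate the translation, one can ask hwloc-calc for the physical (OS)
index of each logical PU. A rough sketch, assuming hwloc-calc from the
hwloc package, ThreadsPerCore=1 (one PU per core, so Slurm's abstract CPU
IDs line up with hwloc's logical PU indices), and GNU coreutils;
slurm_to_os is just a made-up helper name:

# Print the hwloc logical -> physical (OS) PU mapping of this node.
n=$(nproc)
for (( l = 0; l < n; l++ )); do
    # --po prints physical/OS indexes, -I pu intersects with PU objects
    echo "L#$l -> P#$(hwloc-calc --po -I pu pu:$l)"
done

# Expand a Slurm CPU_IDs list (e.g. "16-18,20") and map each logical
# index to its OS index, i.e. the numbering cpuset.effective_cpus uses.
slurm_to_os () {
    echo "$1" | tr ',' '\n' \
        | while IFS=- read -r a b; do seq "$a" "${b:-$a}"; done \
        | while read -r l; do hwloc-calc --po -I pu "pu:$l"; done \
        | paste -sd, -
}

On the node above, slurm_to_os "16-18,20" should print 10,11,15,17, which
is exactly what the cgroup of job 31495433 reports.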
Best
Marcus
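P.S.: As a cross-check, one could combine your cgroup script with that
mapping and compare all jobs on a node in one go. A sketch (cgroup v1
layout as in your script; slurm_to_os is the toy helper from above):

cd /sys/fs/cgroup/cpuset/slurm
for d in $(find . -name 'job_*') ; do
    j=${d##*_}
    ids=$(scontrol -d show job $j | grep -o 'CPU_IDs=[0-9,-]*' | cut -d= -f2)
    echo "=== $j slurm: $ids -> os: $(slurm_to_os "$ids")"
    echo "cgroup: $(cat $d/cpuset.effective_cpus)"
done

For single-node jobs the two lists should agree if the logical-to-physical
explanation is right.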
On 14.12.2022 at 18:11, Paul Raines wrote:
> Ugh. Guess I cannot count. The mapping on that last node DOES work with the "alternating" scheme where we have
>
> 0 0
> 1 2
> 2 4
> 3 6
> 4 8
> 5 10
> 6 12
> 7 14
> 8 16
> 9 18
> 10 20
> 11 22
> 12 1
> 13 3
> 14 5
> 15 7
> 16 9
> 17 11
> 18 13
> 19 15
> 20 17
> 21 19
> 22 21
> 23 23
>
> so CPU_IDs=8-11,20-23 does correspond to cgroup 16-23
>
> Using the script
>
> cd /sys/fs/cgroup/cpuset/slurm
> for d in $(find . -name 'job*') ; do
>     j=$(echo $d | cut -d_ -f3)
>     echo === $j
>     scontrol -d show job $j | grep CPU_ID | cut -d' ' -f7
>     cat $d/cpuset.effective_cpus
> done
>
> === 1967214
> CPU_IDs=8-11,20-23
> 16-23
> === 1960208
> CPU_IDs=12-19
> 1,3,5,7,9,11,13,15
> === 1966815
> CPU_IDs=0
> 0
> === 1966821
> CPU_IDs=6
> 12
> === 1966818
> CPU_IDs=3
> 6
> === 1966816
> CPU_IDs=1
> 2
> === 1966822
> CPU_IDs=7
> 14
> === 1966820
> CPU_IDs=5
> 10
> === 1966819
> CPU_IDs=4
> 8
> === 1966817
> CPU_IDs=2
> 4
>
> On all my nodes I see just two schemes: the alternating odd/even one above, and one that does not alternate, like on this box with
>
> CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1
>
> === 1966495
> CPU_IDs=0-2
> 0-2
> === 1966498
> CPU_IDs=10-12
> 10-12
> === 1966502
> CPU_IDs=26-28
> 26-28
> === 1960064
> CPU_IDs=7-9,13-25
> 7-9,13-25
> === 1954480
> CPU_IDs=3-6
> 3-6
>
>
> On Wed, 14 Dec 2022 9:42am, Paul Raines wrote:
>
>>
>> Yes, I see that on some of my other machines too. So apicid is definitely not what SLURM is using but somehow just lines up that way on this one machine I have.
>>
>> I think the issue is that cgroups counts, starting at 0, all the cores
>> on the first socket, then all the cores on the second socket. But SLURM
>> (on a two-socket box) counts 0 as the first core on the first socket, 1
>> as the first core on the second socket, 2 as the second core on the
>> first socket, 3 as the second core on the second socket, and so on.
>> (Looks like I am wrong: see below.)
>>
>> Why SLURM does this instead of just using the assignments cgroups uses,
>> I have no idea. Hopefully one of the SLURM developers reads this and
>> can explain.
>>
>> Looking at another SLURM node I have (where cgroups v1 is still in use
>> and HT turned off) with definition
>>
>> CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1
>>
>> I find
>>
>> [root at r440-17 ~]# egrep '^(apicid|proc)' /proc/cpuinfo | tail -4
>> processor : 22
>> apicid : 22
>> processor : 23
>> apicid : 54
>>
>> So apicids are NOT going to work.
>>
>> # scontrol -d show job 1966817 | grep CPU_ID
>> Nodes=r17 CPU_IDs=2 Mem=16384 GRES=
>> # cat /sys/fs/cgroup/cpuset/slurm/uid_3776056/job_1966817/cpuset.cpus
>> 4
>>
>> If Slurm has '2', this should be the second core on the first socket and so should be '1' in cgroups, but it is 4, as we see above, which is the fifth core on the first socket. So I guess I was wrong above.
>>
>> But in /proc/cpuinfo the apicid for processor 4 is 2!!! So is apicid
>> right after all? Nope, on the same machine I have
>>
>> # scontrol -d show job 1960208 | grep CPU_ID
>> Nodes=r17 CPU_IDs=12-19 Mem=51200 GRES=
>> # cat /sys/fs/cgroup/cpuset/slurm/uid_5164679/job_1960208/cpuset.cpus
>> 1,3,5,7,9,11,13,15
>>
>> and in /proc/cpuinfo the apicid for processor 12 is 16
>>
>> # scontrol -d show job 1967214 | grep CPU_ID
>> Nodes=r17 CPU_IDs=8-11,20-23 Mem=51200 GRES=
>> # cat /sys/fs/cgroup/cpuset/slurm/uid_5164679/job_1967214/cpuset.cpus
>> 16-23
>>
>> I am totally lost now. Seems totally random. SLURM devs? Any insight?
>>
>>
>> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>
>>
>>
>> On Wed, 14 Dec 2022 1:33am, Marcus Wagner wrote:
>>
>>> Hi Paul,
>>>
>>> sorry to say, but that has to be some coincidence on your system. I've
>>> never seen Slurm report core numbers higher than the total number of
>>> cores.
>>>
>>> I have e.g. an Intel Platinum 8160 here: 24 cores per socket, no
>>> HyperThreading activated.
>>> Yet here the last lines of /proc/cpuinfo:
>>>
>>> processor : 43
>>> apicid : 114
>>> processor : 44
>>> apicid : 116
>>> processor : 45
>>> apicid : 118
>>> processor : 46
>>> apicid : 120
>>> processor : 47
>>> apicid : 122
>>>
>>> I've never seen Slurm report core numbers > 96 for a job.
>>> Nonetheless, I agree, the cores reported by Slurm mostly have nothing
>>> to do with the cores reported e.g. by cgroups.
>>> Since Slurm creates the cgroups, I wonder why it reports some kind of
>>> abstract core ID; it should know which cores are used, as it creates
>>> the cgroups for the jobs.
>>>
>>> Best
>>> Marcus
>>>
>>> On 13.12.2022 at 16:39, Paul Raines wrote:
>>>>
>>>> Yes, looks like SLURM is using the apicid that is in /proc/cpuinfo.
>>>> The first 14 cpus in /proc/cpuinfo (processors 0-13) have apicid
>>>> 0,2,4,6,8,10,12,14,16,20,22,24,26,28 in /proc/cpuinfo
>>>>
>>>> So after setting CpuSpecList=0,2,4,6,8,10,12,14,16,18,20,22,24,26
>>>> in slurm.conf it appears to be doing what I want
>>>>
>>>> $ echo $SLURM_JOB_ID
>>>> 9
>>>> $ grep -i ^cpu /proc/self/status
>>>> Cpus_allowed: 000f0000,000f0000
>>>> Cpus_allowed_list: 16-19,48-51
>>>> $ scontrol -d show job 9 | grep CPU_ID
>>>> Nodes=larkin CPU_IDs=32-39 Mem=25600 GRES=
>>>>
>>>> apicid=32 is processor=16 and apicid=33 is processor=48 in /proc/cpuinfo
>>>>
>>>> Thanks
>>>>
>>>> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>>>
>>>>
>>>>
>>>> On Tue, 13 Dec 2022 9:52am, Sean Maxwell wrote:
>>>>
>>>>> In the slurm.conf manual they state the CpuSpecList IDs are
>>>>> "abstract", and in the CPU management docs they reinforce the notion
>>>>> that the abstract Slurm IDs are not related to the Linux hardware
>>>>> IDs, so that is probably the source of the behavior. I unfortunately
>>>>> don't have more information.
>>>>>
>>>>> On Tue, Dec 13, 2022 at 9:45 AM Paul Raines
>>>>> <raines at nmr.mgh.harvard.edu>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hmm. Actually looks like confusion between CPU IDs on the system
>>>>>> and what SLURM thinks the IDs are:
>>>>>>
>>>>>> # scontrol -d show job 8
>>>>>> ...
>>>>>> Nodes=foobar CPU_IDs=14-21 Mem=25600 GRES=
>>>>>> ...
>>>>>>
>>>>>> # cat
>>>>>> /sys/fs/cgroup/system.slice/slurmstepd.scope/job_8/cpuset.cpus.effective
>>>>>> 7-10,39-42
>>>>>>
>>>>>>
>>>>>> -- Paul Raines
>>>>>> (http://help.nmr.mgh.harvard.edu)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 13 Dec 2022 9:40am, Paul Raines wrote:
>>>>>>
>>>>>> > Oh but that does explain the CfgTRES=cpu=14. With the CpuSpecList
>>>>>> > below and SlurmdOffSpec I do get CfgTRES=cpu=50, so that makes sense.
>>>>>> >
>>>>>> > The issue remains that though the number of cpus in CpuSpecList
>>>>>> > is taken into account, the exact IDs seem to be ignored.
>>>>>> >
>>>>>> > -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>>>>> >
>>>>>> > On Tue, 13 Dec 2022 9:34am, Paul Raines wrote:
>>>>>> >
>>>>>> >> I have tried it both ways with the same result. The assigned CPUs
>>>>>> >> will be both in and out of the range given to CpuSpecList.
>>>>>> >>
>>>>>> >> I tried setting using commas instead of ranges, so used
>>>>>> >>
>>>>>> >> CpuSpecList=0,1,2,3,4,5,6,7,8,9,10,11,12,13
>>>>>> >>
>>>>>> >> But it still does not work:
>>>>>> >>
>>>>>> >> $ srun -p basic -N 1 --ntasks-per-node=1 --mem=25G \
>>>>>> >>     --time=10:00:00 --cpus-per-task=8 --pty /bin/bash
>>>>>> >> $ grep -i ^cpu /proc/self/status
>>>>>> >> Cpus_allowed:      00000780,00000780
>>>>>> >> Cpus_allowed_list: 7-10,39-42
>>>>>> >>
>>>>>> >> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>>>>> >>
>>>>>> >> On Mon, 12 Dec 2022 10:21am, Sean Maxwell wrote:
>>>>>> >>
>>>>>> >>> Hi Paul,
>>>>>> >>>
>>>>>> >>>> Nodename=foobar \
>>>>>> >>>>     CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 \
>>>>>> >>>>     RealMemory=256312 MemSpecLimit=32768 CpuSpecList=14-63 \
>>>>>> >>>>     TmpDisk=6000000 Gres=gpu:nvidia_rtx_a6000:1
>>>>>> >>>>
>>>>>> >>>> The slurm.conf also has:
>>>>>> >>>>
>>>>>> >>>> ProctrackType=proctrack/cgroup
>>>>>> >>>> TaskPlugin=task/affinity,task/cgroup
>>>>>> >>>> TaskPluginParam=Cores,SlurmdOffSpec,Verbose
>>>>>> >>>
>>>>>> >>> Doesn't setting SlurmdOffSpec tell Slurmd that it should NOT use
>>>>>> >>> the CPUs in the spec list?
>>>>>> >>> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdOffSpec)
>>>>>> >>> In this case, I believe it uses what is left, which is the 0-13.
>>>>>> >>> We are just starting to work on this ourselves, and were looking
>>>>>> >>> at this setting.
>>>>>> >>>
>>>>>> >>> Best,
>>>>>> >>>
>>>>>> >>> -Sean
--
Dipl.-Inf. Marcus Wagner
IT Center
Group: Server, Storage, HPC
Department: Systems and Operations
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de
Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ