[slurm-users] CPUSpecList confusion
Marcus Wagner
wagner at itc.rwth-aachen.de
Thu Dec 15 06:24:14 UTC 2022
Hi Paul,
Since Slurm uses hwloc, I looked into these tools a bit more deeply.
Using your script, I saw e.g. the following output on one node:
=== 31495434
CPU_IDs=21-23,25
21-23,25
=== 31495433
CPU_IDs=16-18,20
10-11,15,17
=== 31487399
CPU_IDs=15
9
That does not match your schemes and at first sight looks rather random.
It seems Slurm uses hwloc's logical indices, whereas the cgroups use the OS/physical indices.
Compare the example above with this excerpt of the full hwloc-ls output:
NUMANode L#1 (P#1 47GB)
L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#3)
L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#4)
L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#5)
L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#9)
L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#10)
L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#11)
L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#15)
L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#16)
L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#17)
L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
That does seem to match.
In short, to get the mapping, one can use:
$> hwloc-ls --only pu
...
PU L#10 (P#19)
PU L#11 (P#20)
PU L#12 (P#3)
PU L#13 (P#4)
PU L#14 (P#5)
PU L#15 (P#9)
PU L#16 (P#10)
PU L#17 (P#11)
PU L#18 (P#15)
PU L#19 (P#16)
PU L#20 (P#17)
PU L#21 (P#21)
PU L#22 (P#22)
PU L#23 (P#23)
...
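To automate the translation, one can ask hwloc-calc for the physical (OS)
index of each logical PU. A rough sketch, assuming hwloc-calc from the
hwloc package, ThreadsPerCore=1 (one PU per core, so Slurm's abstract CPU
IDs line up with hwloc's logical PU indices), and GNU coreutils;
slurm_to_os is just a made-up helper name:

# Print the hwloc logical -> physical (OS) PU mapping of this node.
n=$(nproc)
for (( l = 0; l < n; l++ )); do
    # --po prints physical/OS indexes, -I pu intersects with PU objects
    echo "L#$l -> P#$(hwloc-calc --po -I pu pu:$l)"
done

# Expand a Slurm CPU_IDs list (e.g. "16-18,20") and map each logical
# index to its OS index, i.e. the numbering cpuset.effective_cpus uses.
slurm_to_os () {
    echo "$1" | tr ',' '\n' \
        | while IFS=- read -r a b; do seq "$a" "${b:-$a}"; done \
        | while read -r l; do hwloc-calc --po -I pu "pu:$l"; done \
        | paste -sd, -
}

On the node above, slurm_to_os "16-18,20" should print 10,11,15,17, which
is exactly what the cgroup of job 31495433 reports.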
Best
Marcus
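P.S.: As a cross-check, one could combine your cgroup script with that
mapping and compare all jobs on a node in one go. A sketch (cgroup v1
layout as in your script; slurm_to_os is the toy helper from above):

cd /sys/fs/cgroup/cpuset/slurm
for d in $(find . -name 'job_*') ; do
    j=${d##*_}
    ids=$(scontrol -d show job $j | grep -o 'CPU_IDs=[0-9,-]*' | cut -d= -f2)
    echo "=== $j slurm: $ids -> os: $(slurm_to_os "$ids")"
    echo "cgroup: $(cat $d/cpuset.effective_cpus)"
done

For single-node jobs the two lists should agree if the logical-to-physical
explanation is right.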
On 14.12.2022 at 18:11, Paul Raines wrote:
> Ugh. Guess I cannot count. The mapping on that last node DOES work with the "alternating" scheme where we have
>
> 0 0
> 1 2
> 2 4
> 3 6
> 4 8
> 5 10
> 6 12
> 7 14
> 8 16
> 9 18
> 10 20
> 11 22
> 12 1
> 13 3
> 14 5
> 15 7
> 16 9
> 17 11
> 18 13
> 19 15
> 20 17
> 21 19
> 22 21
> 23 23
>
> so CPU_IDs=8-11,20-23 does correspond to cgroup 16-23
>
> Using the script
>
> cd /sys/fs/cgroup/cpuset/slurm
> for d in $(find . -name 'job*') ; do
>     j=$(echo $d | cut -d_ -f3)
>     echo === $j
>     scontrol -d show job $j | grep CPU_ID | cut -d' ' -f7
>     cat $d/cpuset.effective_cpus
> done
>
> === 1967214
> CPU_IDs=8-11,20-23
> 16-23
> === 1960208
> CPU_IDs=12-19
> 1,3,5,7,9,11,13,15
> === 1966815
> CPU_IDs=0
> 0
> === 1966821
> CPU_IDs=6
> 12
> === 1966818
> CPU_IDs=3
> 6
> === 1966816
> CPU_IDs=1
> 2
> === 1966822
> CPU_IDs=7
> 14
> === 1966820
> CPU_IDs=5
> 10
> === 1966819
> CPU_IDs=4
> 8
> === 1966817
> CPU_IDs=2
> 4
>
> On all my nodes I see just two schemes: the alternating odd/even one above, and one that does not alternate, like on this box with
>
> CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1
>
> === 1966495
> CPU_IDs=0-2
> 0-2
> === 1966498
> CPU_IDs=10-12
> 10-12
> === 1966502
> CPU_IDs=26-28
> 26-28
> === 1960064
> CPU_IDs=7-9,13-25
> 7-9,13-25
> === 1954480
> CPU_IDs=3-6
> 3-6
>
>
> On Wed, 14 Dec 2022 9:42am, Paul Raines wrote:
>
>>
>> Yes, I see that on some of my other machines too. So apicid is definitely not what SLURM is using but somehow just lines up that way on this one machine I have.
>>
>> I think the issue is that cgroups counts, starting at 0, all the cores
>> on the first socket, then all the cores on the second socket. But SLURM
>> (on a two-socket box) counts 0 as the first core on the first socket, 1
>> as the first core on the second socket, 2 as the second core on the
>> first socket, 3 as the second core on the second socket, and so on.
>> (Looks like I am wrong: see below.)
>>
>> Why SLURM does this instead of just using the assignments cgroups uses,
>> I have no idea. Hopefully one of the SLURM developers reads this and
>> can explain.
>>
>> Looking at another SLURM node I have (where cgroups v1 is still in use
>> and HT turned off) with definition
>>
>> CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1
>>
>> I find
>>
>> [root at r440-17 ~]# egrep '^(apicid|proc)' /proc/cpuinfo | tail -4
>> processor : 22
>> apicid : 22
>> processor : 23
>> apicid : 54
>>
>> So apicids are NOT going to work.
>>
>> # scontrol -d show job 1966817 | grep CPU_ID
>> Nodes=r17 CPU_IDs=2 Mem=16384 GRES=
>> # cat /sys/fs/cgroup/cpuset/slurm/uid_3776056/job_1966817/cpuset.cpus
>> 4
>>
>> If Slurm has '2', this should be the second core on the first socket and so should be '1' in cgroups, but it is 4, as we see above, which is the fifth core on the first socket. So I guess I was wrong above.
>>
>> But in /proc/cpuinfo the apicid for processor 4 is 2!!! So is apicid
>> right after all? Nope, on the same machine I have
>>
>> # scontrol -d show job 1960208 | grep CPU_ID
>> Nodes=r17 CPU_IDs=12-19 Mem=51200 GRES=
>> # cat /sys/fs/cgroup/cpuset/slurm/uid_5164679/job_1960208/cpuset.cpus
>> 1,3,5,7,9,11,13,15
>>
>> and in /proc/cpuinfo the apicid for processor 12 is 16
>>
>> # scontrol -d show job 1967214 | grep CPU_ID
>> Nodes=r17 CPU_IDs=8-11,20-23 Mem=51200 GRES=
>> # cat /sys/fs/cgroup/cpuset/slurm/uid_5164679/job_1967214/cpuset.cpus
>> 16-23
>>
>> I am totally lost now. Seems totally random. SLURM devs? Any insight?
>>
>>
>> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>
>>
>>
>> On Wed, 14 Dec 2022 1:33am, Marcus Wagner wrote:
>>
>>> Hi Paul,
>>>
>>> sorry to say, but that has to be some coincidence on your system. I've
>>> never seen Slurm report core numbers higher than the total number of
>>> cores.
>>>
>>> I have e.g. an Intel Platinum 8160 here: 24 cores per socket, no
>>> HyperThreading activated.
>>> Yet here the last lines of /proc/cpuinfo:
>>>
>>> processor : 43
>>> apicid : 114
>>> processor : 44
>>> apicid : 116
>>> processor : 45
>>> apicid : 118
>>> processor : 46
>>> apicid : 120
>>> processor : 47
>>> apicid : 122
>>>
>>> I've never seen Slurm report core numbers > 96 for a job.
>>> Nonetheless, I agree, the cores reported by Slurm mostly have nothing
>>> to do with the cores reported e.g. by cgroups.
>>> Since Slurm creates the cgroups, I wonder why it reports some kind of
>>> abstract core ID; it should know which cores are used, as it creates
>>> the cgroups for the jobs.
>>>
>>> Best
>>> Marcus
>>>
>>> On 13.12.2022 at 16:39, Paul Raines wrote:
>>>>
>>>> Yes, looks like SLURM is using the apicid that is in /proc/cpuinfo.
>>>> The first 14 cpus in /proc/cpuinfo (processors 0-13) have apicid
>>>> 0,2,4,6,8,10,12,14,16,20,22,24,26,28 in /proc/cpuinfo
>>>>
>>>> So after setting CpuSpecList=0,2,4,6,8,10,12,14,16,18,20,22,24,26
>>>> in slurm.conf it appears to be doing what I want
>>>>
>>>> $ echo $SLURM_JOB_ID
>>>> 9
>>>> $ grep -i ^cpu /proc/self/status
>>>> Cpus_allowed: 000f0000,000f0000
>>>> Cpus_allowed_list: 16-19,48-51
>>>> $ scontrol -d show job 9 | grep CPU_ID
>>>> Nodes=larkin CPU_IDs=32-39 Mem=25600 GRES=
>>>>
>>>> apicid=32 is processor=16 and apicid=33 is processor=48 in /proc/cpuinfo
>>>>
>>>> Thanks
>>>>
>>>> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>>>
>>>>
>>>>
>>>> On Tue, 13 Dec 2022 9:52am, Sean Maxwell wrote:
>>>>
>>>>> In the slurm.conf manual they state the CpuSpecList IDs are
>>>>> "abstract", and in the CPU management docs they reinforce the notion
>>>>> that the abstract Slurm IDs are not related to the Linux hardware
>>>>> IDs, so that is probably the source of the behavior. I unfortunately
>>>>> don't have more information.
>>>>>
>>>>> On Tue, Dec 13, 2022 at 9:45 AM Paul Raines
>>>>> <raines at nmr.mgh.harvard.edu>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hmm. Actually looks like confusion between CPU IDs on the system
>>>>>> and what SLURM thinks the IDs are:
>>>>>>
>>>>>> # scontrol -d show job 8
>>>>>> ...
>>>>>> Nodes=foobar CPU_IDs=14-21 Mem=25600 GRES=
>>>>>> ...
>>>>>>
>>>>>> # cat
>>>>>> /sys/fs/cgroup/system.slice/slurmstepd.scope/job_8/cpuset.cpus.effective
>>>>>> 7-10,39-42
>>>>>>
>>>>>>
>>>>>> -- Paul Raines
>>>>>> (http://help.nmr.mgh.harvard.edu)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 13 Dec 2022 9:40am, Paul Raines wrote:
>>>>>>
>>>>>> > Oh but that does explain the CfgTRES=cpu=14. With the CpuSpecList
>>>>>> > below and SlurmdOffSpec I do get CfgTRES=cpu=50, so that makes sense.
>>>>>> >
>>>>>> > The issue remains that though the number of cpus in CpuSpecList
>>>>>> > is taken into account, the exact IDs seem to be ignored.
>>>>>> >
>>>>>> > -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>>>>> >
>>>>>> > On Tue, 13 Dec 2022 9:34am, Paul Raines wrote:
>>>>>> >
>>>>>> >> I have tried it both ways with the same result. The assigned CPUs
>>>>>> >> will be both in and out of the range given to CpuSpecList.
>>>>>> >>
>>>>>> >> I tried setting using commas instead of ranges, so used
>>>>>> >>
>>>>>> >> CpuSpecList=0,1,2,3,4,5,6,7,8,9,10,11,12,13
>>>>>> >>
>>>>>> >> But it still does not work:
>>>>>> >>
>>>>>> >> $ srun -p basic -N 1 --ntasks-per-node=1 --mem=25G \
>>>>>> >>     --time=10:00:00 --cpus-per-task=8 --pty /bin/bash
>>>>>> >> $ grep -i ^cpu /proc/self/status
>>>>>> >> Cpus_allowed:      00000780,00000780
>>>>>> >> Cpus_allowed_list: 7-10,39-42
>>>>>> >>
>>>>>> >> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>>>>> >>
>>>>>> >> On Mon, 12 Dec 2022 10:21am, Sean Maxwell wrote:
>>>>>> >>
>>>>>> >>> Hi Paul,
>>>>>> >>>
>>>>>> >>>> Nodename=foobar \
>>>>>> >>>>     CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 \
>>>>>> >>>>     RealMemory=256312 MemSpecLimit=32768 CpuSpecList=14-63 \
>>>>>> >>>>     TmpDisk=6000000 Gres=gpu:nvidia_rtx_a6000:1
>>>>>> >>>>
>>>>>> >>>> The slurm.conf also has:
>>>>>> >>>>
>>>>>> >>>> ProctrackType=proctrack/cgroup
>>>>>> >>>> TaskPlugin=task/affinity,task/cgroup
>>>>>> >>>> TaskPluginParam=Cores,SlurmdOffSpec,Verbose
>>>>>> >>>
>>>>>> >>> Doesn't setting SlurmdOffSpec tell Slurmd that it should NOT use
>>>>>> >>> the CPUs in the spec list?
>>>>>> >>> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdOffSpec)
>>>>>> >>> In this case, I believe it uses what is left, which is the 0-13.
>>>>>> >>> We are just starting to work on this ourselves, and were looking
>>>>>> >>> at this setting.
>>>>>> >>>
>>>>>> >>> Best,
>>>>>> >>>
>>>>>> >>> -Sean
--
Dipl.-Inf. Marcus Wagner
IT Center
Group: Server, Storage, HPC
Department: Systems and Operations
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de
Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ