[slurm-users] CPUSpecList confusion
Marcus Wagner
wagner at itc.rwth-aachen.de
Wed Dec 14 06:33:23 UTC 2022
Hi Paul,
Sorry to say, but that has to be some coincidence on your system. I have never seen Slurm report core numbers higher than the total number of cores.
I have, for example, an Intel Platinum 8160 here: 24 cores per socket, no hyper-threading activated.
Yet here are the last processor/apicid entries from /proc/cpuinfo:
processor : 43
apicid : 114
processor : 44
apicid : 116
processor : 45
apicid : 118
processor : 46
apicid : 120
processor : 47
apicid : 122
And I have never seen Slurm report core numbers > 96 for a job here.
Nonetheless, I agree: the core IDs reported by Slurm mostly have nothing to do with the core IDs reported, e.g., by the cgroups.
Since Slurm creates the cgroups, I wonder why it reports some kind of abstract core ID; it should know which cores are used, since it creates the cgroups for the jobs.
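For anyone who wants to compare the two views on a node, something along these
lines should show the mismatch directly (a rough sketch, assuming cgroup v2 and
the slurmstepd.scope layout Paul shows below; adjust the job ID and paths for
your site):

  # Slurm's abstract CPU IDs assigned to the job
  scontrol -d show job <jobid> | grep CPU_ID
  # the physical CPUs the kernel actually confines the job to
  cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_<jobid>/cpuset.cpus.effective
  # hardware numbering (CPU/core/socket), to translate between the two views
  lscpu -e=CPU,CORE,SOCKET,NODE
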
Best
Marcus
On 13.12.2022 at 16:39, Paul Raines wrote:
>
> Yes, it looks like SLURM is using the apicid values shown in /proc/cpuinfo.
> The first 14 CPUs (processors 0-13) have apicid 0,2,4,6,8,10,12,14,16,20,22,24,26,28 in /proc/cpuinfo.
>
> So after setting CpuSpecList=0,2,4,6,8,10,12,14,16,18,20,22,24,26
> in slurm.conf it appears to be doing what I want
>
> $ echo $SLURM_JOB_ID
> 9
> $ grep -i ^cpu /proc/self/status
> Cpus_allowed: 000f0000,000f0000
> Cpus_allowed_list: 16-19,48-51
> $ scontrol -d show job 9 | grep CPU_ID
> Nodes=larkin CPU_IDs=32-39 Mem=25600 GRES=
>
> apicid=32 is processor=16 and apicid=33 is processor=48 in /proc/cpuinfo
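>
> For anyone wanting to dump the full mapping, a one-liner along these lines
> (just a sketch, assuming the usual "processor"/"apicid" field layout of
> /proc/cpuinfo) prints the processor/apicid pairs:
>
>   awk '/^processor/{p=$3} /^apicid/{print "processor", p, "-> apicid", $3}' /proc/cpuinfo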
>
> Thanks
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>
>
> On Tue, 13 Dec 2022 9:52am, Sean Maxwell wrote:
>
>> In the slurm.conf manual they state that the CpuSpecList IDs are "abstract",
>> and the CPU management docs reinforce that the abstract Slurm IDs are not
>> related to the Linux hardware IDs, so that is probably the source of the
>> behavior. I unfortunately don't have more information.
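>>
>> One extra data point that may help when comparing the numberings: running
>> slurmd -C on the node prints the topology slurmd itself detects, which should
>> look something like this (illustrative only, exact values depend on the node):
>>
>>   slurmd -C
>>   NodeName=foobar CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=256312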
>>
>> On Tue, Dec 13, 2022 at 9:45 AM Paul Raines <raines at nmr.mgh.harvard.edu>
>> wrote:
>>
>>>
>>> Hmm. Actually it looks like there is confusion between the CPU IDs on the
>>> system and what SLURM thinks the IDs are:
>>>
>>> # scontrol -d show job 8
>>> ...
>>> Nodes=foobar CPU_IDs=14-21 Mem=25600 GRES=
>>> ...
>>>
>>> # cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_8/cpuset.cpus.effective
>>> 7-10,39-42
>>>
>>>
>>> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>>
>>>
>>>
>>> On Tue, 13 Dec 2022 9:40am, Paul Raines wrote:
>>>
>>> >
>>> > Oh, but that does explain the CfgTRES=cpu=14. With the CpuSpecList
>>> > below and SlurmdOffSpec I do get CfgTRES=cpu=50, so that makes sense.
>>> >
>>> > The issue remains that although the number of CPUs in CpuSpecList
>>> > is taken into account, the exact IDs seem to be ignored.
>>> >
>>> >
>>> > -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>> >
>>> >
>>> >
>>> > On Tue, 13 Dec 2022 9:34am, Paul Raines wrote:
>>> >
>>> >>
>>> >> I have tried it both ways, with the same result: the assigned CPUs
>>> >> end up both inside and outside the range given in CpuSpecList.
>>> >>
>>> >> I tried using commas instead of ranges, so I used
>>> >>
>>> >> CpuSpecList=0,1,2,3,4,5,6,7,8,9,10,11,12,13
>>> >>
>>> >> but it still does not work:
>>> >>
>>> >> $ srun -p basic -N 1 --ntasks-per-node=1 --mem=25G \
>>> >> --time=10:00:00 --cpus-per-task=8 --pty /bin/bash
>>> >> $ grep -i ^cpu /proc/self/status
>>> >> Cpus_allowed: 00000780,00000780
>>> >> Cpus_allowed_list: 7-10,39-42
>>> >>
>>> >>
>>> >> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>> >>
>>> >>
>>> >>
>>> >> On Mon, 12 Dec 2022 10:21am, Sean Maxwell wrote:
>>> >>
>>> >>> Hi Paul,
>>> >>>
>>> >>>> Nodename=foobar \
>>> >>>>     CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 \
>>> >>>>     RealMemory=256312 MemSpecLimit=32768 CpuSpecList=14-63 \
>>> >>>>     TmpDisk=6000000 Gres=gpu:nvidia_rtx_a6000:1
>>> >>>>
>>> >>>> The slurm.conf also has:
>>> >>>>
>>> >>>> ProctrackType=proctrack/cgroup
>>> >>>> TaskPlugin=task/affinity,task/cgroup
>>> >>>> TaskPluginParam=Cores,SlurmdOffSpec,Verbose
>>> >>>>
>>> >>>
>>> >>> Doesn't setting SlurmdOffSpec tell slurmd that it should NOT use the CPUs
>>> >>> in the spec list?
>>> >>> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdOffSpec)
>>> >>> In this case, I believe it uses what is left, which is 0-13. We are
>>> >>> just starting to work on this ourselves, and were looking at this setting.
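>>> >>>
>>> >>> One way to sanity-check which CPUs ended up reserved (just a suggestion, not
>>> >>> something tied to CpuSpecList in the docs) is to look at the node record,
>>> >>> which should list the spec settings alongside CfgTRES, e.g.:
>>> >>>
>>> >>>   scontrol show node foobar | grep -Ei 'spec|cfgtres'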
>>> >>>
>>> >>> Best,
>>> >>>
>>> >>> -Sean
>>> >>>
>>> >>
>>> >
--
Dipl.-Inf. Marcus Wagner
IT Center
Gruppe: Server, Storage, HPC
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de
Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ