[slurm-users] CPUSpecList confusion

Paul Raines raines at nmr.mgh.harvard.edu
Tue Dec 13 15:39:01 UTC 2022


Yes, it looks like SLURM is using the apicid values from /proc/cpuinfo.
The first 14 CPUs (processors 0-13) have apicid
0,2,4,6,8,10,12,14,16,18,20,22,24,26 in /proc/cpuinfo.
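For reference, the full processor-to-apicid mapping can be dumped with
something like the following (just a sketch; the field names are as they
appear in /proc/cpuinfo on x86):

$ awk -F': ' '/^processor/{p=$2} /^apicid/{print "processor="p" apicid="$2}' /proc/cpuinfo   # sketch: one "processor=N apicid=M" line per logical CPU

On this box that should print, among others, "processor=16 apicid=32"
(see below).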

So after setting CpuSpecList=0,2,4,6,8,10,12,14,16,18,20,22,24,26
in slurm.conf, it appears to be doing what I want:

$ echo $SLURM_JOB_ID
9
$ grep -i ^cpu /proc/self/status
Cpus_allowed:   000f0000,000f0000
Cpus_allowed_list:      16-19,48-51
$ scontrol -d show job 9 | grep CPU_ID
      Nodes=larkin CPU_IDs=32-39 Mem=25600 GRES=

apicid=32 is processor=16 and apicid=33 is processor=48 in /proc/cpuinfo
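If it keeps behaving, the node definition quoted further down the thread
could be rewritten along these lines (a sketch only, not a tested config;
everything except CpuSpecList is copied from the Nodename line below):

# sketch: same node definition as quoted below, but CpuSpecList now lists
# the apicid values of processors 0-13 instead of 14-63
NodeName=foobar CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 \
    RealMemory=256312 MemSpecLimit=32768 \
    CpuSpecList=0,2,4,6,8,10,12,14,16,18,20,22,24,26 \
    TmpDisk=6000000 Gres=gpu:nvidia_rtx_a6000:1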

Thanks

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Tue, 13 Dec 2022 9:52am, Sean Maxwell wrote:

>
> In the slurm.conf manual they state that the CpuSpecList IDs are "abstract", and
> in the CPU management docs they reinforce that the abstract Slurm IDs are not
> related to the Linux hardware IDs, so that is probably the source of the
> behavior. I unfortunately don't have more information.
>
> On Tue, Dec 13, 2022 at 9:45 AM Paul Raines <raines at nmr.mgh.harvard.edu>
> wrote:
>
>>
>> Hmm.  Actually it looks like there is confusion between the CPU IDs on the
>> system and what SLURM thinks the IDs are:
>>
>> # scontrol -d show job 8
>> ...
>>       Nodes=foobar CPU_IDs=14-21 Mem=25600 GRES=
>> ...
>>
>> # cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_8/cpuset.cpus.effective
>> 7-10,39-42
>>
>>
>> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>
>>
>>
>> On Tue, 13 Dec 2022 9:40am, Paul Raines wrote:
>>
>> >
>> > Oh but that does explain the CfgTRES=cpu=14.  With the CpuSpecList
>> > below and SlurmdOffSpec I do get CfgTRES=cpu=50 so that makes sense.
>> >
>> > The issue remains that though the number of CPUs in CpuSpecList
>> > is taken into account, the exact IDs seem to be ignored.
>> >
>> >
>> > -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>> >
>> >
>> >
>> > On Tue, 13 Dec 2022 9:34am, Paul Raines wrote:
>> >
>> >>
>> >>  I have tried it both ways with the same result.  The assigned CPUs
>> >>  end up both inside and outside the range given to CpuSpecList.
>> >>
>> >>  I tried setting using commas instead of ranges so used
>> >>
>> >>  CpuSpecList=0,1,2,3,4,5,6,7,8,9,10,11,12,13
>> >>
>> >>  But it still does not work:
>> >>
>> >>  $ srun -p basic -N 1 --ntasks-per-node=1 --mem=25G \
>> >>  --time=10:00:00 --cpus-per-task=8 --pty /bin/bash
>> >>  $ grep -i ^cpu /proc/self/status
>> >>  Cpus_allowed:   00000780,00000780
>> >>  Cpus_allowed_list:      7-10,39-42
>> >>
>> >>
>> >>  -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>> >>
>> >>
>> >>
>> >>  On Mon, 12 Dec 2022 10:21am, Sean Maxwell wrote:
>> >>
>> >>>   Hi Paul,
>> >>>
>> >>>>   Nodename=foobar \
>> >>>>      CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 \
>> >>>>      RealMemory=256312 MemSpecLimit=32768 CpuSpecList=14-63 \
>> >>>>      TmpDisk=6000000 Gres=gpu:nvidia_rtx_a6000:1
>> >>>>
>> >>>>   The slurm.conf also has:
>> >>>>
>> >>>>   ProctrackType=proctrack/cgroup
>> >>>>   TaskPlugin=task/affinity,task/cgroup
>> >>>>   TaskPluginParam=Cores,SlurmdOffSpec,Verbose
>> >>>>
>> >>>
>> >>>   Doesn't setting SlurmdOffSpec tell Slurmd that it should NOT use the
>> >>>   CPUs in the spec list?
>> >>>   (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdOffSpec)
>> >>>   In this case, I believe it uses what is left, which is 0-13.  We are
>> >>>   just starting to work on this ourselves, and were looking at this
>> >>>   setting.
>> >>>
>> >>>   Best,
>> >>>
>> >>>   -Sean
>> >>>
>> >>
>> >



