[slurm-users] CPUSpecList confusion

Sean Maxwell stm at case.edu
Tue Dec 13 15:53:41 UTC 2022


Nice find. Thanks for sharing back.

On Tue, Dec 13, 2022 at 10:39 AM Paul Raines <raines at nmr.mgh.harvard.edu>
wrote:

>
> Yes, it looks like SLURM is using the apicid that is in /proc/cpuinfo.
> The first 14 CPUs (processors 0-13) have apicid
> 0,2,4,6,8,10,12,14,16,18,20,22,24,26 in /proc/cpuinfo.
>
> So after setting CpuSpecList=0,2,4,6,8,10,12,14,16,18,20,22,24,26
> in slurm.conf, it appears to be doing what I want:
>
> $ echo $SLURM_JOB_ID
> 9
> $ grep -i ^cpu /proc/self/status
> Cpus_allowed:   000f0000,000f0000
> Cpus_allowed_list:      16-19,48-51
> $ scontrol -d show job 9 | grep CPU_ID
>       Nodes=larkin CPU_IDs=32-39 Mem=25600 GRES=
>
> apicid=32 is processor=16 and apicid=33 is processor=48 in /proc/cpuinfo
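>
> For anyone else poking at this, a rough way to dump the processor-to-apicid
> mapping and to build a CpuSpecList value from it (untested one-liner sketch,
> assuming the usual /proc/cpuinfo layout and that the first 14 processors are
> the ones to reserve):
>
> $ awk '/^processor/ {p=$3} /^apicid/ {print p, $3}' /proc/cpuinfo
> $ awk '/^processor/ {p=$3} /^apicid/ && p<14 {s=s d $3; d=","} END {print s}' /proc/cpuinfo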
>
> Thanks
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>
>
> On Tue, 13 Dec 2022 9:52am, Sean Maxwell wrote:
>
> > In the slurm.conf manual they state the CpuSpecList IDs are "abstract",
> > and in the CPU management docs they reinforce the notion that the
> > abstract Slurm IDs are not related to the Linux hardware IDs, so that is
> > probably the source of the behavior. I unfortunately don't have more
> > information.
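> >
> > If your Slurm was built against hwloc, one way to get a feel for the
> > topological (socket, core, thread) ordering that the abstract IDs tend to
> > follow, versus the kernel's interleaved processor numbering, is the hwloc
> > lstopo tool (assuming hwloc is installed; this is just a rough aid, not
> > something Slurm documents as the mapping):
> >
> > $ lstopo-no-graphics --only pu
> >
> > Each line comes out as "PU L#<logical index> (P#<OS processor>)", which
> > shows where the two numberings diverge.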
> >
> > On Tue, Dec 13, 2022 at 9:45 AM Paul Raines <raines at nmr.mgh.harvard.edu>
> > wrote:
> >
> >>
> >> Hmm.  It actually looks like confusion between the CPU IDs on the system
> >> and what SLURM thinks the IDs are:
> >>
> >> # scontrol -d show job 8
> >> ...
> >>       Nodes=foobar CPU_IDs=14-21 Mem=25600 GRES=
> >> ...
> >>
> >> # cat
> >> /sys/fs/cgroup/system.slice/slurmstepd.scope/job_8/cpuset.cpus.effective
> >> 7-10,39-42
> >>
> >>
> >> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
> >>
> >>
> >>
> >> On Tue, 13 Dec 2022 9:40am, Paul Raines wrote:
> >>
> >> >
> >> > Oh but that does explain the CfgTRES=cpu=14.  With the CpuSpecList
> >> > below and SlurmdOffSpec I do get CfgTRES=cpu=50 so that makes sense.
> >> >
> >> > The issue remains that though the number of CPUs in CpuSpecList
> >> > is taken into account, the exact IDs seem to be ignored.
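> >> >
> >> > (For reference, a quick way to see that count on the node itself,
> >> > assuming the node is named foobar as in the config quoted below:
> >> >
> >> > $ scontrol show node foobar | grep CfgTRES
> >> >
> >> > which should report cpu=50 once the 14 specialized CPUs are excluded.)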
> >> >
> >> >
> >> > -- Paul Raines (http://help.nmr.mgh.harvard.edu)
> >> >
> >> >
> >> >
> >> > On Tue, 13 Dec 2022 9:34am, Paul Raines wrote:
> >> >
> >> >>
> >> >>  I have tried it both ways with the same result.  The assigned CPUs
> >> >>  end up both inside and outside the range given in CpuSpecList.
> >> >>
> >> >>  I tried using commas instead of ranges, so used
> >> >>
> >> >>  CpuSpecList=0,1,2,3,4,5,6,7,8,9,10,11,12,13
> >> >>
> >> >>  But it still does not work:
> >> >>
> >> >>  $ srun -p basic -N 1 --ntasks-per-node=1 --mem=25G \
> >> >>  --time=10:00:00 --cpus-per-task=8 --pty /bin/bash
> >> >>  $ grep -i ^cpu /proc/self/status
> >> >>  Cpus_allowed:   00000780,00000780
> >> >>  Cpus_allowed_list:      7-10,39-42
> >> >>
> >> >>
> >> >>  -- Paul Raines (http://help.nmr.mgh.harvard.edu)
> >> >>
> >> >>
> >> >>
> >> >>  On Mon, 12 Dec 2022 10:21am, Sean Maxwell wrote:
> >> >>
> >> >>>   Hi Paul,
> >> >>>
> >> >>>   Nodename=foobar \
> >> >>>>      CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 \
> >> >>>>      RealMemory=256312 MemSpecLimit=32768 CpuSpecList=14-63 \
> >> >>>>      TmpDisk=6000000 Gres=gpu:nvidia_rtx_a6000:1
> >> >>>>
> >> >>>>   The slurm.conf also has:
> >> >>>>
> >> >>>>   ProctrackType=proctrack/cgroup
> >> >>>>   TaskPlugin=task/affinity,task/cgroup
> >> >>>>   TaskPluginParam=Cores,SlurmdOffSpec,Verbose
> >> >>>>
> >> >>>
> >> >>>   Doesn't setting SlurmdOffSpec tell slurmd that it should NOT use
> >> >>>   the CPUs in the spec list?
> >> >>>   (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdOffSpec)
> >> >>>   In this case, I believe it uses what is left, which is the 0-13.
> >> >>>   We are just starting to work on this ourselves, and were looking
> >> >>>   at this setting.
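> >> >>>
> >> >>>   A quick way to check which CPUs slurmd itself actually ends up on
> >> >>>   (rough check, assuming a single slurmd process on the node):
> >> >>>
> >> >>>   # grep Cpus_allowed_list /proc/$(pidof slurmd)/status
> >> >>>
> >> >>>   With SlurmdOffSpec set, the expectation would be that this shows
> >> >>>   only the non-specialized CPUs.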
> >> >>>
> >> >>>   Best,
> >> >>>
> >> >>>   -Sean
> >> >>>
> >> >>
> >> >
> >>
>
>

