[slurm-users] [EXT] How to determine (on the ControlMachine) which cores/gpus are assigned to a job?

Fri Feb 12 09:08:13 UTC 2021

Hi Thomas,

Indeed, even on my cluster, the CPU_IDs value does not match the physical
CPU assigned to the job:

# scontrol show job 24115206_399 -d
JobId=24115684 ArrayJobId=24115206 ArrayTaskId=399 JobName=s10
   JOB_GRES=(null)
     Nodes=spartan-bm096 CPU_IDs=50 Mem=4000 GRES=

[root@spartan-bm096 ~]# cat /sys/fs/cgroup/cpuset/slurm/uid_11470/job_24115684/cpuset.cpus
58
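
For anyone scripting this comparison, here is a minimal Python sketch that
pulls the per-node allocation lines out of "scontrol show job -d" output.
It assumes only the "Nodes=... CPU_IDs=... Mem=... GRES=..." layout shown
above; the name parse_alloc_lines is hypothetical, and other Slurm versions
may format the line differently.

```python
import re

# Matches the per-node detail line printed by "scontrol show job -d", e.g.
#   Nodes=spartan-bm096 CPU_IDs=50 Mem=4000 GRES=
# The layout is taken from the output in this thread; treat it as an assumption.
ALLOC_RE = re.compile(
    r"^\s*Nodes=(?P<nodes>\S+)\s+CPU_IDs=(?P<cpu_ids>\S+)"
    r"\s+Mem=(?P<mem>\d+)\s+GRES=(?P<gres>\S*)\s*$"
)


def parse_alloc_lines(text):
    """Return one dict per 'Nodes=... CPU_IDs=...' line found in text."""
    allocs = []
    for line in text.splitlines():
        match = ALLOC_RE.match(line)
        if match:
            fields = match.groupdict()
            fields["mem"] = int(fields["mem"])  # Mem is printed as a plain integer
            allocs.append(fields)
    return allocs
```

Note that the CPU_IDs string it returns is Slurm's own numbering, which is
exactly the value that does not line up with cpuset.cpus here.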

I will keep searching. I know we capture the real CPU ID as well, using
daemons running on the worker nodes, and we feed that into Ganglia.
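
A node-side collector of that kind could look roughly like the sketch
below. This is not the code behind the Ganglia feed, only an illustration:
the function names are hypothetical, and the directory layout is taken from
the cgroup v1 paths shown in this thread ("slurm" in the path above,
"slurm_<hostname>" in Thomas's paths).

```python
from pathlib import Path


def expand_cpu_list(spec):
    """Expand a Linux CPU list such as '32-63,160-191' into a sorted list of ints."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A single CPU ("58") has no upper bound, so fall back to lo.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def job_cpuset(job_id, uid, slurm_dir="slurm", root="/sys/fs/cgroup/cpuset"):
    """Read the CPUs a job is confined to; must run on the compute node itself.

    slurm_dir and root are assumptions based on the paths in this thread.
    """
    path = Path(root) / slurm_dir / f"uid_{uid}" / f"job_{job_id}" / "cpuset.cpus"
    return expand_cpu_list(path.read_text())
```

Called as job_cpuset(24115684, 11470) on spartan-bm096, this would read the
same file as the cat command above.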

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia

On Fri, 12 Feb 2021 at 06:15, Thomas Zeiser <
thomas.zeiser at rrze.uni-erlangen.de> wrote:

> Hi Sean,
>
> unfortunately, the CPU_IDs and GPU IDX given by "scontrol -d show
> job JOBID" are not related in any way to the ordering of the
> hardware. It seems to be just the sequence of the cores / GPUs
> assigned by Slurm.
>
>
> For reference: The PCI-IDs of the GPUs when run as root outside of
> any cgroup:
>
> | GPU  Name        Persistence-M| Bus-Id        Disp.A |
> |   0  A100-SXM4-40GB      On   | 00000000:01:00.0 Off |
> |   1  A100-SXM4-40GB      On   | 00000000:41:00.0 Off |
> |   2  A100-SXM4-40GB      On   | 00000000:81:00.0 Off |
> |   3  A100-SXM4-40GB      On   | 00000000:C1:00.0 Off |
>
>
>
> I submitted two jobs, one requesting 1 GPU and one requesting 3 GPUs,
> to a node with 4 GPUs. Both ran concurrently.
>
>
> Output of the first job (1 GPU):
>
> |   0  A100-SXM4-40GB      On   | 00000000:41:00.0 Off |                    0 |
> SLURM_JOB_GPUS=0
> GPU_DEVICE_ORDINAL=0
> CUDA_VISIBLE_DEVICES=0
>      Nodes=tg091 CPU_IDs=0-63 Mem=120000 GRES=gpu:a100:1(IDX:0)
> /sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=64-95,192-223
>
> CUDA_VISIBLE_DEVICES=0 makes sense, as it is relative to the cgroup.
> However, 00000000:41:00.0 is by no means IDX 0; it is only the first
> GPU assigned on the node by Slurm.
> The CPU_IDs do not match the cpuset in any way. (The CPUs are 2x 64
> cores with SMT enabled.)
>
>
> Output of the second job (3 GPUs), running concurrently:
> |   0  A100-SXM4-40GB      On   | 00000000:01:00.0 Off |                    0 |
> |   1  A100-SXM4-40GB      On   | 00000000:81:00.0 Off |                    0 |
> |   2  A100-SXM4-40GB      On   | 00000000:C1:00.0 Off |                    0 |
> SLURM_JOB_GPUS=1,2,3
> GPU_DEVICE_ORDINAL=0,1,2
> CUDA_VISIBLE_DEVICES=0,1,2
>      Nodes=tg091 CPU_IDs=64-255 Mem=360000 GRES=gpu:a100:3(IDX:1-3)
> /sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=0-63,96-191,224-255
>
> Again, CUDA_VISIBLE_DEVICES=0,1,2 is reasonable within the cgroup.
> However, IDX:1-3 and SLURM_JOB_GPUS=1,2,3 do not correspond to the
> Bus-IDs, which would be GPUs 0, 2, 3 according to the non-cgroup output.
> Again, there is no relation between CPU_IDs and cpuset.
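
The translation from Bus-Ids back to host GPU numbers used in this
comparison can be sketched as follows. The host table is the root-level
listing quoted earlier; host_indices is a hypothetical helper, not
something Slurm or nvidia-smi provides.

```python
# Bus-Id -> GPU index as seen by root outside any cgroup (from the listing above).
HOST_GPUS = {
    "00000000:01:00.0": 0,
    "00000000:41:00.0": 1,
    "00000000:81:00.0": 2,
    "00000000:C1:00.0": 3,
}


def host_indices(job_bus_ids, host_gpus=HOST_GPUS):
    """Map the Bus-Ids visible inside a job to the host-level GPU indices."""
    return [host_gpus[bus_id.upper()] for bus_id in job_bus_ids]


# 3 GPU job: Slurm reports IDX:1-3, but the Bus-Ids point at host GPUs 0, 2, 3.
print(host_indices(["00000000:01:00.0", "00000000:81:00.0", "00000000:C1:00.0"]))
# 1 GPU job: Slurm reports IDX:0, but the Bus-Id points at host GPU 1.
print(host_indices(["00000000:41:00.0"]))
```
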
>
>
>
> If the jobs are started in reverse order:
>
> Output of the 3 GPU job started as first job on the node:
> |   0  A100-SXM4-40GB      On   | 00000000:01:00.0 Off |                    0 |
> |   1  A100-SXM4-40GB      On   | 00000000:41:00.0 Off |                    0 |
> |   2  A100-SXM4-40GB      On   | 00000000:C1:00.0 Off |                    0 |
> SLURM_JOB_GPUS=0,1,2
> GPU_DEVICE_ORDINAL=0,1,2
> CUDA_VISIBLE_DEVICES=0,1,2
>      Nodes=tg091 CPU_IDs=0-191 Mem=360000 GRES=gpu:a100:3(IDX:0-2)
> /sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=0-95,128-223
>
> => IDX:0-2 does not correspond to the Bus-IDs which would be 0, 1,
> 3 according to the non-cgroup output.
>
>
> Output of the 1 GPU job started second but running concurrently:
> |   0  A100-SXM4-40GB      On   | 00000000:81:00.0 Off |                    0 |
> SLURM_JOB_GPUS=3
> GPU_DEVICE_ORDINAL=0
> CUDA_VISIBLE_DEVICES=0
>      Nodes=tg091 CPU_IDs=192-255 Mem=120000 GRES=gpu:a100:1(IDX:3)
> /sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=96-127,224-255
>
>
> If three jobs requesting 1, 2, and 1 GPU are submitted in that
> order, it is even worse, as the 2 GPU job is assigned to the
> 2nd socket while the last job fills up the 1st socket. It can
> clearly be seen that the IDX value in GRES=gpu:a100:2(IDX:1-2) is
> just incremented and is not related to the hardware location.
>
> |   0  A100-SXM4-40GB      On   | 00000000:41:00.0 Off |                    0 |
> SLURM_JOB_GPUS=0
> GPU_DEVICE_ORDINAL=0
> CUDA_VISIBLE_DEVICES=0
>      Nodes=tg094 CPU_IDs=0-63 Mem=120000 GRES=gpu:a100:1(IDX:0)
> 0-31,128-159
>
>
> |   0  A100-SXM4-40GB      On   | 00000000:01:00.0 Off |                    0 |
> |   1  A100-SXM4-40GB      On   | 00000000:C1:00.0 Off |                    0 |
> SLURM_JOB_GPUS=1,2
> GPU_DEVICE_ORDINAL=0,1
> CUDA_VISIBLE_DEVICES=0,1
>      Nodes=tg094 CPU_IDs=128-255 Mem=240000 GRES=gpu:a100:2(IDX:1-2)
> 64-127,192-255
>
>
> |   0  A100-SXM4-40GB      On   | 00000000:81:00.0 Off |                    0 |
> SLURM_JOB_GPUS=3
> GPU_DEVICE_ORDINAL=0
> CUDA_VISIBLE_DEVICES=0
>      Nodes=tg094 CPU_IDs=64-127 Mem=120000 GRES=gpu:a100:1(IDX:3)
> 32-63,160-191
>
>
>
> Best regards
>
> thomas
>
> On Fri, Feb 05, 2021 at 07:37:37PM +1100, Sean Crosby wrote:
> > Hi Thomas,
> >
> > Add the -d flag to scontrol show job
> >
> > e.g.
> >
> > # scontrol show job 23891862 -d
> > JobId=23891862 JobName=SPI_DOWN
> >    UserId=user1(11283) GroupId=group1(10414) MCS_label=N/A
> >    Priority=586 Nice=0 Account=group1 QOS=qos1
> >    JobState=RUNNING Reason=None Dependency=(null)
> >    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> >    DerivedExitCode=0:0
> >    RunTime=2-00:13:58 TimeLimit=7-00:00:00 TimeMin=N/A
> >    SubmitTime=2021-02-03T19:19:28 EligibleTime=2021-02-03T19:19:28
> >    AccrueTime=2021-02-03T19:19:31
> >    StartTime=2021-02-03T19:19:31 EndTime=2021-02-10T19:19:31 Deadline=N/A
> >    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-03T19:19:31
> >    Partition=gpgpu AllocNode:Sid=spartan-login3:222306
> >    ReqNodeList=(null) ExcNodeList=(null)
> >    NodeList=spartan-gpgpu007
> >    BatchHost=spartan-gpgpu007
> >    NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
> >    TRES=cpu=6,mem=24000M,node=1,billing=101,gres/gpu=1
> >    Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
> >    JOB_GRES=gpu:1
> >      Nodes=spartan-gpgpu007 CPU_IDs=6-11 Mem=24000 GRES=gpu:1(IDX:1)
> >    MinCPUsNode=6 MinMemoryCPU=4000M MinTmpDiskNode=0
> >    Features=(null) DelayBoot=00:00:00
> >    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >
> > Note the CPU_IDs and GPU IDX in the output
> >
> > Sean
> >
> > --
> > Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> > Research Computing Services | Business Services
> > The University of Melbourne, Victoria 3010 Australia
> >
> >
> >
> > On Fri, 5 Feb 2021 at 02:01, Thomas Zeiser <
> > thomas.zeiser at rrze.uni-erlangen.de> wrote:
> >
> > > Dear All,
> > >
> > > we are running Slurm-20.02.6 and using
> > > "SelectType=select/cons_tres" with
> > > "SelectTypeParameters=CR_Core_Memory", "TaskPlugin=task/cgroup",
> > > and "ProctrackType=proctrack/cgroup". Nodes can be shared between
> > > multiple jobs with the partition defaults "ExclusiveUser=no
> > > OverSubscribe=No"
> > >
> > > For monitoring purposes, we'd like to know on the ControlMachine
> > > which cores of a batch node are assigned to a specific job. Is
> > > there any way (other than looking into
> > > /sys/fs/cgroup/cpuset/slurm_* on each batch node) to get the
> > > assigned core ranges or GPU IDs?
> > >
> > > E.g. from Torque we are used to qstat reporting the assigned cores.
> > > However, with Slurm, even "scontrol show job JOBID" does not seem
> > > to have any information in that direction.
> > >
> > > Of course, it would also be interesting to know on the
> > > ControlMachine which GPU is allocated (in the case of gres/gpu).
> > >
> > >
> > > Here's the output we get from scontrol show job; it has the node
> > > name and the number of cores assigned but not the "core IDs" (e.g.
> > > 32-63)
> > >
> > > JobId=886 JobName=br-14
> > >    UserId=hpc114(1356) GroupId=hpc1(1355) MCS_label=N/A
> > >    Priority=1010 Nice=0 Account=hpc1 QOS=normal WCKey=*
> > >    JobState=RUNNING Reason=None Dependency=(null)
> > >    Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> > >    RunTime=00:40:09 TimeLimit=1-00:00:00 TimeMin=N/A
> > >    SubmitTime=2021-02-04T07:26:51 EligibleTime=2021-02-04T07:26:51
> > >    AccrueTime=2021-02-04T07:26:51
> > >    StartTime=2021-02-04T07:26:54 EndTime=2021-02-05T07:26:54 Deadline=N/A
> > >    PreemptEligibleTime=2021-02-04T07:26:54 PreemptTime=None
> > >    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-04T07:26:54
> > >    Partition=a100 AllocNode:Sid=gpu001:1743663
> > >    ReqNodeList=(null) ExcNodeList=(null)
> > >    NodeList=gpu001
> > >    BatchHost=gpu001
> > >    NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> > >    TRES=cpu=32,mem=120000M,node=1,billing=32,gres/gpu=1,gres/gpu:a100=1
> > >    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> > >    MinCPUsNode=1 MinMemoryCPU=3750M MinTmpDiskNode=0
> > >    Features=(null) DelayBoot=00:00:00
> > >    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> > >    Command=/var/tmp/slurmd_spool/job00877/slurm_script
> > >    WorkDir=/home/hpc114/run2
> > >    StdErr=/home/hpc114//run2/br-14.o886
> > >    StdIn=/dev/null
> > >    StdOut=/home/hpc114/run2/br-14.o886
> > >    Power=
> > >    TresPerNode=gpu:a100:1
> > >    MailUser=(null) MailType=NONE
> > >
> > > Also "scontrol show node" is not helpful
> > >
> > > NodeName=gpu001 Arch=x86_64 CoresPerSocket=64
> > >    CPUAlloc=128 CPUTot=128 CPULoad=4.09
> > >    AvailableFeatures=hwperf
> > >    ActiveFeatures=hwperf
> > >    Gres=gpu:a100:4(S:0-1)
> > >    NodeAddr=gpu001 NodeHostName=gpu001 Port=6816 Version=20.02.6
> > >    OS=Linux 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021
> > >    RealMemory=510000 AllocMem=480000 FreeMem=495922 Sockets=2 Boards=1
> > >    State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=80 Owner=N/A MCS_label=N/A
> > >    Partitions=a100
> > >    BootTime=2021-01-27T16:03:48 SlurmdStartTime=2021-02-03T13:43:05
> > >    CfgTRES=cpu=128,mem=510000M,billing=128,gres/gpu=4,gres/gpu:a100=4
> > >    AllocTRES=cpu=128,mem=480000M,gres/gpu=4,gres/gpu:a100=4
> > >    CapWatts=n/a
> > >    CurrentWatts=0 AveWatts=0
> > >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> > >
> > > No information about the four currently running jobs is
> > > included, nor about which share of the allocated node is
> > > assigned to each individual job.
> > >
> > >
> > > I'd like to see somehow that job 886 got cores 32-63,160-191
> > > assigned, as seen on the node from /sys/fs/cgroup:
> > >
> > > %cat /sys/fs/cgroup/cpuset/slurm_gpu001/uid_1356/job_886/cpuset.cpus
> > > 32-63,160-191
> > >
> > >
> > > Thanks for any ideas!
> > >
> > > Thomas Zeiser
>
>