<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div dir="ltr"><div>Hi Thomas,</div><div><br></div><div>Indeed, even on my cluster, the CPU ID does not match the physical CPU assigned to the job</div><div><br></div><div style="margin-left:40px"># scontrol show job 24115206_399 -d<br>JobId=24115684 ArrayJobId=24115206 ArrayTaskId=399 JobName=s10</div><div style="margin-left:40px">   JOB_GRES=(null)<br>     Nodes=spartan-bm096 CPU_IDs=50 Mem=4000 GRES=</div><div style="margin-left:40px"><br></div><div style="margin-left:40px">[root@spartan-bm096 ~]# cat /sys/fs/cgroup/cpuset/slurm/uid_11470/job_24115684/cpuset.cpus<br>58</div><div style="margin-left:40px"><br></div><div>I will keep searching. I know we capture the real CPU ID as well, using daemons running on the worker nodes, and we feed that into Ganglia.</div><div><br></div><div>Sean</div><div><br></div><div><div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">--<br>Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead<br>Research Computing Services | Business Services<br>The University of Melbourne, Victoria 3010 Australia<br><br></div></div><br></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 12 Feb 2021 at 06:15, Thomas Zeiser <<a href="mailto:thomas.zeiser@rrze.uni-erlangen.de">thomas.zeiser@rrze.uni-erlangen.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">UoM notice: External email. Be cautious of links, attachments, or impersonation attempts<br>

<br>

Hi Sean,<br>

<br>

unfortunately, the CPU_IDs and GPU IDX given by "scontrol -d show<br>

job JOBID" are not related in any way to the ordering of the<br>

hardware. It seems to be just the sequence of the cores / GPUs<br>

assigned by Slurm.<br>
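
<br>

For monitoring scripts that still want to pick these Slurm-side values out of the "scontrol -d show job" output, here is a minimal sketch. It assumes the detail line has the "Nodes=... CPU_IDs=... Mem=... GRES=..." layout seen in this thread; the helper name parse_alloc_line is hypothetical, and the result is only Slurm's abstract numbering, not the hardware location.<br>

```python
import re


def parse_alloc_line(line):
    """Split one 'Nodes=... CPU_IDs=... Mem=... GRES=...' detail line into a dict.

    Assumption: the fields are whitespace-separated KEY=VALUE tokens, as in the
    scontrol output quoted in this thread.
    """
    fields = dict(tok.split("=", 1) for tok in line.split())
    # The GPU index, if any, sits inside '(IDX:...)' at the end of the GRES field.
    m = re.search(r"\(IDX:([^)]*)\)", fields.get("GRES", ""))
    fields["GPU_IDX"] = m.group(1) if m else None
    return fields


sample = "     Nodes=tg091 CPU_IDs=0-63 Mem=120000 GRES=gpu:a100:1(IDX:0)"
alloc = parse_alloc_line(sample)
print(alloc["Nodes"], alloc["CPU_IDs"], alloc["GPU_IDX"])
```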

<br>

<br>

For reference: The PCI-IDs of the GPUs when run as root outside of<br>

any cgroup:<br>

<br>

| GPU  Name        Persistence-M| Bus-Id        Disp.A |<br>

|   0  A100-SXM4-40GB      On   | 00000000:01:00.0 Off |<br>

|   1  A100-SXM4-40GB      On   | 00000000:41:00.0 Off |<br>

|   2  A100-SXM4-40GB      On   | 00000000:81:00.0 Off |<br>

|   3  A100-SXM4-40GB      On   | 00000000:C1:00.0 Off |<br>

<br>

<br>

<br>

I submitted two jobs, one requesting 1 GPU and one requesting 3 GPUs, to a node with 4<br>

GPUs. Both ran concurrently.<br>

<br>

<br>

Output of the 1st 1 GPU job:<br>

<br>

|   0  A100-SXM4-40GB      On   | 00000000:41:00.0 Off |                    0 |<br>

SLURM_JOB_GPUS=0<br>

GPU_DEVICE_ORDINAL=0<br>

CUDA_VISIBLE_DEVICES=0<br>

     Nodes=tg091 CPU_IDs=0-63 Mem=120000 GRES=gpu:a100:1(IDX:0)<br>

/sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=64-95,192-223<br>

<br>

I understand CUDA_VISIBLE_DEVICES=0 as that is within the cgroup.<br>

However, 00000000:41:00.0 is by no means IDX0; it's only the 1st<br>

GPU assigned on the node by Slurm.<br>

CPU-IDs do not match the cpuset in any way. (CPUs are 2x 64 cores with SMT enabled)<br>
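
<br>

To make the mismatch concrete, here is a small sketch that expands both range lists of the 1 GPU job and compares them. The helper name expand is hypothetical; the two input strings are taken from the output above.<br>

```python
def expand(ranges):
    """Expand a Linux-style list such as '64-95,192-223' into a set of ints."""
    cpus = set()
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        # A single number like '58' has no '-', so hi is empty and we reuse lo.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


slurm_ids = expand("0-63")              # CPU_IDs reported by scontrol -d
cgroup_cpus = expand("64-95,192-223")   # cpuset.cpus seen on the node

# Same number of CPUs, but not a single ID in common.
print(len(slurm_ids), len(cgroup_cpus), sorted(slurm_ids & cgroup_cpus))
```

Both sets contain 64 entries, so the counts agree even though the IDs themselves do not overlap.<br>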

<br>

<br>

Output of the 2nd 3 GPU job running concurrently:<br>

|   0  A100-SXM4-40GB      On   | 00000000:01:00.0 Off |                    0 |<br>

|   1  A100-SXM4-40GB      On   | 00000000:81:00.0 Off |                    0 |<br>

|   2  A100-SXM4-40GB      On   | 00000000:C1:00.0 Off |                    0 |<br>

SLURM_JOB_GPUS=1,2,3<br>

GPU_DEVICE_ORDINAL=0,1,2<br>

CUDA_VISIBLE_DEVICES=0,1,2<br>

     Nodes=tg091 CPU_IDs=64-255 Mem=360000 GRES=gpu:a100:3(IDX:1-3)<br>

/sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=0-63,96-191,224-255<br>

<br>

Again CUDA_VISIBLE_DEVICES=0,1,2 is reasonable within the cgroup.<br>

However, IDX:1-3 or SLURM_JOB_GPUS=1,2,3 does not correspond to the<br>

Bus-IDs which would be 0, 2, 3 according to the non-cgroup output.<br>

Again, no relation between CPU-IDs and cpuset.<br>
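
<br>

One way to recover the hardware position on the node itself is to look up the Bus-Ids a job sees in the root (non-cgroup) listing. The sketch below does that for the 3 GPU job; ROOT_VIEW is copied from the table at the top of this mail, and the function name hardware_indices is hypothetical.<br>

```python
# Bus-Ids in the order reported as root outside of any cgroup (from this mail).
ROOT_VIEW = [
    "00000000:01:00.0",
    "00000000:41:00.0",
    "00000000:81:00.0",
    "00000000:C1:00.0",
]


def hardware_indices(job_bus_ids, root_view=ROOT_VIEW):
    """Map the Bus-Ids visible inside a job to their index in the root view."""
    index_of = {bus.upper(): i for i, bus in enumerate(root_view)}
    return [index_of[bus.upper()] for bus in job_bus_ids]


# Bus-Ids seen by the 3 GPU job that Slurm reports as IDX:1-3.
three_gpu_job = ["00000000:01:00.0", "00000000:81:00.0", "00000000:C1:00.0"]
print(hardware_indices(three_gpu_job))  # [0, 2, 3]
```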

<br>

<br>

<br>

If the jobs are started in reverse order:<br>

<br>

Output of the 3 GPU job started as first job on the node:<br>

|   0  A100-SXM4-40GB      On   | 00000000:01:00.0 Off |                    0 |<br>

|   1  A100-SXM4-40GB      On   | 00000000:41:00.0 Off |                    0 |<br>

|   2  A100-SXM4-40GB      On   | 00000000:C1:00.0 Off |                    0 |<br>

SLURM_JOB_GPUS=0,1,2<br>

GPU_DEVICE_ORDINAL=0,1,2<br>

CUDA_VISIBLE_DEVICES=0,1,2<br>

     Nodes=tg091 CPU_IDs=0-191 Mem=360000 GRES=gpu:a100:3(IDX:0-2)<br>

/sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=0-95,128-223<br>

<br>

=> IDX:0-2 does not correspond to the Bus-IDs which would be 0, 1,<br>

3 according to the non-cgroup output.<br>

<br>

<br>

Output of the 1 GPU job started second but running concurrently:<br>

|   0  A100-SXM4-40GB      On   | 00000000:81:00.0 Off |                    0 |<br>

SLURM_JOB_GPUS=3<br>

GPU_DEVICE_ORDINAL=0<br>

CUDA_VISIBLE_DEVICES=0<br>

     Nodes=tg091 CPU_IDs=192-255 Mem=120000 GRES=gpu:a100:1(IDX:3)<br>

/sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=96-127,224-255<br>

<br>

<br>

If three jobs requesting 1, 2, and 1 GPU are submitted in that<br>

order, it is even worse: the 2 GPU job is assigned to the<br>

2nd socket while the last job fills up the 1st socket. It can<br>

clearly be seen that the IDX in GRES=gpu:a100:2(IDX:1-2) is just incremented and is<br>

not related to the hardware location.<br>

<br>

|   0  A100-SXM4-40GB      On   | 00000000:41:00.0 Off |                    0 |<br>

SLURM_JOB_GPUS=0<br>

GPU_DEVICE_ORDINAL=0<br>

CUDA_VISIBLE_DEVICES=0<br>

     Nodes=tg094 CPU_IDs=0-63 Mem=120000 GRES=gpu:a100:1(IDX:0)<br>

cpuset.cpus=0-31,128-159<br>

<br>

<br>

|   0  A100-SXM4-40GB      On   | 00000000:01:00.0 Off |                    0 |<br>

|   1  A100-SXM4-40GB      On   | 00000000:C1:00.0 Off |                    0 |<br>

SLURM_JOB_GPUS=1,2<br>

GPU_DEVICE_ORDINAL=0,1<br>

CUDA_VISIBLE_DEVICES=0,1<br>

     Nodes=tg094 CPU_IDs=128-255 Mem=240000 GRES=gpu:a100:2(IDX:1-2)<br>

cpuset.cpus=64-127,192-255<br>

<br>

<br>

|   0  A100-SXM4-40GB      On   | 00000000:81:00.0 Off |                    0 |<br>

SLURM_JOB_GPUS=3<br>

GPU_DEVICE_ORDINAL=0<br>

CUDA_VISIBLE_DEVICES=0<br>

     Nodes=tg094 CPU_IDs=64-127 Mem=120000 GRES=gpu:a100:1(IDX:3)<br>

cpuset.cpus=32-63,160-191<br>
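
<br>

The socket placement of these three jobs can be checked with a short sketch. It rests on an ASSUMPTION that is not stated in this mail: logical CPUs 0-127 are the first hardware threads (socket 0 = 0-63, socket 1 = 64-127) and 128-255 are their SMT siblings in the same order. Other machines may enumerate CPUs differently; the names socket_of and sockets_used are hypothetical.<br>

```python
CORES_PER_SOCKET = 64
SOCKETS = 2


def socket_of(cpu):
    """Return the socket of a logical CPU.

    ASSUMPTION: CPUs 0-127 are first hardware threads and 128-255 their SMT
    siblings, so the physical core is cpu modulo 128.
    """
    return (cpu % (CORES_PER_SOCKET * SOCKETS)) // CORES_PER_SOCKET


def sockets_used(ranges):
    """Return the sorted list of sockets touched by a cpuset range list."""
    used = set()
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        for cpu in range(int(lo), int(hi or lo) + 1):
            used.add(socket_of(cpu))
    return sorted(used)


# cpusets of the 1 GPU, 2 GPU and 1 GPU jobs on tg094, in submission order.
for label, cpuset in [
    ("IDX:0", "0-31,128-159"),
    ("IDX:1-2", "64-127,192-255"),
    ("IDX:3", "32-63,160-191"),
]:
    print(label, sockets_used(cpuset))
```

Under that assumption the first and last job land on socket 0 and the 2 GPU job on socket 1, which matches the description above.<br>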

<br>

<br>

<br>

Best regards<br>

<br>

thomas<br>

<br>

On Fri, Feb 05, 2021 at 07:37:37PM +1100, Sean Crosby wrote:<br>

> Hi Thomas,<br>

> <br>

> Add the -d flag to scontrol show job<br>

> <br>

> e.g.<br>

> <br>

> # scontrol show job 23891862 -d<br>

> JobId=23891862 JobName=SPI_DOWN<br>

>    UserId=user1(11283) GroupId=group1(10414) MCS_label=N/A<br>

>    Priority=586 Nice=0 Account=group1 QOS=qos1<br>

>    JobState=RUNNING Reason=None Dependency=(null)<br>

>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0<br>

>    DerivedExitCode=0:0<br>

>    RunTime=2-00:13:58 TimeLimit=7-00:00:00 TimeMin=N/A<br>

>    SubmitTime=2021-02-03T19:19:28 EligibleTime=2021-02-03T19:19:28<br>

>    AccrueTime=2021-02-03T19:19:31<br>

>    StartTime=2021-02-03T19:19:31 EndTime=2021-02-10T19:19:31 Deadline=N/A<br>

>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-03T19:19:31<br>

>    Partition=gpgpu AllocNode:Sid=spartan-login3:222306<br>

>    ReqNodeList=(null) ExcNodeList=(null)<br>

>    NodeList=spartan-gpgpu007<br>

>    BatchHost=spartan-gpgpu007<br>

>    NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*<br>

>    TRES=cpu=6,mem=24000M,node=1,billing=101,gres/gpu=1<br>

>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*<br>

>    JOB_GRES=gpu:1<br>

>      Nodes=spartan-gpgpu007 CPU_IDs=6-11 Mem=24000 GRES=gpu:1(IDX:1)<br>

>    MinCPUsNode=6 MinMemoryCPU=4000M MinTmpDiskNode=0<br>

>    Features=(null) DelayBoot=00:00:00<br>

>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)<br>

> <br>

> Note the CPU_IDs and GPU IDX in the output<br>

> <br>

> Sean<br>

> <br>

> --<br>

> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead<br>

> Research Computing Services | Business Services<br>

> The University of Melbourne, Victoria 3010 Australia<br>

> <br>

> <br>

> <br>

> On Fri, 5 Feb 2021 at 02:01, Thomas Zeiser <<br>

> <a href="mailto:thomas.zeiser@rrze.uni-erlangen.de" target="_blank">thomas.zeiser@rrze.uni-erlangen.de</a>> wrote:<br>

> <br>

> > Dear All,<br>

> ><br>

> > we are running Slurm-20.02.6 and using<br>

> > "SelectType=select/cons_tres" with<br>

> > "SelectTypeParameters=CR_Core_Memory", "TaskPlugin=task/cgroup",<br>

> > and "ProctrackType=proctrack/cgroup". Nodes can be shared between<br>

> > multiple jobs with the partition defaults "ExclusiveUser=no<br>

> > OverSubscribe=No"<br>

> ><br>

> > For monitoring purposes, we'd like to know on the ControlMachine<br>

> > which cores of a batch node are assigned to a specific job. Is<br>

> > there any way (except looking on each batch node itself into<br>

> > /sys/fs/cgroup/cpuset/slurm_*) to get the assigned core ranges or<br>

> > GPU IDs?<br>

> ><br>

> > E.g. from Torque we are used to qstat reporting the assigned cores.<br>

> > However, with Slurm, even "scontrol show job JOBID" does not seem<br>

> > to have any information in that direction.<br>

> ><br>

> > Which GPU is allocated (in case of gres/gpu) would of course<br>

> > also be interesting to know on the ControlMachine.<br>

> ><br>

> ><br>

> > Here's the output we get from scontrol show job; it has the node<br>

> > name and the number of cores assigned but not the "core IDs" (e.g.<br>

> > 32-63)<br>

> ><br>

> > JobId=886 JobName=br-14<br>

> >    UserId=hpc114(1356) GroupId=hpc1(1355) MCS_label=N/A<br>

> >    Priority=1010 Nice=0 Account=hpc1 QOS=normal WCKey=*<br>

> >    JobState=RUNNING Reason=None Dependency=(null)<br>

> >    Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0<br>

> >    RunTime=00:40:09 TimeLimit=1-00:00:00 TimeMin=N/A<br>

> >    SubmitTime=2021-02-04T07:26:51 EligibleTime=2021-02-04T07:26:51<br>

> >    AccrueTime=2021-02-04T07:26:51<br>

> >    StartTime=2021-02-04T07:26:54 EndTime=2021-02-05T07:26:54 Deadline=N/A<br>

> >    PreemptEligibleTime=2021-02-04T07:26:54 PreemptTime=None<br>

> >    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-04T07:26:54<br>

> >    Partition=a100 AllocNode:Sid=gpu001:1743663<br>

> >    ReqNodeList=(null) ExcNodeList=(null)<br>

> >    NodeList=gpu001<br>

> >    BatchHost=gpu001<br>

> >    NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*<br>

> >    TRES=cpu=32,mem=120000M,node=1,billing=32,gres/gpu=1,gres/gpu:a100=1<br>

> >    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*<br>

> >    MinCPUsNode=1 MinMemoryCPU=3750M MinTmpDiskNode=0<br>

> >    Features=(null) DelayBoot=00:00:00<br>

> >    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)<br>

> >    Command=/var/tmp/slurmd_spool/job00877/slurm_script<br>

> >    WorkDir=/home/hpc114/run2<br>

> >    StdErr=/home/hpc114//run2/br-14.o886<br>

> >    StdIn=/dev/null<br>

> >    StdOut=/home/hpc114/run2/br-14.o886<br>

> >    Power=<br>

> >    TresPerNode=gpu:a100:1<br>

> >    MailUser=(null) MailType=NONE<br>

> ><br>

> > Also "scontrol show node" is not helpful<br>

> ><br>

> > NodeName=gpu001 Arch=x86_64 CoresPerSocket=64<br>

> >    CPUAlloc=128 CPUTot=128 CPULoad=4.09<br>

> >    AvailableFeatures=hwperf<br>

> >    ActiveFeatures=hwperf<br>

> >    Gres=gpu:a100:4(S:0-1)<br>

> >    NodeAddr=gpu001 NodeHostName=gpu001 Port=6816 Version=20.02.6<br>

> >    OS=Linux 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021<br>

> >    RealMemory=510000 AllocMem=480000 FreeMem=495922 Sockets=2 Boards=1<br>

> >    State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=80 Owner=N/A<br>

> > MCS_label=N/A<br>

> >    Partitions=a100<br>

> >    BootTime=2021-01-27T16:03:48 SlurmdStartTime=2021-02-03T13:43:05<br>

> >    CfgTRES=cpu=128,mem=510000M,billing=128,gres/gpu=4,gres/gpu:a100=4<br>

> >    AllocTRES=cpu=128,mem=480000M,gres/gpu=4,gres/gpu:a100=4<br>

> >    CapWatts=n/a<br>

> >    CurrentWatts=0 AveWatts=0<br>

> >    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>

> ><br>

> > No information on the four currently running jobs is<br>

> > included, nor on which share of the allocated node is assigned to<br>

> > each individual job.<br>

> ><br>

> ><br>

> > I'd like to see somehow that job 886 got cores 32-63,160-191<br>

> > assigned as seen on the node from /sys/fs/cgroup<br>

> ><br>

> > %cat /sys/fs/cgroup/cpuset/slurm_gpu001/uid_1356/job_886/cpuset.cpus<br>

> > 32-63,160-191<br>

> ><br>

> ><br>

> > Thanks for any ideas!<br>

> ><br>

> > Thomas Zeiser<br>

<br>

</blockquote></div>