Hi Thomas,

Add the -d flag to scontrol show job

e.g.

    # scontrol show job 23891862 -d
    JobId=23891862 JobName=SPI_DOWN
       UserId=user1(11283) GroupId=group1(10414) MCS_label=N/A
       Priority=586 Nice=0 Account=group1 QOS=qos1
       JobState=RUNNING Reason=None Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
       DerivedExitCode=0:0
       RunTime=2-00:13:58 TimeLimit=7-00:00:00 TimeMin=N/A
       SubmitTime=2021-02-03T19:19:28 EligibleTime=2021-02-03T19:19:28
       AccrueTime=2021-02-03T19:19:31
       StartTime=2021-02-03T19:19:31 EndTime=2021-02-10T19:19:31 Deadline=N/A
       SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-03T19:19:31
       Partition=gpgpu AllocNode:Sid=spartan-login3:222306
       ReqNodeList=(null) ExcNodeList=(null)
       NodeList=spartan-gpgpu007
       BatchHost=spartan-gpgpu007
       NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
       TRES=cpu=6,mem=24000M,node=1,billing=101,gres/gpu=1
       Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
       JOB_GRES=gpu:1
         Nodes=spartan-gpgpu007 CPU_IDs=6-11 Mem=24000 GRES=gpu:1(IDX:1)
       MinCPUsNode=6 MinMemoryCPU=4000M MinTmpDiskNode=0
       Features=(null) DelayBoot=00:00:00
       OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)

Note the CPU_IDs and GPU IDX in the output
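
If you want that for every running job at once, something along these lines can be run from the ControlMachine (an untested sketch; adjust the squeue filter to your needs):

    # print the per-node CPU_IDs / GRES IDX detail line for each running job
    for j in $(squeue -h -t RUNNING -o %A); do
        echo "== Job $j"
        scontrol show job "$j" -d | grep CPU_IDs
    done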

Sean

--
Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia

On Fri, 5 Feb 2021 at 02:01, Thomas Zeiser <thomas.zeiser@rrze.uni-erlangen.de> wrote:

Dear All,

We are running Slurm 20.02.6 and using "SelectType=select/cons_tres"
with "SelectTypeParameters=CR_Core_Memory", "TaskPlugin=task/cgroup",
and "ProctrackType=proctrack/cgroup". Nodes can be shared between
multiple jobs with the partition defaults "ExclusiveUser=no
OverSubscribe=No".

For monitoring purposes, we'd like to know on the ControlMachine
which cores of a batch node are assigned to a specific job. Is
there any way (other than looking on each batch node itself into
/sys/fs/cgroup/cpuset/slurm_*) to get the assigned core ranges or
GPU IDs?

E.g., coming from Torque we are used to qstat reporting the assigned
cores. With Slurm, however, even "scontrol show job JOBID" does not
seem to contain any information in that direction.

Knowing which GPU is allocated (in the case of gres/gpu) would of
course also be of interest on the ControlMachine.


Here's the output we get from scontrol show job; it has the node
name and the number of cores assigned but not the "core IDs" (e.g.
32-63):

JobId=886 JobName=br-14
   UserId=hpc114(1356) GroupId=hpc1(1355) MCS_label=N/A
   Priority=1010 Nice=0 Account=hpc1 QOS=normal WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:40:09 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2021-02-04T07:26:51 EligibleTime=2021-02-04T07:26:51
   AccrueTime=2021-02-04T07:26:51
   StartTime=2021-02-04T07:26:54 EndTime=2021-02-05T07:26:54 Deadline=N/A
   PreemptEligibleTime=2021-02-04T07:26:54 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-04T07:26:54
   Partition=a100 AllocNode:Sid=gpu001:1743663
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=gpu001
   BatchHost=gpu001
   NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,mem=120000M,node=1,billing=32,gres/gpu=1,gres/gpu:a100=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3750M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/var/tmp/slurmd_spool/job00877/slurm_script
   WorkDir=/home/hpc114/run2
   StdErr=/home/hpc114//run2/br-14.o886
   StdIn=/dev/null
   StdOut=/home/hpc114/run2/br-14.o886
   Power=
   TresPerNode=gpu:a100:1
   MailUser=(null) MailType=NONE

Also "scontrol show node" is not helpful:

NodeName=gpu001 Arch=x86_64 CoresPerSocket=64
   CPUAlloc=128 CPUTot=128 CPULoad=4.09
   AvailableFeatures=hwperf
   ActiveFeatures=hwperf
   Gres=gpu:a100:4(S:0-1)
   NodeAddr=gpu001 NodeHostName=gpu001 Port=6816 Version=20.02.6
   OS=Linux 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021
   RealMemory=510000 AllocMem=480000 FreeMem=495922 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=80 Owner=N/A MCS_label=N/A
   Partitions=a100
   BootTime=2021-01-27T16:03:48 SlurmdStartTime=2021-02-03T13:43:05
   CfgTRES=cpu=128,mem=510000M,billing=128,gres/gpu=4,gres/gpu:a100=4
   AllocTRES=cpu=128,mem=480000M,gres/gpu=4,gres/gpu:a100=4
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

It contains no information about the four currently running jobs,
nor about which share of the allocated node is assigned to the
individual jobs.
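
We can of course list which jobs are running on that node from the
ControlMachine, e.g. with something like the following (a rough
sketch; the format fields may need adjusting), but that still shows
neither core IDs nor GPU indices:

%squeue -w gpu001 -t RUNNING -o "%A %u %C %b"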


I'd like to see somehow that job 886 got cores 32-63,160-191
assigned, as seen on the node from /sys/fs/cgroup:

%cat /sys/fs/cgroup/cpuset/slurm_gpu001/uid_1356/job_886/cpuset.cpus
32-63,160-191
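
To get this for all jobs on a node, we currently loop over the cgroup
hierarchy on the node itself, roughly like this (a sketch; the exact
paths depend on our cgroup configuration):

for f in /sys/fs/cgroup/cpuset/slurm_*/uid_*/job_*/cpuset.cpus; do
    # print each job's cpuset, e.g. ".../job_886/cpuset.cpus: 32-63,160-191"
    echo "$f: $(cat "$f")"
done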


Thanks for any ideas!

Thomas Zeiser