[slurm-users] How to determine (on the ControlMachine) which cores/gpus are assigned to a job?
Thomas Zeiser
thomas.zeiser at rrze.uni-erlangen.de
Thu Feb 4 15:01:12 UTC 2021
Dear All,
we are running Slurm-20.02.6 and using
"SelectType=select/cons_tres" with
"SelectTypeParameters=CR_Core_Memory", "TaskPlugin=task/cgroup",
and "ProctrackType=proctrack/cgroup". Nodes can be shared between
multiple jobs with the partition defaults "ExclusiveUser=no
OverSubscribe=No"
For monitoring purposes, we'd like to know on the ControlMachine
which cores of a batch node are assigned to a specific job. Is
there any way (other than looking into /sys/fs/cgroup/cpuset/slurm_*
on each batch node itself) to get the assigned core ranges or
GPU IDs?
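(To illustrate the kind of per-node query we would like to avoid,
assuming pdsh is available and the cgroup v1 layout shown at the
end of this mail, it would be something like
  pdsh -w gpu001 'cat /sys/fs/cgroup/cpuset/slurm_gpu001/uid_*/job_886/cpuset.cpus'
run once per node of the job.)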
From Torque, for example, we are used to qstat reporting the
assigned cores. With Slurm, however, even "scontrol show job JOBID"
does not seem to contain any information in that direction.
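(In Torque, qstat -f reported this via the exec_host attribute,
roughly along the lines of "exec_host = gpu001/32-63".)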
Knowing which GPU is allocated (in the case of gres/gpu) would of
course also be of interest on the ControlMachine.
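(On the node itself, the GPU index is visible from inside the job,
e.g. via the environment variables set by the gres/gpu plugin,
assuming the usual setup:
  echo $SLURM_JOB_GPUS $CUDA_VISIBLE_DEVICES
but that again means logging into the node or the job.)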
Here's the output we get from scontrol show job; it has the node
name and the number of cores assigned, but not the "core IDs"
(e.g. 32-63):
JobId=886 JobName=br-14
UserId=hpc114(1356) GroupId=hpc1(1355) MCS_label=N/A
Priority=1010 Nice=0 Account=hpc1 QOS=normal WCKey=*
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:40:09 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2021-02-04T07:26:51 EligibleTime=2021-02-04T07:26:51
AccrueTime=2021-02-04T07:26:51
StartTime=2021-02-04T07:26:54 EndTime=2021-02-05T07:26:54 Deadline=N/A
PreemptEligibleTime=2021-02-04T07:26:54 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-04T07:26:54
Partition=a100 AllocNode:Sid=gpu001:1743663
ReqNodeList=(null) ExcNodeList=(null)
NodeList=gpu001
BatchHost=gpu001
NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=32,mem=120000M,node=1,billing=32,gres/gpu=1,gres/gpu:a100=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=3750M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/var/tmp/slurmd_spool/job00877/slurm_script
WorkDir=/home/hpc114/run2
StdErr=/home/hpc114//run2/br-14.o886
StdIn=/dev/null
StdOut=/home/hpc114/run2/br-14.o886
Power=
TresPerNode=gpu:a100:1
MailUser=(null) MailType=NONE
Also "scontrol show node" is not helpful
NodeName=gpu001 Arch=x86_64 CoresPerSocket=64
CPUAlloc=128 CPUTot=128 CPULoad=4.09
AvailableFeatures=hwperf
ActiveFeatures=hwperf
Gres=gpu:a100:4(S:0-1)
NodeAddr=gpu001 NodeHostName=gpu001 Port=6816 Version=20.02.6
OS=Linux 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021
RealMemory=510000 AllocMem=480000 FreeMem=495922 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=80 Owner=N/A MCS_label=N/A
Partitions=a100
BootTime=2021-01-27T16:03:48 SlurmdStartTime=2021-02-03T13:43:05
CfgTRES=cpu=128,mem=510000M,billing=128,gres/gpu=4,gres/gpu:a100=4
AllocTRES=cpu=128,mem=480000M,gres/gpu=4,gres/gpu:a100=4
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
It contains no information about the four currently running jobs,
nor about which share of the allocated node is assigned to each
individual job.
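squeue at least lists the jobs running on the node, but again only
with CPU counts and GRES totals, not IDs, e.g. something like:
  squeue -w gpu001 -o "%.8i %.10u %.5C %b"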
I'd like to see somehow on the ControlMachine that job 886 got
cores 32-63,160-191 assigned, as seen on the node in /sys/fs/cgroup:
%cat /sys/fs/cgroup/cpuset/slurm_gpu001/uid_1356/job_886/cpuset.cpus
32-63,160-191
Thanks for any ideas!
Thomas Zeiser