[slurm-users] How to view GPU indices of the completed jobs?

Tue Jun 23 13:02:05 UTC 2020

Hi Kota,

thanks for the hint.

Still, I'm a little astonished, because if I remember right,
CUDA_VISIBLE_DEVICES in a cgroup always starts from zero. That was
already the case years ago, when we still used LSF.

But SLURM_JOB_GPUS seems to be the right thing:

Same node, two different users (and therefore two different jobs):

$> xargs --null --max-args=1 echo < /proc/32719/environ | egrep "GPU|CUDA"
SLURM_JOB_GPUS=0
CUDA_VISIBLE_DEVICES=0
GPU_DEVICE_ORDINAL=0

$> xargs --null --max-args=1 echo < /proc/109479/environ | egrep "GPU|CUDA"
SLURM_MEM_PER_GPU=6144
SLURM_JOB_GPUS=1
CUDA_VISIBLE_DEVICES=0
GPU_DEVICE_ORDINAL=0
CUDA_ROOT=/usr/local_rwth/sw/cuda/10.1.243
CUDA_PATH=/usr/local_rwth/sw/cuda/10.1.243
CUDA_VERSION=101
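
In case someone wants to repeat this check, here is a minimal sketch of the same idea as a shell function. The name gpu_env_from_environ is made up for illustration. It takes the path of a NUL-separated environ file (normally /proc/<PID>/environ) and uses tr instead of xargs to split the entries.

```shell
# Hypothetical helper (not part of Slurm): print the GPU/CUDA-related
# variables found in a NUL-separated environ file.
gpu_env_from_environ() {
    # $1: path to the environ file, e.g. /proc/<PID>/environ
    tr '\0' '\n' < "$1" | grep -E 'GPU|CUDA'
}

# Intended use (reading another user's environ usually needs root):
#   gpu_env_from_environ /proc/32719/environ
```

Splitting with tr avoids starting one echo process per variable, but the result should be the same as with the xargs call above.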

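SLURM_JOB_GPUS differs between the two jobs, while CUDA_VISIBLE_DEVICES
is 0 for both. scontrol shows one job on each GPU index: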

$> scontrol show -d job 14658274
...
Nodes=nrg02 CPU_IDs=24 Mem=8192 GRES_IDX=gpu:volta(IDX:1)

$> scontrol show -d job 14673550
...
Nodes=nrg02 CPU_IDs=0 Mem=8192 GRES_IDX=gpu:volta(IDX:0)
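
To pull just the index out of that output, something like the following might do. It is only a sketch: gres_idx is a made-up name, it reads the output of `scontrol show -d job` on stdin, and it assumes a single GRES entry per GRES_IDX field in the format shown above.

```shell
# Hypothetical helper: print the value inside "(IDX:...)" for every
# line that carries a GRES_IDX field. Reads scontrol output on stdin.
gres_idx() {
    sed -n 's/.*GRES_IDX=[^ ]*(IDX:\([^)]*\)).*/\1/p'
}

# Intended use (works only while scontrol still knows the job):
#   scontrol show -d job 14658274 | gres_idx
```

With several GRES types in one field, the pattern would report only the last one, so it would need extending for such nodes.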

Is there anyone out there who can confirm this?
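
If SLURM_JOB_GPUS turns out to be reliable, recording it at job start would answer the original question for completed jobs. A rough sketch that could be called from a prolog or from the job script itself follows. The name log_job_gpus and the log path are hypothetical, and I am assuming that SLURM_JOB_ID and SLURM_JOB_GPUS are set wherever the function runs, which should be checked against the prolog documentation of the Slurm version in use.

```shell
# Hypothetical helper: append "<UTC timestamp> <host> job=<id> gpus=<list>"
# to a log file, so the GPU list survives after the job has completed.
log_job_gpus() {
    logfile=$1
    # Nothing to record for jobs without GPUs.
    [ -n "${SLURM_JOB_GPUS:-}" ] || return 0
    printf '%s %s job=%s gpus=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(hostname)" \
        "${SLURM_JOB_ID:-unknown}" "$SLURM_JOB_GPUS" >> "$logfile"
}

# Example call; the path is an assumption, not a Slurm default:
#   log_job_gpus /var/log/slurm/job_gpus.log
```

A later `grep 'job=<id> ' logfile` would then give the node and GPU list for a finished job.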

Best
Marcus

On 23.06.2020 at 04:51, Kota Tsuyuzaki wrote:
>> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
>> starts from zero. So this is NOT the index of the GPU.
> 
> Thanks. Just FYI, when I tested the environment variables with Slurm 19.05.2 and a proctrack/cgroup configuration, CUDA_VISIBLE_DEVICES appeared to match the device indices on the host (i.e. it did not start from zero). I'm not sure whether this behavior has changed in newer Slurm versions, though.
> 
> I also found that SLURM_JOB_GPUS and GPU_DEVICE_ORDINAL were set in the environment, which can be useful. In my current tests, those variables had the same values as CUDA_VISIBLE_DEVICES.
> 
> Any advice on what I should look for is always welcome.
> 
> Best,
> Kota
> 
>> -----Original Message-----
>> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Marcus Wagner
>> Sent: Tuesday, June 16, 2020 9:17 PM
>> To: slurm-users at lists.schedmd.com
>> Subject: Re: [slurm-users] How to view GPU indices of the completed jobs?
>>
>> Hi David,
>>
>> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
>> starts from zero. So this is NOT the index of the GPU.
>>
>> Just verified it:
>> $> nvidia-smi
>> Tue Jun 16 13:28:47 2020
>> +-----------------------------------------------------------------------------+
>> | NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
>> ...
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                       GPU Memory |
>> |  GPU       PID   Type   Process name                             Usage      |
>> |=============================================================================|
>> |    0     17269      C   gmx_mpi                                      679MiB |
>> |    1     19246      C   gmx_mpi                                      513MiB |
>> +-----------------------------------------------------------------------------+
>>
>> $> squeue -w nrg04
>>                JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>             14560009  c18g_low     egf5 bk449967  R 1-00:17:48      1 nrg04
>>             14560005  c18g_low     egf1 bk449967  R 1-00:20:23      1 nrg04
>>
>>
>> $> scontrol show job -d 14560005
>> ...
>>      Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
>>        Nodes=nrg04 CPU_IDs=0-23 Mem=93600 GRES_IDX=gpu(IDX:0)
>>
>> $> scontrol show job -d 14560009
>> JobId=14560009 JobName=egf5
>> ...
>>      Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
>>        Nodes=nrg04 CPU_IDs=24-47 Mem=93600 GRES_IDX=gpu(IDX:1)
>>
>> Using the PIDs from the nvidia-smi output:
>>
>> $> xargs --null --max-args=1 echo < /proc/17269/environ | grep CUDA_VISIBLE
>> CUDA_VISIBLE_DEVICES=0
>>
>> $> xargs --null --max-args=1 echo < /proc/19246/environ | grep CUDA_VISIBLE
>> CUDA_VISIBLE_DEVICES=0
>>
>>
>> So this is only a way to see how MANY devices were used, not which.
>>
>>
>> Best
>> Marcus
>>
>> On 10.06.2020 at 20:49, David Braun wrote:
>>> Hi Kota,
>>>
>>> This is from the job template that I give to my users:
>>>
>>> # Collect some information about the execution environment that may
>>> # be useful should we need to do some debugging.
>>>
>>> echo "CREATING DEBUG DIRECTORY"
>>> echo
>>>
>>> mkdir .debug_info
>>> module list > .debug_info/environ_modules 2>&1
>>> ulimit -a > .debug_info/limits 2>&1
>>> hostname > .debug_info/environ_hostname 2>&1
>>> env |grep SLURM > .debug_info/environ_slurm 2>&1
>>> env |grep OMP |grep -v OMPI > .debug_info/environ_omp 2>&1
>>> env |grep OMPI > .debug_info/environ_openmpi 2>&1
>>> env > .debug_info/environ 2>&1
>>>
>>> if [ -n "${CUDA_VISIBLE_DEVICES+x}" ]; then
>>>           echo "SAVING CUDA ENVIRONMENT"
>>>           echo
>>>           env |grep CUDA > .debug_info/environ_cuda 2>&1
>>> fi
>>>
>>> You could add something like this to one of the SLURM prologs to save
>>> the GPU list of jobs.
>>>
>>> Best,
>>>
>>> David
>>>
>>> On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki
>>> <kota.tsuyuzaki.pc at hco.ntt.co.jp
>>> <mailto:kota.tsuyuzaki.pc at hco.ntt.co.jp>> wrote:
>>>
>>>      Hello Guys,
>>>
>>>      We are running GPU clusters with Slurm and SlurmDBD (version 19.05
>>>      series), and some of the GPUs seem to cause trouble for the jobs
>>>      attached to them. To investigate whether the trouble occurs on the
>>>      same GPUs, I'd like to get the GPU indices of completed jobs.
>>>
>>>      In my understanding, `scontrol show job` can show the indices (as IDX
>>>      in the gres info) but cannot be used for completed jobs. `sacct -j`,
>>>      on the other hand, works for completed jobs but won't print the indices.
>>>
>>>      Is there any way (commands, configurations, etc...) to see the
>>>      allocated GPU indices for completed jobs?
>>>
>>>      Best regards,
>>>
>>>      --------------------------------------------
>>>      露崎　浩太 (Kota Tsuyuzaki)
>>>      kota.tsuyuzaki.pc at hco.ntt.co.jp <mailto:kota.tsuyuzaki.pc at hco.ntt.co.jp>
>>>      NTTソフトウェアイノベーションセンタ
>>>      分散処理基盤技術プロジェクト
>>>      0422-59-2837
>>>      ---------------------------------------------
>>>
>>>
>>>
>>>
>>>
>>

-- 
Dipl.-Inf. Marcus Wagner

IT Center
Gruppe: Systemgruppe Linux
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de

Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ
