[slurm-users] How to view GPU indices of the completed jobs?
Kota Tsuyuzaki
kota.tsuyuzaki.pc at hco.ntt.co.jp
Tue Jun 23 02:51:31 UTC 2020
> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
> starts from zero. So this is NOT the index of the GPU.
Thanks. Just FYI, when I tested the environment variables with Slurm 19.05.2 + the proctrack/cgroup configuration, CUDA_VISIBLE_DEVICES appeared to match the indices of the host devices (i.e. it did not start from zero). I'm not sure whether the behavior has changed in newer Slurm versions, though.
I also found that SLURM_JOB_GPUS and GPU_DEVICE_ORDINAL were set in the environment, which can be useful. In my tests, those variables held the same values as CUDA_VISIBLE_DEVICES.
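For example, a minimal, untested sketch that could be appended to the job script (or a task prolog) to keep a per-job record of these variables, so they can be checked after the job completes (the log directory below is just a placeholder):

# Sketch: record the GPU-related environment of this job in a per-job file.
# Which of these variables are actually set depends on the Slurm version and
# the gres configuration, so treat the variable list as an assumption.
GPULOG_DIR="$HOME/.slurm_gpu_logs"   # placeholder location
mkdir -p "$GPULOG_DIR"
{
    echo "job_id=${SLURM_JOB_ID:-unknown} node=$(hostname)"
    env | grep -E '^(CUDA_VISIBLE_DEVICES|SLURM_JOB_GPUS|GPU_DEVICE_ORDINAL)='
} >> "$GPULOG_DIR/${SLURM_JOB_ID:-unknown}.gpus"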
Any advice on what I should look for is always welcome.
Best,
Kota
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Marcus Wagner
> Sent: Tuesday, June 16, 2020 9:17 PM
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] How to view GPU indices of the completed jobs?
>
> Hi David,
>
> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
> starts from zero. So this is NOT the index of the GPU.
>
> Just verified it:
> $> nvidia-smi
> Tue Jun 16 13:28:47 2020
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2      |
> ...
> +-----------------------------------------------------------------------------+
> | Processes:                                                        GPU Memory |
> |  GPU       PID   Type   Process name                              Usage      |
> |=============================================================================|
> |    0     17269      C   gmx_mpi                                       679MiB |
> |    1     19246      C   gmx_mpi                                       513MiB |
> +-----------------------------------------------------------------------------+
>
> $> squeue -w nrg04
>      JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>   14560009  c18g_low     egf5 bk449967  R 1-00:17:48      1 nrg04
>   14560005  c18g_low     egf1 bk449967  R 1-00:20:23      1 nrg04
>
>
> $> scontrol show job -d 14560005
> ...
> Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
> Nodes=nrg04 CPU_IDs=0-23 Mem=93600 GRES_IDX=gpu(IDX:0)
>
> $> scontrol show job -d 14560009
> JobId=14560009 JobName=egf5
> ...
> Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
> Nodes=nrg04 CPU_IDs=24-47 Mem=93600 GRES_IDX=gpu(IDX:1)
>
> From the PIDs in the nvidia-smi output:
>
> $> xargs --null --max-args=1 echo < /proc/17269/environ | grep CUDA_VISIBLE
> CUDA_VISIBLE_DEVICES=0
>
> $> xargs --null --max-args=1 echo < /proc/19246/environ | grep CUDA_VISIBLE
> CUDA_VISIBLE_DEVICES=0
>
>
> So this is only a way to see how MANY devices were used, not which.
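>
> If the question is WHICH physical device a given process runs on, one option
> is to ask nvidia-smi directly by PID and bus ID (a sketch; these query options
> exist in recent drivers, but please check your version):
>
> $> nvidia-smi --query-compute-apps=pid,process_name,gpu_bus_id --format=csv
> $> nvidia-smi --query-gpu=index,pci.bus_id --format=csv
>
> Matching the bus IDs between the two outputs maps each PID to the physical
> GPU index, independent of what CUDA_VISIBLE_DEVICES says inside the cgroup.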
>
>
> Best
> Marcus
>
> Am 10.06.2020 um 20:49 schrieb David Braun:
> > Hi Kota,
> >
> > This is from the job template that I give to my users:
> >
> > # Collect some information about the execution environment that may
> > # be useful should we need to do some debugging.
> >
> > echo "CREATING DEBUG DIRECTORY"
> > echo
> >
> > mkdir .debug_info
> > module list > .debug_info/environ_modules 2>&1
> > ulimit -a > .debug_info/limits 2>&1
> > hostname > .debug_info/environ_hostname 2>&1
> > env |grep SLURM > .debug_info/environ_slurm 2>&1
> > env |grep OMP |grep -v OMPI > .debug_info/environ_omp 2>&1
> > env |grep OMPI > .debug_info/environ_openmpi 2>&1
> > env > .debug_info/environ 2>&1
> >
> > if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then
> > echo "SAVING CUDA ENVIRONMENT"
> > echo
> > env |grep CUDA > .debug_info/environ_cuda 2>&1
> > fi
> >
> > You could add something like this to one of the SLURM prologs to save
> > the GPU list of jobs.
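> >
> > A rough, untested sketch of such a prolog (the log path is a placeholder,
> > and which variables are available in the prolog environment depends on
> > your Slurm version, so check the Prolog/Epilog documentation):
> >
> > #!/bin/bash
> > # Prolog sketch: record the GPU indices Slurm reports for this job so they
> > # can be looked up after the job has completed.
> > GPULOG=/var/log/slurm/job_gpus.log   # placeholder path
> > # While the job exists, 'scontrol show job -d' prints the allocated GPU
> > # indices as e.g. GRES_IDX=gpu(IDX:0); extract that and append it per job.
> > IDX=$(scontrol show job -d "$SLURM_JOB_ID" | grep -o 'IDX:[^)]*')
> > echo "$(date +%FT%T) job=$SLURM_JOB_ID node=$(hostname) gpus=${IDX:-unknown}" >> "$GPULOG"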
> >
> > Best,
> >
> > David
> >
> > On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki
> > <kota.tsuyuzaki.pc at hco.ntt.co.jp> wrote:
> >
> > Hello Guys,
> >
> > We are running GPU clusters with Slurm and SlurmDBD (version 19.05
> > series), and some of the GPUs seem to run into trouble with the jobs
> > attached to them. To investigate whether the trouble keeps happening on
> > the same GPUs, I'd like to get the GPU indices of completed jobs.
> >
> > In my understanding, `scontrol show job` can show the indices (as IDX
> > in the gres info), but it cannot be used for completed jobs. `sacct -j`
> > works for completed jobs but does not print the indices.
> >
> > Is there any way (commands, configurations, etc...) to see the
> > allocated GPU indices for completed jobs?
> >
> > Best regards,
> >
> > --------------------------------------------
> > 露崎 浩太 (Kota Tsuyuzaki)
> > kota.tsuyuzaki.pc at hco.ntt.co.jp
> > NTT Software Innovation Center
> > Distributed Computing Technology Project
> > 0422-59-2837
> > ---------------------------------------------
> >
> >
> >
> >
> >
>
> --
> Dipl.-Inf. Marcus Wagner
>
> IT Center
> Gruppe: Systemgruppe Linux
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wagner at itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>
> Social Media Kanäle des IT Centers:
> https://blog.rwth-aachen.de/itc/
> https://www.facebook.com/itcenterrwth
> https://www.linkedin.com/company/itcenterrwth
> https://twitter.com/ITCenterRWTH
> https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ