[slurm-users] How to view GPU indices of the completed jobs?
Kota Tsuyuzaki
kota.tsuyuzaki.pc at hco.ntt.co.jp
Tue Jun 23 02:51:31 UTC 2020
> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
> starts from zero. So this is NOT the index of the GPU.
Thanks. Just FYI, when I tested the environment variables with Slurm 19.05.2 + the proctrack/cgroup configuration, CUDA_VISIBLE_DEVICES appeared to match the indices of the host devices (i.e. it did not start from zero). I'm not sure whether the behavior has changed in newer Slurm versions, though.
I also found that SLURM_JOB_GPUS and GPU_DEVICE_ORDINAL were set in the environment, which can be useful. In my tests, those variables held the same values as CUDA_VISIBLE_DEVICES.
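For example, a minimal, untested sketch that could be appended to the job script (or a task prolog) to keep a per-job record of these variables, so they can be checked after the job completes (the log directory below is just a placeholder):

# Sketch: record the GPU-related environment of this job in a per-job file.
# Which of these variables are actually set depends on the Slurm version and
# the gres configuration, so treat the variable list as an assumption.
GPULOG_DIR="$HOME/.slurm_gpu_logs"   # placeholder location
mkdir -p "$GPULOG_DIR"
{
    echo "job_id=${SLURM_JOB_ID:-unknown} node=$(hostname)"
    env | grep -E '^(CUDA_VISIBLE_DEVICES|SLURM_JOB_GPUS|GPU_DEVICE_ORDINAL)='
} >> "$GPULOG_DIR/${SLURM_JOB_ID:-unknown}.gpus"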
Any advice on what I should look for is always welcome.
Best,
Kota
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Marcus Wagner
> Sent: Tuesday, June 16, 2020 9:17 PM
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] How to view GPU indices of the completed jobs?
>
> Hi David,
>
> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
> starts from zero. So this is NOT the index of the GPU.
>
> Just verified it:
> $> nvidia-smi
> Tue Jun 16 13:28:47 2020
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2      |
> ...
> +-----------------------------------------------------------------------------+
> | Processes:                                                        GPU Memory |
> |  GPU       PID   Type   Process name                              Usage      |
> |=============================================================================|
> |    0     17269      C   gmx_mpi                                       679MiB |
> |    1     19246      C   gmx_mpi                                       513MiB |
> +-----------------------------------------------------------------------------+
>
> $> squeue -w nrg04
>      JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>   14560009  c18g_low     egf5 bk449967  R 1-00:17:48      1 nrg04
>   14560005  c18g_low     egf1 bk449967  R 1-00:20:23      1 nrg04
>
>
> $> scontrol show job -d 14560005
> ...
> Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
> Nodes=nrg04 CPU_IDs=0-23 Mem=93600 GRES_IDX=gpu(IDX:0)
>
> $> scontrol show job -d 14560009
> JobId=14560009 JobName=egf5
> ...
> Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
> Nodes=nrg04 CPU_IDs=24-47 Mem=93600 GRES_IDX=gpu(IDX:1)
>
> From the PIDs in the nvidia-smi output:
>
> $> xargs --null --max-args=1 echo < /proc/17269/environ | grep CUDA_VISIBLE
> CUDA_VISIBLE_DEVICES=0
>
> $> xargs --null --max-args=1 echo < /proc/19246/environ | grep CUDA_VISIBLE
> CUDA_VISIBLE_DEVICES=0
>
>
> So this is only a way to see how MANY devices were used, not which.
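>
> If the question is WHICH physical device a given process runs on, one option
> is to ask nvidia-smi directly by PID and bus ID (a sketch; these query options
> exist in recent drivers, but please check your version):
>
> $> nvidia-smi --query-compute-apps=pid,process_name,gpu_bus_id --format=csv
> $> nvidia-smi --query-gpu=index,pci.bus_id --format=csv
>
> Matching the bus IDs between the two outputs maps each PID to the physical
> GPU index, independent of what CUDA_VISIBLE_DEVICES says inside the cgroup.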
>
>
> Best
> Marcus
>
> Am 10.06.2020 um 20:49 schrieb David Braun:
> > Hi Kota,
> >
> > This is from the job template that I give to my users:
> >
> > # Collect some information about the execution environment that may
> > # be useful should we need to do some debugging.
> >
> > echo "CREATING DEBUG DIRECTORY"
> > echo
> >
> > mkdir .debug_info
> > module list > .debug_info/environ_modules 2>&1
> > ulimit -a > .debug_info/limits 2>&1
> > hostname > .debug_info/environ_hostname 2>&1
> > env |grep SLURM > .debug_info/environ_slurm 2>&1
> > env |grep OMP |grep -v OMPI > .debug_info/environ_omp 2>&1
> > env |grep OMPI > .debug_info/environ_openmpi 2>&1
> > env > .debug_info/environ 2>&1
> >
> > if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then
> > echo "SAVING CUDA ENVIRONMENT"
> > echo
> > env |grep CUDA > .debug_info/environ_cuda 2>&1
> > fi
> >
> > You could add something like this to one of the SLURM prologs to save
> > the GPU list of jobs.
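> >
> > A rough, untested sketch of such a prolog (the log path is a placeholder,
> > and which variables are available in the prolog environment depends on
> > your Slurm version, so check the Prolog/Epilog documentation):
> >
> > #!/bin/bash
> > # Prolog sketch: record the GPU indices Slurm reports for this job so they
> > # can be looked up after the job has completed.
> > GPULOG=/var/log/slurm/job_gpus.log   # placeholder path
> > # While the job exists, 'scontrol show job -d' prints the allocated GPU
> > # indices as e.g. GRES_IDX=gpu(IDX:0); extract that and append it per job.
> > IDX=$(scontrol show job -d "$SLURM_JOB_ID" | grep -o 'IDX:[^)]*')
> > echo "$(date +%FT%T) job=$SLURM_JOB_ID node=$(hostname) gpus=${IDX:-unknown}" >> "$GPULOG"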
> >
> > Best,
> >
> > David
> >
> > On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki
> > <kota.tsuyuzaki.pc at hco.ntt.co.jp> wrote:
> >
> > Hello Guys,
> >
> > We are running GPU clusters with Slurm and SlurmDBD (version 19.05
> > series), and some of the GPUs seem to run into trouble with the jobs
> > attached to them. To investigate whether the trouble keeps happening on
> > the same GPUs, I'd like to get the GPU indices of completed jobs.
> >
> > In my understanding, `scontrol show job` can show the indices (as IDX
> > in the gres info), but it cannot be used for completed jobs. `sacct -j`
> > works for completed jobs but does not print the indices.
> >
> > Is there any way (commands, configurations, etc...) to see the
> > allocated GPU indices for completed jobs?
> >
> > Best regards,
> >
> > --------------------------------------------
> > 露崎 浩太 (Kota Tsuyuzaki)
> > kota.tsuyuzaki.pc at hco.ntt.co.jp
> > NTT Software Innovation Center
> > Distributed Computing Technology Project
> > 0422-59-2837
> > ---------------------------------------------
> >
> >
> >
> >
> >
>
> --
> Dipl.-Inf. Marcus Wagner
>
> IT Center
> Gruppe: Systemgruppe Linux
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wagner at itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>
> Social Media Kanäle des IT Centers:
> https://blog.rwth-aachen.de/itc/
> https://www.facebook.com/itcenterrwth
> https://www.linkedin.com/company/itcenterrwth
> https://twitter.com/ITCenterRWTH
> https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ