[slurm-users] How to view GPU indices of the completed jobs?
Marcus Wagner
wagner at itc.rwth-aachen.de
Tue Jun 16 12:16:35 UTC 2020
Hi David,
if I remember correctly, CUDA_VISIBLE_DEVICES always starts from zero
when you use cgroups. So this is NOT the physical index of the GPU.
Just verified it:
$> nvidia-smi
Tue Jun 16 13:28:47 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
...
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     17269      C   gmx_mpi                                      679MiB |
|    1     19246      C   gmx_mpi                                      513MiB |
+-----------------------------------------------------------------------------+
$> squeue -w nrg04
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
14560009 c18g_low egf5 bk449967 R 1-00:17:48 1 nrg04
14560005 c18g_low egf1 bk449967 R 1-00:20:23 1 nrg04
$> scontrol show job -d 14560005
...
Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
Nodes=nrg04 CPU_IDs=0-23 Mem=93600 GRES_IDX=gpu(IDX:0)
$> scontrol show job -d 14560009
JobId=14560009 JobName=egf5
...
Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
Nodes=nrg04 CPU_IDs=24-47 Mem=93600 GRES_IDX=gpu(IDX:1)
Using the PIDs from the nvidia-smi output:
$> xargs --null --max-args=1 echo < /proc/17269/environ | grep CUDA_VISIBLE
CUDA_VISIBLE_DEVICES=0
$> xargs --null --max-args=1 echo < /proc/19246/environ | grep CUDA_VISIBLE
CUDA_VISIBLE_DEVICES=0
So CUDA_VISIBLE_DEVICES only tells you how MANY devices were used, not which ones.
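As long as the job is still known to slurmctld, the real index can instead be
pulled from the detailed job record; the grep pattern below is just one way to
extract the gpu(IDX:...) field shown above:

$> scontrol show job -d 14560005 | grep -o 'gpu(IDX:[^)]*)'
gpu(IDX:0)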
Best
Marcus
On 10.06.2020 at 20:49, David Braun wrote:
> Hi Kota,
>
> This is from the job template that I give to my users:
>
> # Collect some information about the execution environment that may
> # be useful should we need to do some debugging.
>
> echo "CREATING DEBUG DIRECTORY"
> echo
>
> mkdir .debug_info
> module list > .debug_info/environ_modules 2>&1
> ulimit -a > .debug_info/limits 2>&1
> hostname > .debug_info/environ_hostname 2>&1
> env |grep SLURM > .debug_info/environ_slurm 2>&1
> env |grep OMP |grep -v OMPI > .debug_info/environ_omp 2>&1
> env |grep OMPI > .debug_info/environ_openmpi 2>&1
> env > .debug_info/environ 2>&1
>
> if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then
> echo "SAVING CUDA ENVIRONMENT"
> echo
> env |grep CUDA > .debug_info/environ_cuda 2>&1
> fi
>
> You could add something like this to one of the SLURM prologs to save
> the GPU list of jobs.
>
> Best,
>
> David
>
> On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki
> <kota.tsuyuzaki.pc at hco.ntt.co.jp
> <mailto:kota.tsuyuzaki.pc at hco.ntt.co.jp>> wrote:
>
> Hello Guys,
>
> We are running GPU clusters with Slurm and SlurmDBD (version 19.05
> series) and some of the GPUs seemed to run into trouble with the jobs
> attached to them. To investigate whether the trouble occurred on the same
> GPUs, I'd like to get the GPU indices of the completed jobs.
>
> In my understanding, `scontrol show job` can show the indices (as IDX
> in the gres info) but cannot be used for completed jobs. `sacct -j`
> works for completed jobs but won't print the indices.
>
> Is there any way (commands, configurations, etc...) to see the
> allocated GPU indices for completed jobs?
>
> Best regards,
>
> --------------------------------------------
> 露崎 浩太 (Kota Tsuyuzaki)
> kota.tsuyuzaki.pc at hco.ntt.co.jp <mailto:kota.tsuyuzaki.pc at hco.ntt.co.jp>
> NTT Software Innovation Center
> Distributed Processing Platform Technology Project
> 0422-59-2837
> ---------------------------------------------
>
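A rough sketch of the prolog idea David describes above (untested here; the
log directory and file layout are only examples) could be as simple as:

#!/bin/bash
# Example prolog snippet: record the allocated GRES index of this job while
# the job record is still available from slurmctld. SLURM_JOB_ID is set in
# the prolog environment.
LOGDIR=/var/log/slurm/gpu-index            # example location, adjust to taste
mkdir -p "$LOGDIR"
scontrol show job -d "$SLURM_JOB_ID" \
    | grep -o 'gpu(IDX:[^)]*)' > "$LOGDIR/$SLURM_JOB_ID"

Such a file survives job completion and can later be matched against
nvidia-smi or hardware error reports.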
--
Dipl.-Inf. Marcus Wagner
IT Center
Gruppe: Systemgruppe Linux
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de
Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ