[slurm-users] How to view GPU indices of the completed jobs?

Marcus Wagner wagner at itc.rwth-aachen.de
Tue Jun 16 12:16:35 UTC 2020


Hi David,

If I remember correctly, when you use cgroups, CUDA_VISIBLE_DEVICES always 
starts from zero inside the job. So this is NOT the physical index of the GPU.

Just verified it:
$> nvidia-smi
Tue Jun 16 13:28:47 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
...
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     17269      C   gmx_mpi                                      679MiB |
|    1     19246      C   gmx_mpi                                      513MiB |
+-----------------------------------------------------------------------------+

$> squeue -w nrg04
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          14560009  c18g_low     egf5 bk449967  R 1-00:17:48      1 nrg04
          14560005  c18g_low     egf1 bk449967  R 1-00:20:23      1 nrg04


$> scontrol show job -d 14560005
...
    Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
      Nodes=nrg04 CPU_IDs=0-23 Mem=93600 GRES_IDX=gpu(IDX:0)

$> scontrol show job -d 14560009
JobId=14560009 JobName=egf5
...
    Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
      Nodes=nrg04 CPU_IDs=24-47 Mem=93600 GRES_IDX=gpu(IDX:1)
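
By the way, you can pull just those index lines for every running job in 
one go by letting scontrol dump all jobs and grepping for the two fields 
(output trimmed here to the two jobs above):

$> scontrol show job -d | grep -E 'JobId=|GRES_IDX'
JobId=14560005 JobName=egf1
      Nodes=nrg04 CPU_IDs=0-23 Mem=93600 GRES_IDX=gpu(IDX:0)
JobId=14560009 JobName=egf5
      Nodes=nrg04 CPU_IDs=24-47 Mem=93600 GRES_IDX=gpu(IDX:1)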

Checking the environment of the PIDs from the nvidia-smi output:

$> xargs --null --max-args=1 echo < /proc/17269/environ | grep CUDA_VISIBLE
CUDA_VISIBLE_DEVICES=0

$> xargs --null --max-args=1 echo < /proc/19246/environ | grep CUDA_VISIBLE
CUDA_VISIBLE_DEVICES=0
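
If you want to map those PIDs back to the physical GPUs from the host 
side, nvidia-smi's query mode can print the bus ID per compute process; 
the bus IDs below are only illustrative for this node:

$> nvidia-smi --query-compute-apps=pid,process_name,gpu_bus_id --format=csv
pid, process_name, gpu_bus_id
17269, gmx_mpi, 00000000:3D:00.0
19246, gmx_mpi, 00000000:3E:00.0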


So CUDA_VISIBLE_DEVICES only tells you how MANY devices were used, not WHICH ones.
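
To get at Kota's original question: since scontrol forgets the job after 
completion, the GRES_IDX line could be saved from a prolog while the job 
is still known. A minimal, untested sketch, with a made-up log directory:

#!/bin/bash
# Prolog sketch: record each job's allocated GPU indices so they survive
# job completion.  slurmd sets SLURM_JOB_ID in the Prolog environment.
# /var/log/slurm/gpu_idx is a hypothetical path; pick something writable
# by the slurm user.
scontrol show job -d "$SLURM_JOB_ID" \
    | grep -o 'GRES_IDX=gpu([^)]*)' \
    >> "/var/log/slurm/gpu_idx/job_${SLURM_JOB_ID}.log"

The files can later be joined against sacct output by job ID.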


Best
Marcus

On 10.06.2020 at 20:49, David Braun wrote:
> Hi Kota,
> 
> This is from the job template that I give to my users:
> 
> # Collect some information about the execution environment that may
> # be useful should we need to do some debugging.
> 
> echo "CREATING DEBUG DIRECTORY"
> echo
> 
> mkdir .debug_info
> module list > .debug_info/environ_modules 2>&1
> ulimit -a > .debug_info/limits 2>&1
> hostname > .debug_info/environ_hostname 2>&1
> env | grep SLURM > .debug_info/environ_slurm 2>&1
> env | grep OMP | grep -v OMPI > .debug_info/environ_omp 2>&1
> env | grep OMPI > .debug_info/environ_openmpi 2>&1
> env > .debug_info/environ 2>&1
> 
> if [ -n "${CUDA_VISIBLE_DEVICES+x}" ]; then
>          echo "SAVING CUDA ENVIRONMENT"
>          echo
>          env | grep CUDA > .debug_info/environ_cuda 2>&1
> fi
> 
> You could add something like this to one of the SLURM prologs to save 
> each job's GPU list.
> 
> Best,
> 
> David
> 
> On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki 
> <kota.tsuyuzaki.pc at hco.ntt.co.jp> wrote:
> 
>     Hello Guys,
> 
>     We are running GPU clusters with Slurm and SlurmDBD (version 19.05
>     series), and some of the GPUs seem to be causing trouble for the
>     jobs attached to them. To investigate whether the trouble keeps
>     occurring on the same GPUs, I'd like to get the GPU indices of
>     completed jobs.
> 
>     In my understanding, `scontrol show job` can show the indices (as IDX
>     in the gres info) but cannot be used for completed jobs. And
>     `sacct -j` works for completed jobs but won't print the indices.
> 
>     Is there any way (commands, configurations, etc...) to see the
>     allocated GPU indices for completed jobs?
> 
>     Best regards,
> 
>     --------------------------------------------
>     Kota Tsuyuzaki (露崎 浩太)
>     kota.tsuyuzaki.pc at hco.ntt.co.jp
>     NTT Software Innovation Center
>     Distributed Computing Technology Project
>     0422-59-2837
>     ---------------------------------------------

-- 
Dipl.-Inf. Marcus Wagner

IT Center
Gruppe: Systemgruppe Linux
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de

Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ
