[slurm-users] How to view GPU indices of the completed jobs?
Marcus Wagner
wagner at itc.rwth-aachen.de
Wed Jun 24 05:05:53 UTC 2020

Hi Taras,
No, we have set ConstrainDevices to "yes",
and that is why CUDA_VISIBLE_DEVICES starts from zero.
Otherwise both of the jobs mentioned below would have run on the same GPU.
But as nvidia-smi clearly shows (I did not include the output this time,
see my earlier post), both GPUs are in use, although the environment of
both jobs includes CUDA_VISIBLE_DEVICES=0.
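For anyone who wants to repeat the comparison, here is a minimal sketch of the /proc/<PID>/environ check from my earlier post, wrapped in a function. The function name gpu_env_from_file is made up for this example, and the three variable names are simply the ones that showed up in the outputs quoted below. Other setups may export different ones.

```sh
#!/bin/sh
# Print the GPU-related variables from a NUL-separated environment dump,
# such as /proc/<PID>/environ. The function name is hypothetical.
gpu_env_from_file() {
    # tr turns the NUL separators into newlines so grep can match per variable.
    tr '\0' '\n' < "$1" |
        grep -E '^(SLURM_JOB_GPUS|CUDA_VISIBLE_DEVICES|GPU_DEVICE_ORDINAL)='
}

# Usage (needs permission to read the target process's environ):
#   gpu_env_from_file /proc/32719/environ
```

If SLURM_JOB_GPUS differs between two jobs on the same node while CUDA_VISIBLE_DEVICES is 0 in both, the jobs are most likely confined to different devices.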
Kota, could it be that you did not configure ConstrainDevices in
cgroup.conf? According to the man page, the default is "no".
In that case a user could set CUDA_VISIBLE_DEVICES in their job and
thereby use GPUs they did not request.
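A quick way to check the setting on a node could look like the sketch below. It rests on several assumptions: the path /etc/slurm/cgroup.conf depends on the installation, the function name constrain_devices_setting is made up, only plain "Key=Value" lines are handled, and a missing entry is reported as "no" because that is the default mentioned above.

```sh
#!/bin/sh
# Report the effective ConstrainDevices value from a cgroup.conf file.
# The default path is an assumption; pass the real one as first argument.
constrain_devices_setting() {
    conf="${1:-/etc/slurm/cgroup.conf}"
    awk -F= '
        /^[ \t]*#/ { next }                    # skip comment lines
        tolower($1) ~ /^[ \t]*constraindevices[ \t]*$/ {
            v = $2
            sub(/#.*/, "", v)                  # drop a trailing comment
            gsub(/[ \t\r]/, "", v)             # drop surrounding whitespace
            value = tolower(v)                 # last occurrence wins
        }
        END { print (value == "" ? "no" : value) }
    ' "$conf"
}

# Usage:
#   constrain_devices_setting /etc/slurm/cgroup.conf
```

This only reads the file. It does not tell you whether slurmd has actually picked up the setting.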
Best
Marcus
On 23.06.2020 at 15:41, Taras Shapovalov wrote:
> Hi Marcus,
> 
> This may depend on ConstrainDevices in cgroup.conf. I guess it is set 
> to "no" in your case.
> 
> Best regards,
> Taras
> 
> On Tue, Jun 23, 2020 at 4:02 PM Marcus Wagner <wagner at itc.rwth-aachen.de> wrote:
> 
>     Hi Kota,
> 
>     thanks for the hint.
> 
>     Yet I'm still a little astonished because, if I remember right,
>     CUDA_VISIBLE_DEVICES in a cgroup always starts from zero. That was
>     already the case years ago, when we still used LSF.
> 
>     But SLURM_JOB_GPUS seems to be the right thing:
> 
>     same node, two different users (and therefore jobs)
> 
> 
>     $> xargs --null --max-args=1 echo < /proc/32719/environ | egrep "GPU|CUDA"
>     SLURM_JOB_GPUS=0
>     CUDA_VISIBLE_DEVICES=0
>     GPU_DEVICE_ORDINAL=0
> 
>     $> xargs --null --max-args=1 echo < /proc/109479/environ | egrep "GPU|CUDA"
>     SLURM_MEM_PER_GPU=6144
>     SLURM_JOB_GPUS=1
>     CUDA_VISIBLE_DEVICES=0
>     GPU_DEVICE_ORDINAL=0
>     CUDA_ROOT=/usr/local_rwth/sw/cuda/10.1.243
>     CUDA_PATH=/usr/local_rwth/sw/cuda/10.1.243
>     CUDA_VERSION=101
> 
>     SLURM_JOB_GPUS differs
> 
>     $> scontrol show -d job 14658274
>     ...
>     Nodes=nrg02 CPU_IDs=24 Mem=8192 GRES_IDX=gpu:volta(IDX:1)
> 
>     $> scontrol show -d job 14673550
>     ...
>     Nodes=nrg02 CPU_IDs=0 Mem=8192 GRES_IDX=gpu:volta(IDX:0)
> 
> 
> 
>     Is there anyone out there, who can confirm this besides me?
> 
> 
>     Best
>     Marcus
> 
> 
>     On 23.06.2020 at 04:51, Kota Tsuyuzaki wrote:
>      >> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
>      >> starts from zero. So this is NOT the index of the GPU.
>      >
>      > Thanks. Just FYI, when I tested the environment variables with
>      > Slurm 19.05.2 + proctrack/cgroup configuration, it looked like
>      > CUDA_VISIBLE_DEVICES matches the indices of the host devices (i.e. it
>      > does not start from zero). I'm not sure whether the behavior has
>      > changed in newer Slurm versions, though.
>      >
>      > I also found that SLURM_JOB_GPUS and GPU_DEVICE_ORDINAL are set
>      > in the environment, which can be useful. In my current tests,
>      > those variables had the same values as CUDA_VISIBLE_DEVICES.
>      >
>      > Any advice on what I should look for is always welcome.
>      >
>      > Best,
>      > Kota
>      >
>      >> -----Original Message-----
>      >> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Marcus Wagner
>      >> Sent: Tuesday, June 16, 2020 9:17 PM
>      >> To: slurm-users at lists.schedmd.com
>      >> Subject: Re: [slurm-users] How to view GPU indices of the completed jobs?
>      >>
>      >> Hi David,
>      >>
>      >> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
>      >> starts from zero. So this is NOT the index of the GPU.
>      >>
>      >> Just verified it:
>      >> $> nvidia-smi
>      >> Tue Jun 16 13:28:47 2020
>      >>
>      >> +-----------------------------------------------------------------------------+
>      >> | NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
>      >> ...
>      >> +-----------------------------------------------------------------------------+
>      >> | Processes:                                                       GPU Memory |
>      >> |  GPU       PID   Type   Process name                             Usage      |
>      >> |=============================================================================|
>      >> |    0     17269      C   gmx_mpi                                      679MiB |
>      >> |    1     19246      C   gmx_mpi                                      513MiB |
>      >> +-----------------------------------------------------------------------------+
>      >>
>      >> $> squeue -w nrg04
>      >>                JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>      >>             14560009  c18g_low     egf5 bk449967  R 1-00:17:48      1 nrg04
>      >>             14560005  c18g_low     egf1 bk449967  R 1-00:20:23      1 nrg04
>      >>
>      >>
>      >> $> scontrol show job -d 14560005
>      >> ...
>      >>      Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
>      >>        Nodes=nrg04 CPU_IDs=0-23 Mem=93600 GRES_IDX=gpu(IDX:0)
>      >>
>      >> $> scontrol show job -d 14560009
>      >> JobId=14560009 JobName=egf5
>      >> ...
>      >>      Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
>      >>        Nodes=nrg04 CPU_IDs=24-47 Mem=93600 GRES_IDX=gpu(IDX:1)
>      >>
>      >> From the PIDs in the nvidia-smi output:
>      >>
>      >> $> xargs --null --max-args=1 echo < /proc/17269/environ | grep CUDA_VISIBLE
>      >> CUDA_VISIBLE_DEVICES=0
>      >>
>      >> $> xargs --null --max-args=1 echo < /proc/19246/environ | grep CUDA_VISIBLE
>      >> CUDA_VISIBLE_DEVICES=0
>      >>
>      >>
>      >> So this is only a way to see how MANY devices were used, not which.
>      >>
>      >>
>      >> Best
>      >> Marcus
>      >>
>      >> On 10.06.2020 at 20:49, David Braun wrote:
>      >>> Hi Kota,
>      >>>
>      >>> This is from the job template that I give to my users:
>      >>>
>      >>> # Collect some information about the execution environment that may
>      >>> # be useful should we need to do some debugging.
>      >>>
>      >>> echo "CREATING DEBUG DIRECTORY"
>      >>> echo
>      >>>
>      >>> mkdir .debug_info
>      >>> module list > .debug_info/environ_modules 2>&1
>      >>> ulimit -a > .debug_info/limits 2>&1
>      >>> hostname > .debug_info/environ_hostname 2>&1
>      >>> env |grep SLURM > .debug_info/environ_slurm 2>&1
>      >>> env |grep OMP |grep -v OMPI > .debug_info/environ_omp 2>&1
>      >>> env |grep OMPI > .debug_info/environ_openmpi 2>&1
>      >>> env > .debug_info/environ 2>&1
>      >>>
>      >>> if [ -n "${CUDA_VISIBLE_DEVICES+x}" ]; then
>      >>>           echo "SAVING CUDA ENVIRONMENT"
>      >>>           echo
>      >>>           env |grep CUDA > .debug_info/environ_cuda 2>&1
>      >>> fi
>      >>>
>      >>> You could add something like this to one of the SLURM prologs to save
>      >>> the GPU list of jobs.
>      >>>
>      >>> Best,
>      >>>
>      >>> David
>      >>>
>      >>> On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki
>      >>> <kota.tsuyuzaki.pc at hco.ntt.co.jp> wrote:
>      >>>
>      >>>      Hello Guys,
>      >>>
>      >>>      We are running GPU clusters with Slurm and SlurmDBD (version
>      >>>      19.05 series), and some of the GPUs seem to cause trouble for
>      >>>      the jobs attached to them. To investigate whether the trouble
>      >>>      occurs on the same GPUs, I'd like to get the GPU indices of
>      >>>      completed jobs.
>      >>>
>      >>>      As I understand it, `scontrol show job` can show the indices
>      >>>      (as IDX in the gres info) but cannot be used for completed
>      >>>      jobs. And `sacct -j` works for completed jobs but won't print
>      >>>      the indices.
>      >>>
>      >>>      Is there any way (commands, configurations, etc.) to see the
>      >>>      allocated GPU indices for completed jobs?
>      >>>
>      >>>      Best regards,
>      >>>
>      >>>      --------------------------------------------
>      >>>      露崎 浩太 (Kota Tsuyuzaki)
>      >>>      kota.tsuyuzaki.pc at hco.ntt.co.jp
>      >>>      NTT Software Innovation Center
>      >>>      Distributed Processing Infrastructure Technology Project
>      >>>      0422-59-2837
>      >>>      ---------------------------------------------
>      >>>
>      >>>
>      >>>
>      >>>
>      >>>
>      >>
>      >> --
>      >> Dipl.-Inf. Marcus Wagner
>      >>
>      >> IT Center
>      >> Group: Linux Systems Group
>      >> Department: Systems and Operations
>      >> RWTH Aachen University
>      >> Seffenter Weg 23
>      >> 52074 Aachen
>      >> Tel: +49 241 80-24383
>      >> Fax: +49 241 80-624383
>      >> wagner at itc.rwth-aachen.de
>      >> www.itc.rwth-aachen.de
>      >>
>      >> Social media channels of the IT Center:
>      >> https://blog.rwth-aachen.de/itc/
>      >> https://www.facebook.com/itcenterrwth
>      >> https://www.linkedin.com/company/itcenterrwth
>      >> https://twitter.com/ITCenterRWTH
>      >> https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ
>      >
>      >
>      >
>      >
> 
> 
-- 
Dipl.-Inf. Marcus Wagner
IT Center
Group: Linux Systems Group
Department: Systems and Operations
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de
Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ