[slurm-users] How to view GPU indices of the completed jobs?
Marcus Wagner
wagner at itc.rwth-aachen.de
Wed Jun 24 05:05:53 UTC 2020
Hi Taras,
No, we have set ConstrainDevices to "yes", and that is exactly why
CUDA_VISIBLE_DEVICES starts from zero.
Otherwise the two jobs mentioned below would have landed on the same
GPU. But as nvidia-smi clearly shows (output omitted this time, see my
earlier post), both GPUs are in use, even though the environment of both
jobs contains CUDA_VISIBLE_DEVICES=0.
Kota, might it be that you did not configure ConstrainDevices in
cgroup.conf? The default is "no" according to the manpage. In that case
a user could set CUDA_VISIBLE_DEVICES in his job and thereby use GPUs
he did not request.
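For reference, a minimal cgroup.conf sketch (the lines around
ConstrainDevices are just a typical example, not our exact file):

   CgroupAutomount=yes
   ConstrainCores=yes
   ConstrainRAMSpace=yes
   # without the next line, jobs can see (and address) every GPU in the node
   ConstrainDevices=yes
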
Best
Marcus
On 23.06.2020 at 15:41, Taras Shapovalov wrote:
> Hi Marcus,
>
> This may depend on ConstrainDevices in cgroup.conf. I guess it is set
> to "no" in your case.
>
> Best regards,
> Taras
>
> On Tue, Jun 23, 2020 at 4:02 PM Marcus Wagner
> <wagner at itc.rwth-aachen.de> wrote:
>
> Hi Kota,
>
> thanks for the hint.
>
> Yet I'm still a little astonished, since if I remember right,
> CUDA_VISIBLE_DEVICES inside a cgroup always starts from zero. That was
> already the case years ago, when we were still using LSF.
>
> But SLURM_JOB_GPUS seems to be the right thing:
>
> Same node, two different users (and therefore two different jobs):
>
>
> $> xargs --null --max-args=1 echo < /proc/32719/environ | egrep "GPU|CUDA"
> SLURM_JOB_GPUS=0
> CUDA_VISIBLE_DEVICES=0
> GPU_DEVICE_ORDINAL=0
>
> $> xargs --null --max-args=1 echo < /proc/109479/environ | egrep "GPU|CUDA"
> SLURM_MEM_PER_GPU=6144
> SLURM_JOB_GPUS=1
> CUDA_VISIBLE_DEVICES=0
> GPU_DEVICE_ORDINAL=0
> CUDA_ROOT=/usr/local_rwth/sw/cuda/10.1.243
> CUDA_PATH=/usr/local_rwth/sw/cuda/10.1.243
> CUDA_VERSION=101
>
> SLURM_JOB_GPUS differs, just like the GRES_IDX that scontrol reports:
>
> $> scontrol show -d job 14658274
> ...
> Nodes=nrg02 CPU_IDs=24 Mem=8192 GRES_IDX=gpu:volta(IDX:1)
>
> $> scontrol show -d job 14673550
> ...
> Nodes=nrg02 CPU_IDs=0 Mem=8192 GRES_IDX=gpu:volta(IDX:0)
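>
> So for running jobs the mapping is visible; to keep it around for
> completed jobs, one could snapshot it periodically. A rough sketch,
> nothing more (the squeue/scontrol flags are standard):
>
> $> for j in $(squeue -h -t R -o %A); do
>      scontrol show -d job "$j" | grep -o "GRES_IDX=[^ ]*" | sed "s/^/job $j: /"
>    done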
>
>
>
> Is there anyone out there who can confirm this besides me?
>
>
> Best
> Marcus
>
>
> On 23.06.2020 at 04:51, Kota Tsuyuzaki wrote:
> >> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
> >> starts from zero. So this is NOT the index of the GPU.
> >
> > Thanks. Just FYI, when I tested the environment variables with
> > Slurm 19.05.2 + proctrack/cgroup configuration, it looked like
> > CUDA_VISIBLE_DEVICES matched the indices of the host devices (i.e. it
> > did not start from zero). I'm not sure whether the behavior has
> > changed in newer Slurm versions, though.
> >
> > I also found that SLURM_JOB_GPUS and GPU_DEVICE_ORDINAL were set in
> > the environment, which can be useful. In my current tests those
> > variables had the same values as CUDA_VISIBLE_DEVICES.
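> >
> > For example, all three can be printed from inside a job step (an
> > illustrative one-liner; only the --gres flag matters here):
> >
> > $> srun --gres=gpu:1 bash -c 'echo $SLURM_JOB_GPUS $CUDA_VISIBLE_DEVICES $GPU_DEVICE_ORDINAL'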
> >
> > Any advice on what I should look for is always welcome.
> >
> > Best,
> > Kota
> >
> >> -----Original Message-----
> >> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Marcus Wagner
> >> Sent: Tuesday, June 16, 2020 9:17 PM
> >> To: slurm-users at lists.schedmd.com
> >> Subject: Re: [slurm-users] How to view GPU indices of the completed jobs?
> >>
> >> Hi David,
> >>
> >> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
> >> starts from zero. So this is NOT the index of the GPU.
> >>
> >> Just verified it:
> >> $> nvidia-smi
> >> Tue Jun 16 13:28:47 2020
> >> +-----------------------------------------------------------------------------+
> >> | NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
> >> ...
> >> +-----------------------------------------------------------------------------+
> >> | Processes:                                                       GPU Memory |
> >> |  GPU       PID   Type   Process name                             Usage      |
> >> |=============================================================================|
> >> |    0     17269      C   gmx_mpi                                      679MiB |
> >> |    1     19246      C   gmx_mpi                                      513MiB |
> >> +-----------------------------------------------------------------------------+
> >>
> >> $> squeue -w nrg04
> >>      JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >>   14560009  c18g_low     egf5 bk449967  R 1-00:17:48      1 nrg04
> >>   14560005  c18g_low     egf1 bk449967  R 1-00:20:23      1 nrg04
> >>
> >>
> >> $> scontrol show job -d 14560005
> >> ...
> >> Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
> >> Nodes=nrg04 CPU_IDs=0-23 Mem=93600 GRES_IDX=gpu(IDX:0)
> >>
> >> $> scontrol show job -d 14560009
> >> JobId=14560009 JobName=egf5
> >> ...
> >> Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
> >> Nodes=nrg04 CPU_IDs=24-47 Mem=93600 GRES_IDX=gpu(IDX:1)
> >>
> >> From the PIDs in the nvidia-smi output:
> >>
> >> $> xargs --null --max-args=1 echo < /proc/17269/environ | grep CUDA_VISIBLE
> >> CUDA_VISIBLE_DEVICES=0
> >>
> >> $> xargs --null --max-args=1 echo < /proc/19246/environ | grep CUDA_VISIBLE
> >> CUDA_VISIBLE_DEVICES=0
> >>
> >>
> >> So this is only a way to see how MANY devices were used, not which.
> >>
> >>
> >> Best
> >> Marcus
> >>
> >> On 10.06.2020 at 20:49, David Braun wrote:
> >>> Hi Kota,
> >>>
> >>> This is from the job template that I give to my users:
> >>>
> >>> # Collect some information about the execution environment that may
> >>> # be useful should we need to do some debugging.
> >>>
> >>> echo "CREATING DEBUG DIRECTORY"
> >>> echo
> >>>
> >>> mkdir .debug_info
> >>> module list > .debug_info/environ_modules 2>&1
> >>> ulimit -a > .debug_info/limits 2>&1
> >>> hostname > .debug_info/environ_hostname 2>&1
> >>> env |grep SLURM > .debug_info/environ_slurm 2>&1
> >>> env |grep OMP |grep -v OMPI > .debug_info/environ_omp 2>&1
> >>> env |grep OMPI > .debug_info/environ_openmpi 2>&1
> >>> env > .debug_info/environ 2>&1
> >>>
> >>> if [ -n "${CUDA_VISIBLE_DEVICES+x}" ]; then
> >>> echo "SAVING CUDA ENVIRONMENT"
> >>> echo
> >>> env |grep CUDA > .debug_info/environ_cuda 2>&1
> >>> fi
> >>>
> >>> You could add something like this to one of the SLURM prologs to
> >>> save the GPU list of jobs.
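> >>>
> >>> For example, a minimal prolog sketch (untested; the log path is made
> >>> up, and whether SLURM_JOB_GPUS is exported to your prolog depends on
> >>> the Slurm version -- check the Prolog section of the slurm.conf man
> >>> page):
> >>>
> >>> #!/bin/bash
> >>> # append one line per job with the GPU indices it was allocated
> >>> LOG=/var/log/slurm/job_gpus.log
> >>> if [ -n "$SLURM_JOB_GPUS" ]; then
> >>>     echo "$(date +%FT%T) job=$SLURM_JOB_ID node=$(hostname -s) gpus=$SLURM_JOB_GPUS" >> "$LOG"
> >>> fi
> >>> exit 0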
> >>>
> >>> Best,
> >>>
> >>> David
> >>>
> >>> On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki
> >>> <kota.tsuyuzaki.pc at hco.ntt.co.jp> wrote:
> >>>
> >>> Hello Guys,
> >>>
> >>> We are running GPU clusters with Slurm and SlurmDBD (version 19.05
> >>> series), and some of the GPUs seem to run into trouble with the
> >>> jobs attached to them. To investigate whether the troubles happen
> >>> on the same GPUs, I'd like to get the GPU indices of completed jobs.
> >>>
> >>> In my understanding, `scontrol show job` can show the indices (as
> >>> IDX in the gres info) but cannot be used for completed jobs. And
> >>> `sacct -j` works for completed jobs but won't print the indices.
> >>>
> >>> Is there any way (commands, configurations, etc...) to see the
> >>> allocated GPU indices for completed jobs?
> >>>
> >>> Best regards,
> >>>
> >>> --------------------------------------------
> >>> Kota Tsuyuzaki (露崎 浩太)
> >>> kota.tsuyuzaki.pc at hco.ntt.co.jp
> >>> NTT Software Innovation Center
> >>> Distributed Computing Technology Project
> >>> 0422-59-2837
> >>> ---------------------------------------------
>
--
Dipl.-Inf. Marcus Wagner
IT Center
Group: Systemgruppe Linux
Department: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de
Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ