[slurm-users] How to view GPU indices of the completed jobs?

Kota Tsuyuzaki kota.tsuyuzaki.pc at hco.ntt.co.jp
Fri Jun 12 02:22:41 UTC 2020


Thank you David! Let me try it.
Thinking about our case, I'll try dumping the debug info somewhere like syslog. Either way, the idea should help improve our system monitoring. Much appreciated.
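As a rough sketch of what I have in mind (the "slurm-debug" syslog tag is just an example):

# e.g. from the job script: dump the Slurm/CUDA environment to syslog
env | grep -E 'SLURM|CUDA' | logger -t slurm-debug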

Best,
Kota 

--------------------------------------------
露崎 浩太 (Kota Tsuyuzaki)
kota.tsuyuzaki.pc at hco.ntt.co.jp
NTT Software Innovation Center
Distributed Processing Platform Technology Project
0422-59-2837
---------------------------------------------

> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of David Braun
> Sent: Thursday, June 11, 2020 3:50 AM
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] How to view GPU indices of the completed jobs?
> 
> Hi Kota,
> 
> This is from the job template that I give to my users:
> 
> # Collect some information about the execution environment that may
> # be useful should we need to do some debugging.
> 
> echo "CREATING DEBUG DIRECTORY"
> echo
> 
> mkdir .debug_info
> module list > .debug_info/environ_modules 2>&1
> ulimit -a > .debug_info/limits 2>&1
> hostname > .debug_info/environ_hostname 2>&1
> env | grep SLURM > .debug_info/environ_slurm 2>&1
> env | grep OMP | grep -v OMPI > .debug_info/environ_omp 2>&1
> env | grep OMPI > .debug_info/environ_openmpi 2>&1
> env > .debug_info/environ 2>&1
> 
> if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then
>         echo "SAVING CUDA ENVIRONMENT"
>         echo
>         env | grep CUDA > .debug_info/environ_cuda 2>&1
> fi
> 
> You could add something like this to one of the Slurm prologs to save the list of GPUs allocated to each job.
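> 
> For example, a minimal prolog sketch along those lines. This assumes SLURM_JOB_GPUS (which holds the
> allocated GPU indices) is set in the prolog environment on your release; check the prolog_epilog docs for
> your version. The "slurm-prolog" syslog tag is arbitrary.
> 
> #!/bin/bash
> # Hypothetical Prolog script, pointed to by Prolog= in slurm.conf.
> if [ -n "${SLURM_JOB_GPUS}" ]; then
>     logger -t slurm-prolog "job=${SLURM_JOB_ID} user=${SLURM_JOB_USER} gpus=${SLURM_JOB_GPUS}"
> fi
> exit 0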
> 
> Best,
> 
> David
> 
> 
> On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki <kota.tsuyuzaki.pc at hco.ntt.co.jp
> <mailto:kota.tsuyuzaki.pc at hco.ntt.co.jp> > wrote:
> 
> 
> 	Hello Guys,
> 
> 	We are running GPU clusters with Slurm and SlurmDBD (version 19.05 series), and some of the GPUs seem to be
> 	causing trouble for the jobs attached to them. To investigate whether the trouble keeps happening on the same
> 	GPUs, I'd like to get the GPU indices of completed jobs.
> 
> 	In my understanding, `scontrol show job` can show the indices (as IDX in the gres info) but cannot be used for
> 	completed jobs. Conversely, `sacct -j` works for completed jobs but won't print the indices.
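> 
> 	Concretely, the difference looks like this (the job ID is illustrative):
> 
> 	# While the job is still known to the controller, the detail view
> 	# shows a per-node GRES line including the IDX field:
> 	scontrol -d show job 12345
> 	# After completion, sacct still reports TRES totals, but those are
> 	# counts (e.g. gres/gpu=2), not indices:
> 	sacct -j 12345 --format=JobID,AllocTRES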
> 
> 	Is there any way (commands, configurations, etc...) to see the allocated GPU indices for completed jobs?
> 
> 	Best regards,
> 
> 	--------------------------------------------
> 	露崎 浩太 (Kota Tsuyuzaki)
> 	kota.tsuyuzaki.pc at hco.ntt.co.jp <mailto:kota.tsuyuzaki.pc at hco.ntt.co.jp>
> 	NTT Software Innovation Center
> 	Distributed Processing Platform Technology Project
> 	0422-59-2837
> 	---------------------------------------------
> 