[slurm-users] GRES GPU issues

Michael Di Domenico mdidomenico4 at gmail.com
Mon Dec 3 09:33:06 MST 2018


Do you get anything additional in the Slurm logs?  Have you tried
adding Gres to DebugFlags?  What version of Slurm are you running?
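
For reference, a minimal sketch of how those details could be collected on the controller host. It assumes the Slurm client commands are on PATH and that you have admin rights; collect_gres_debug is just a name made up for this example.

```shell
#!/bin/bash
# Sketch only: gather the version and switch on GRES debug logging.
# Assumes the Slurm client commands are installed and you have admin rights.
# collect_gres_debug is a hypothetical helper name, not part of Slurm.

collect_gres_debug() {
    # Print the installed Slurm version as reported by the failing command.
    sbatch --version
    # Add the Gres flag to the running slurmctld's debug flags.
    scontrol setdebugflags +Gres
}

if command -v scontrol >/dev/null 2>&1; then
    collect_gres_debug
else
    echo "Slurm client commands not found on PATH" >&2
fi
```

After that, resubmitting the job should leave extra GRES messages in the slurmctld log. To keep the flag across restarts, the slurm.conf equivalent is a DebugFlags=Gres line.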
On Mon, Dec 3, 2018 at 9:18 AM Lou Nicotra <lnicotra at interactions.com> wrote:
>
> Hi All, I have recently set up a Slurm cluster with my servers and I'm running into an issue when submitting GPU jobs. It has something to do with the GRES configuration, but I just can't figure out what is wrong. Non-GPU jobs run fine.
>
> After submitting a batch job, the error is as follows:
> sbatch: error: Batch job submission failed: Invalid Trackable RESource (TRES) specification
>
> My batch job is as follows:
> #!/bin/bash
> #SBATCH --partition=tiger_1   # partition name
> #SBATCH --gres=gpu:k20:1
> #SBATCH --gres-flags=enforce-binding
> #SBATCH --time=0:20:00  # wall clock limit
> #SBATCH --output=gpu-%J.txt
> #SBATCH --account=lnicotra
> module load cuda
> python gpu1
>
> Where gpu1 is a GPU test script that runs correctly when invoked directly via python. The tiger_1 partition has servers with GPUs, a mix of 1080GTX and K20, as specified in slurm.conf.
>
> I have defined GRES resources in the slurm.conf file:
> # GPU GRES
> GresTypes=gpu
> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
>
> And have a local gres.conf on the servers containing GPUs...
> lnicotra@tiger11 ~# cat /etc/slurm/gres.conf
> # GPU Definitions
> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
>
> and a similar one on the 1080GTX servers:
> # GPU Definitions
> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
>
> The account manager seems to know about the GPUs...
> lnicotra@tiger11 ~# sacctmgr show tres
>     Type            Name     ID
> -------- --------------- ------
>      cpu                      1
>      mem                      2
>   energy                      3
>     node                      4
>  billing                      5
>       fs            disk      6
>     vmem                      7
>    pages                      8
>     gres             gpu   1001
>     gres         gpu:k20   1002
>     gres     gpu:1080gtx   1003
>
> Can anyone point out what I am missing?
>
> Thanks!
> Lou
>


