[slurm-users] GRES GPU issues

Brian W. Johanson bjohanso at psc.edu
Mon Dec 3 14:00:44 MST 2018


Is that a lowercase k in k20 specified in the batch script and NodeName, and an
uppercase K specified in gres.conf?
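
If that is the case, one possible fix (a sketch, assuming the K20 nodes and the
device paths shown below) is to make the Type in gres.conf use the same
lowercase spelling as slurm.conf and the --gres request, since mismatched case
in the type name is a common cause of "Invalid Trackable RESource (TRES)"
errors:

# /etc/slurm/gres.conf on the K20 nodes -- Type spelled exactly as the
# "k20" used in slurm.conf and in the sbatch --gres=gpu:k20:1 request
Name=gpu Type=k20 File=/dev/nvidia[0-1] Cores=0,1

After changing it, restarting slurmd on the nodes and slurmctld on the
controller, something like `scontrol show node tiger11` should then report the
Gres as gpu:k20:2.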

On 12/03/2018 09:13 AM, Lou Nicotra wrote:
> Hi All, I have recently set up a Slurm cluster on my servers and I'm running 
> into an issue while submitting GPU jobs. It has something to do with gres 
> configurations, but I just can't seem to figure out what is wrong. Non-GPU 
> jobs run fine.
>
> The error is as follows:
> sbatch: error: Batch job submission failed: Invalid Trackable RESource (TRES) 
> specification after submitting a batch job.
>
> My batch job is as follows:
> #!/bin/bash
> #SBATCH --partition=tiger_1   # partition name
> #SBATCH --gres=gpu:k20:1
> #SBATCH --gres-flags=enforce-binding
> #SBATCH --time=0:20:00  # wall clock limit
> #SBATCH --output=gpu-%J.txt
> #SBATCH --account=lnicotra
> module load cuda
> python gpu1
>
> Where gpu1 is a GPU test script that runs correctly when invoked directly via 
> python. The tiger_1 partition has servers with GPUs, a mix of 1080GTX and K20, 
> as specified in slurm.conf.
>
> I have defined GRES resources in the slurm.conf file:
> # GPU GRES
> GresTypes=gpu
> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
>
> And have a local gres.conf on the servers containing GPUs...
> lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
> # GPU Definitions
> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
>
> and a similar one for the 1080GTX
> # GPU Definitions
> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
>
> The account manager seems to know about the GPUs...
> lnicotra at tiger11 ~# sacctmgr show tres
>     Type            Name     ID
> -------- --------------- ------
>      cpu                      1
>      mem                      2
>   energy                      3
>     node                      4
>  billing                      5
>       fs            disk      6
>     vmem                      7
>    pages                      8
>     gres             gpu   1001
>     gres         gpu:k20   1002
>     gres     gpu:1080gtx   1003
>
> Can anyone point out what I am missing?
>
> Thanks!
> Lou
>
>
> -- 
>
> *Lou Nicotra*
>
> IT Systems Engineer - SLT
>
> Interactions LLC
>
> o: 908-673-1833
>
> m: 908-451-6983
>
> lnicotra at interactions.com
>
> www.interactions.com
>
>
