[slurm-users] GRES GPU issues

Michael Di Domenico mdidomenico4 at gmail.com
Tue Dec 4 07:06:08 MST 2018


Unfortunately, someone smarter than me will have to help further.  I'm
not sure I see anything specifically wrong.  The one thing I might try
is backing the software down to a 17.x release series.  I recently
tried 18.x and had some issues.  I can't say whether it'll be any
different, but you might be exposing an undiagnosed bug in the 18.x
branch.
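
One other quick check: see whether an untyped request goes through,
something like

  sbatch --gres=gpu:1 yourscript.sh

(yourscript.sh standing in for your batch file).  If that works, the
problem is more likely in how the typed name (gpu:k20) is matched than
in the GRES setup as a whole.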
On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra <lnicotra at interactions.com> wrote:
>
> Made the change to the gres.conf file on the local server and restarted slurmd there and slurmctld on the master... Unfortunately, same error...
>
> Distributed the corrected gres.conf to all the k20 servers and restarted slurmd and slurmctld...   Still getting the same error...
>
> On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson <bjohanso at psc.edu> wrote:
>>
>> Is that a lowercase k in k20 in the batch script and NodeName line, but an uppercase K in gres.conf?
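>>
>> If so, matching the case might be worth a try; the gres.conf line would become something like this (just a guess at the intended fix, with the type lowercased to match slurm.conf and the batch script):
>>
>> Name=gpu Type=k20 File=/dev/nvidia[0-1] Cores=0,1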
>>
>> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
>>
>> Hi All, I have recently set up a Slurm cluster on my servers and I'm running into an issue when submitting GPU jobs. It has something to do with the gres configuration, but I just can't seem to figure out what is wrong. Non-GPU jobs run fine.
>>
>> The error after submitting a batch job is as follows:
>> sbatch: error: Batch job submission failed: Invalid Trackable RESource (TRES) specification
>>
>> My batch job is as follows:
>> #!/bin/bash
>> #SBATCH --partition=tiger_1   # partition name
>> #SBATCH --gres=gpu:k20:1      # one K20 GPU
>> #SBATCH --gres-flags=enforce-binding  # only use CPUs local to the GPU
>> #SBATCH --time=0:20:00  # wall clock limit
>> #SBATCH --output=gpu-%J.txt
>> #SBATCH --account=lnicotra
>> module load cuda
>> python gpu1
>>
>> Where gpu1 is a GPU test script that runs correctly when invoked directly via python. The tiger_1 partition contains the GPU servers, a mix of 1080GTX and K20 as specified in slurm.conf.
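>>
>> For reference, a quick way to see what a node actually reports to the controller:
>>
>> scontrol show node tiger11 | grep -i gres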
>>
>> I have defined GRES resources in the slurm.conf file:
>> # GPU GRES
>> GresTypes=gpu
>> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
>> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
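>>
>> Since the error mentions TRES, I also wonder whether the typed GRES need to be listed explicitly in slurm.conf's AccountingStorageTRES, along the lines of (untested on my side):
>>
>> AccountingStorageTRES=gres/gpu,gres/gpu:k20,gres/gpu:1080gtx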
>>
>> And I have a local gres.conf on each of the servers containing GPUs...
>> lnicotra@tiger11 ~# cat /etc/slurm/gres.conf
>> # GPU Definitions
>> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
>> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
>>
>> and a similar one on the 1080GTX servers:
>> # GPU Definitions
>> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
>> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
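>>
>> (Per the commented-out lines above, a single shared gres.conf with NodeName= entries should also be possible; I believe it would look like:
>>
>> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
>> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
>> )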
>>
>> The accounting manager (sacctmgr) seems to know about the GPUs...
>> lnicotra@tiger11 ~# sacctmgr show tres
>>     Type            Name     ID
>> -------- --------------- ------
>>      cpu                      1
>>      mem                      2
>>   energy                      3
>>     node                      4
>>  billing                      5
>>       fs            disk      6
>>     vmem                      7
>>    pages                      8
>>     gres             gpu   1001
>>     gres         gpu:k20   1002
>>     gres     gpu:1080gtx   1003
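>>
>> (The controller's view of the configured TRES can also be checked with:
>>
>> scontrol show config | grep -i AccountingStorageTRES
>> )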
>>
>> Can anyone point out what I am missing?
>>
>> Thanks!
>> Lou
>
>
> --
> Lou Nicotra
> IT Systems Engineer - SLT
> Interactions LLC
> o: 908-673-1833
> m: 908-451-6983
> lnicotra at interactions.com
> www.interactions.com


