[slurm-users] GRES GPU issues

Brian W. Johanson bjohanso at psc.edu
Tue Dec 4 07:26:21 MST 2018


As Michael suggested earlier, setting DebugFlags=Gres will give you detailed output 
of the GRES being reported by the nodes.  This appears in the slurmctld log.
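(For reference, a sketch of the two ways to enable this; the log path shown is an example and varies by site:)

```
# In slurm.conf, then restart or reconfigure slurmctld:
DebugFlags=Gres

# Or at run time, without editing slurm.conf:
#   scontrol setdebugflags +gres
#
# The detailed GRES output then shows up in the slurmctld log,
# e.g. /var/log/slurm/slurmctld.log (path varies by site).
```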

Or, show us the output of 'scontrol show node=tiger[01-02]' and 
'scontrol show partition=tiger_1'.
 From your previous message, those should be a node with a 1080GTX, a node with a 
K20, and the partition you are submitting to.
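(A sketch of the fields worth checking in that output; the values shown are illustrative, not taken from this thread:)

```
$ scontrol show node=tiger02
NodeName=tiger02 ...
   Gres=gpu:k20:2          <- should match the Gres= line in slurm.conf
   ...
$ scontrol show partition=tiger_1
PartitionName=tiger_1 ...
   Nodes=tiger[01-22] ...  <- should include the GPU nodes
```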

-b

On 12/04/2018 09:06 AM, Michael Di Domenico wrote:
> Unfortunately, someone smarter than me will have to help further.  I'm
> not sure I see anything specifically wrong.  The one thing I might try
> is backing the software down to a 17.x release series.  I recently
> tried 18.x and had some issues.  I can't say whether it'll be any
> different, but you might be exposing an undiagnosed bug in the 18.x
> branch.
> On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra <lnicotra at interactions.com> wrote:
>> Made the change in the gres.conf file on the local server and restarted slurmd and slurmctld on the master.... Unfortunately, same error...
>>
>> Distributed the corrected gres.conf to all k20 servers and restarted slurmd and slurmctld...   Still getting the same error...
>>
>> On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson <bjohanso at psc.edu> wrote:
>>> Is that a lowercase k in k20 specified in the batch script and NodeName, and an uppercase K specified in gres.conf?
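(GRES type names are case-sensitive, so a mismatch like this is easy to check for locally. A minimal sketch; the helper name and the file paths in the example call are assumptions, not from this thread:)

```shell
# Sketch: check that the GRES type spelled in gres.conf matches the
# type the batch script requests. Type names are case-sensitive, so
# a job asking for "k20" will not match "Type=K20" in gres.conf.

check_gres_case() {
    conf="$1"      # path to gres.conf
    script="$2"    # path to the sbatch script
    # Pull the first Type=... value out of gres.conf
    conf_type=$(sed -n 's/.*Type=\([^ ]*\).*/\1/p' "$conf" | head -n 1)
    # Pull the type out of the job's --gres=gpu:<type>:<count> request
    req_type=$(sed -n 's/.*--gres=gpu:\([^:]*\):.*/\1/p' "$script" | head -n 1)
    if [ "$conf_type" = "$req_type" ]; then
        echo "match: $conf_type"
    else
        echo "mismatch: gres.conf has '$conf_type', job requests '$req_type'"
    fi
}

# Example (paths are illustrative):
#   check_gres_case /etc/slurm/gres.conf job.sbatch
```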
>>>
>>> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
>>>
>>> Hi All, I have recently set up a Slurm cluster with my servers and I'm running into an issue while submitting GPU jobs. It has something to do with the gres configuration, but I just can't seem to figure out what is wrong. Non-GPU jobs run fine.
>>>
>>> The error after submitting a batch job is as follows:
>>> sbatch: error: Batch job submission failed: Invalid Trackable RESource (TRES) specification
>>>
>>> My batch job is as follows:
>>> #!/bin/bash
>>> #SBATCH --partition=tiger_1   # partition name
>>> #SBATCH --gres=gpu:k20:1
>>> #SBATCH --gres-flags=enforce-binding
>>> #SBATCH --time=0:20:00  # wall clock limit
>>> #SBATCH --output=gpu-%J.txt
>>> #SBATCH --account=lnicotra
>>> module load cuda
>>> python gpu1
>>>
>>> Where gpu1 is a GPU test script that runs correctly when invoked directly via python. The tiger_1 partition has servers with a mix of 1080GTX and K20 GPUs, as specified in slurm.conf.
>>>
>>> I have defined GRES resources in the slurm.conf file:
>>> # GPU GRES
>>> GresTypes=gpu
>>> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
>>> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
>>>
>>> And have a local gres.conf on the servers containing GPUs...
>>> lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
>>> # GPU Definitions
>>> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
>>> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
>>>
>>> and a similar one for the 1080GTX
>>> # GPU Definitions
>>> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
>>> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
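(As an aside, gres.conf also accepts NodeName= lines, so a single shared file could cover both node types instead of per-host copies. A sketch, with the Type spelling matched to the lowercase names in slurm.conf; this assumes the same device layout shown above:)

```
# Shared /etc/slurm/gres.conf (sketch; Type is case-sensitive and
# must match the names used in slurm.conf and in --gres requests)
NodeName=tiger[01,05,10,15,20]                Name=gpu Type=1080gtx File=/dev/nvidia[0-1] Cores=0,1
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20     File=/dev/nvidia[0-1] Cores=0,1
```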
>>>
>>> The account manager seems to know about the GPUs...
>>> lnicotra at tiger11 ~# sacctmgr show tres
>>>      Type            Name     ID
>>> -------- --------------- ------
>>>       cpu                      1
>>>       mem                      2
>>>    energy                      3
>>>      node                      4
>>>   billing                      5
>>>        fs            disk      6
>>>      vmem                      7
>>>     pages                      8
>>>      gres             gpu   1001
>>>      gres         gpu:k20   1002
>>>      gres     gpu:1080gtx   1003
>>>
>>> Can anyone point out what am I missing?
>>>
>>> Thanks!
>>> Lou
>>>
>>>
>>> --
>>>
>>> Lou Nicotra
>>>
>>> IT Systems Engineer - SLT
>>>
>>> Interactions LLC
>>>
>>> o:  908-673-1833
>>>
>>> m: 908-451-6983
>>>
>>> lnicotra at interactions.com
>>>
>>> www.interactions.com
>>>
>>> *******************************************************************************
>>>
>>> This e-mail and any of its attachments may contain Interactions LLC proprietary information, which is privileged, confidential, or subject to copyright belonging to the Interactions LLC. This e-mail is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient of this e-mail, you are hereby notified that any dissemination, distribution, copying, or action taken in relation to the contents of and attachments to this e-mail is strictly prohibited and may be unlawful. If you have received this e-mail in error, please notify the sender immediately and permanently delete the original and any copy of this e-mail and any printout. Thank You.
>>>
>>> *******************************************************************************
>>>
>>>
>>




More information about the slurm-users mailing list