[slurm-users] GRES GPU issues
Tina Friedrich
tina.friedrich at it.ox.ac.uk
Wed Dec 5 02:55:56 MST 2018
I'm running 18.08.3, and I have a fair number of GPU GRES resources -
recently upgraded to 18.08.3 from a 17.x release. So it's definitely not
as if GRES doesn't work in an 18.x release. (I don't distribute the same
gres.conf file everywhere, though - I've never tried that.)
Just a really stupid question - the /dev/nvidiaX devices do exist, I
assume? You are running nvidia-persistenced (or something similar) to
ensure the cards are up & the device files initialised etc?
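
For example, something along these lines on each GPU node should confirm
the device files exist and keep the cards initialised (the exact service
name depends on how the NVIDIA driver was packaged on your systems):

ls -l /dev/nvidia*
systemctl enable --now nvidia-persistenced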
Tina
On 04/12/2018 23:36, Brian W. Johanson wrote:
> Only thing to suggest once again is increasing the logging of both
> slurmctld and slurmd.
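> (For example, something like SlurmctldDebug=debug2 and SlurmdDebug=debug2
> in slurm.conf; adding DebugFlags=Gres should also make the gres parsing
> show up in the logs.)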
> As for downgrading, I wouldn't suggest running a 17.x slurmdbd against a
> db built with 18.x. I imagine there are enough changes there to cause
> trouble.
> I don't imagine downgrading will fix your issue. If you are running
> 18.08.0, the most recent release is 18.08.3. The NEWS file packed in
> the tarballs lists the fixes in each version; I don't see any that
> would fit your case.
>
>
> On 12/04/2018 02:11 PM, Lou Nicotra wrote:
>> Brian, I used a single gres.conf file and distributed it to all
>> nodes... Restarted all daemons; unfortunately scontrol still does not
>> show any Gres resources for the GPU nodes...
>>
>> Will try to roll back to a 17.x release. Is it basically a matter of
>> removing the 18.x rpms and installing the 17.x ones? Does the DB need
>> to be downgraded also?
>>
>> Thanks...
>> Lou
>>
>> On Tue, Dec 4, 2018 at 10:25 AM Brian W. Johanson <bjohanso at psc.edu> wrote:
>>
>>
>> Do one more pass through, making sure
>> s/1080GTX/1080gtx/ and s/K20/k20/
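>> (Something like grep -rnE 'K20|1080GTX' /etc/slurm/ on each node
>> should come back empty once everything is lowercase.)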
>>
>> shut down all slurmd and slurmctld, then start slurmctld, then start slurmd
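>>
>> e.g., assuming systemd units and pdsh for fan-out (adjust the node
>> list and tooling to your site):
>>
>> pdsh -w tiger[01-22] systemctl stop slurmd
>> systemctl restart slurmctld   # on the controller
>> pdsh -w tiger[01-22] systemctl start slurmd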
>>
>>
>> I find it less confusing to have a global gres.conf file. I
>> haven't used a list (nvidia[0-1]), mainly because I want to specify
>> the cores to use for each gpu.
>>
>> gres.conf would look something like...
>>
>> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia0 Cores=0
>> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia1 Cores=1
>> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=0
>> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia1 Cores=1
>>
>> which can be distributed to all nodes.
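>>
>> After the restart, something like
>> scontrol show node tiger11 | grep -i gres
>> should report Gres=gpu:k20:2 rather than (null) if the config was
>> picked up.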
>>
>> -b
>>
>>
>> On 12/04/2018 09:55 AM, Lou Nicotra wrote:
>>> Brian, the specific node does not show any gres...
>>> root at panther02 slurm# scontrol show partition=tiger_1
>>> PartitionName=tiger_1
>>> AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>>> AllocNodes=ALL Default=YES QoS=N/A
>>> DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO
>>> GraceTime=0 Hidden=NO
>>> MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
>>> MaxCPUsPerNode=UNLIMITED
>>> Nodes=tiger[01-22]
>>> PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
>>> OverSubscribe=NO
>>> OverTimeLimit=NONE PreemptMode=OFF
>>> State=UP TotalCPUs=1056 TotalNodes=22 SelectTypeParameters=NONE
>>> JobDefaults=(null)
>>> DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>>>
>>> root at panther02 slurm# scontrol show node=tiger11
>>> NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
>>> CPUAlloc=0 CPUTot=48 CPULoad=11.50
>>> AvailableFeatures=HyperThread
>>> ActiveFeatures=HyperThread
>>> Gres=(null)
>>> NodeAddr=X.X.X.X NodeHostName=tiger11 Version=18.08
>>> OS=Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
>>> RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
>>> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
>>> MCS_label=N/A
>>> Partitions=tiger_1,compute_1
>>> BootTime=2018-04-02T13:30:12 SlurmdStartTime=2018-12-03T16:13:22
>>> CfgTRES=cpu=48,mem=1M,billing=48
>>> AllocTRES=
>>> CapWatts=n/a
>>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>
>>> So, something is not set up correctly... Could it be an 18.x bug?
>>>
>>> Thanks.
>>>
>>>
>>> On Tue, Dec 4, 2018 at 9:31 AM Lou Nicotra
>>> <lnicotra at interactions.com> wrote:
>>>
>>> Thanks Michael. I will try 17.x as I also could not see
>>> anything wrong with my settings... Will report back
>>> afterwards...
>>>
>>> Lou
>>>
>>> On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico
>>> <mdidomenico4 at gmail.com> wrote:
>>>
>>> Unfortunately, someone smarter than me will have to help
>>> further. I'm not sure I see anything specifically wrong. The
>>> one thing I might try is backing the software down to a 17.x
>>> release series. I recently tried 18.x and had some issues. I
>>> can't say whether it'll be any different, but you might be
>>> exposing an undiagnosed bug in the 18.x branch.
>>> On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra
>>> <lnicotra at interactions.com> wrote:
>>> >
>>> > Made the change in the local gres.conf file on the server
>>> > and restarted slurmd and slurmctld on the master...
>>> > Unfortunately, same error...
>>> >
>>> > Distributed the corrected gres.conf to all k20 servers and
>>> > restarted slurmd and slurmctld... Still the same error...
>>> >
>>> On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson
>>> <bjohanso at psc.edu> wrote:
>>> >>
>>> >> Is that a lowercase k in k20 specified in the batch
>>> >> script and nodename, and an uppercase K specified in gres.conf?
>>> >>
>>> >> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
>>> >>
>>> >> Hi All, I have recently set up a Slurm cluster with my
>>> >> servers and I'm running into an issue while submitting
>>> >> GPU jobs. It has something to do with gres
>>> >> configurations, but I just can't seem to figure out what
>>> >> is wrong. Non-GPU jobs run fine.
>>> >>
>>> >> The error, after submitting a batch job, is as follows:
>>> >> sbatch: error: Batch job submission failed: Invalid
>>> >> Trackable RESource (TRES) specification
>>> >>
>>> >> My batch job is as follows:
>>> >> #!/bin/bash
>>> >> #SBATCH --partition=tiger_1 # partition name
>>> >> #SBATCH --gres=gpu:k20:1
>>> >> #SBATCH --gres-flags=enforce-binding
>>> >> #SBATCH --time=0:20:00 # wall clock limit
>>> >> #SBATCH --output=gpu-%J.txt
>>> >> #SBATCH --account=lnicotra
>>> >> module load cuda
>>> >> python gpu1
>>> >>
>>> >> Where gpu1 is a GPU test script that runs correctly
>>> >> when invoked directly via python. The tiger_1 partition has
>>> >> servers with GPUs, a mix of 1080GTX and K20, as specified
>>> >> in slurm.conf.
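>>> >>
>>> >> (For quicker debugging turnaround, the same request can be
>>> >> made interactively, e.g.:
>>> >> srun --partition=tiger_1 --gres=gpu:k20:1 nvidia-smi )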
>>> >>
>>> >> I have defined GRES resources in the slurm.conf file:
>>> >> # GPU GRES
>>> >> GresTypes=gpu
>>> >> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
>>> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
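>>> >>
>>> >> (After a restart, scontrol show config | grep -i gres should
>>> >> confirm GresTypes=gpu was picked up.)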
>>> >>
>>> >> And have a local gres.conf on the servers containing
>>> GPUs...
>>> >> lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
>>> >> # GPU Definitions
>>> >> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
>>> >> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
>>> >>
>>> >> and a similar one for the 1080GTX
>>> >> # GPU Definitions
>>> >> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
>>> >> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
>>> >>
>>> >> The account manager seems to know about the GPUs...
>>> >> lnicotra at tiger11 ~# sacctmgr show tres
>>> >>     Type            Name     ID
>>> >> -------- --------------- ------
>>> >>      cpu                      1
>>> >>      mem                      2
>>> >>   energy                      3
>>> >>     node                      4
>>> >>  billing                      5
>>> >>       fs            disk      6
>>> >>     vmem                      7
>>> >>    pages                      8
>>> >>     gres             gpu   1001
>>> >>     gres         gpu:k20   1002
>>> >>     gres     gpu:1080gtx   1003
>>> >>
>>> >> Can anyone point out what I am missing?
>>> >>
>>> >> Thanks!
>>> >> Lou
>>> >>
>>> >>
>>> >> --
>>> >> Lou Nicotra
>>> >> IT Systems Engineer - SLT
>>> >> Interactions LLC
>>> >> o: 908-673-1833
>>> >> m: 908-451-6983
>>> >> lnicotra at interactions.com
>>> >> www.interactions.com