[slurm-users] GRES GPU issues

Tina Friedrich tina.friedrich at it.ox.ac.uk
Wed Dec 5 02:55:56 MST 2018


I'm running 18.08.3, and I have a fair number of GPU GRES resources - 
recently upgraded to 18.08.3 from a 17.x release. They definitely work 
in an 18.x release. (I don't distribute the same gres.conf file 
everywhere, though; I've never tried that.)

Just a really stupid question - the /dev/nvidiaX devices do exist, I 
assume? Are you running nvidia-persistenced (or something similar) to 
ensure the cards are up and the device files are initialised?
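That check can be sketched as a small script (the device paths are the standard NVIDIA ones and the persistenced pairing assumes a typical systemd install; adjust for your setup):

```shell
# Verify the NVIDIA device files exist; without nvidia-persistenced (or a
# similar mechanism) they may not appear until something opens the GPU.
check_gpu_devices() {
  local dev_dir=${1:-/dev} idx missing=0
  for idx in 0 1; do
    if [ ! -e "$dev_dir/nvidia$idx" ]; then
      echo "missing: $dev_dir/nvidia$idx"
      missing=1
    fi
  done
  return "$missing"
}

# On a live GPU node you would pair this with:
#   check_gpu_devices && systemctl status nvidia-persistenced
```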

Tina

On 04/12/2018 23:36, Brian W. Johanson wrote:
> Only thing to suggest once again is increasing the logging of both 
> slurmctld and slurmd.
> As for downgrading, I wouldn't suggest running a 17.x slurmdbd against 
> a database built with 18.x. I imagine there are enough changes there to 
> cause trouble.
> I don't imagine downgrading will fix your issue; if you are running 
> 18.08.0, the most recent release is 18.08.3. The NEWS file packed in 
> the tarballs lists the fixes in each version. I don't see any that 
> would fit your case.
> 
> 
> On 12/04/2018 02:11 PM, Lou Nicotra wrote:
>> Brian, I used a single gres.conf file and distributed to all nodes... 
>> Restarted all daemons, unfortunately scontrol still does not show any 
>> Gres resources for GPU nodes...
>>
>> Will try to roll back to a 17.x release. Is it basically a matter of 
>> removing the 18.x rpms and installing the 17.x ones? Does the DB need 
>> to be downgraded also?
>>
>> Thanks...
>> Lou
>>
>> On Tue, Dec 4, 2018 at 10:25 AM Brian W. Johanson 
>> <bjohanso at psc.edu> wrote:
>>
>>
>>     Do one more pass through, making sure
>>     s/1080GTX/1080gtx/ and s/K20/k20/
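That substitution pass can be sketched with sed (on a real system you would run it over /etc/slurm/slurm.conf and gres.conf, paths assumed; here it is shown on a sample line):

```shell
# Normalise the GRES type names to lowercase so slurm.conf, gres.conf,
# and the batch scripts all agree on the same strings.
normalize_gres_case() {
  sed 's/1080GTX/1080gtx/g; s/K20/k20/g'
}

echo 'Name=gpu Type=K20 File=/dev/nvidia[0-1]' | normalize_gres_case
# -> Name=gpu Type=k20 File=/dev/nvidia[0-1]
```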
>>
>>     Shut down all slurmd and slurmctld, then start slurmctld, then 
>>     start slurmd.
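Assuming systemd units and a fan-out tool such as pdsh (both assumptions; substitute your own mechanism and hostlist), that restart order might look like:

```shell
# Emit the restart steps in the required order so they can be reviewed
# before running; the hostlist and unit names are illustrative.
restart_steps() {
  echo 'pdsh -w tiger[01-22] systemctl stop slurmd'
  echo 'systemctl stop slurmctld'
  echo 'systemctl start slurmctld'
  echo 'pdsh -w tiger[01-22] systemctl start slurmd'
}

restart_steps
```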
>>
>>
>>     I find it less confusing to have a global gres.conf file.  I
>>     haven't used a list (nvidia[0-1]), mainly because I want to specify
>>     the cores to use for each gpu.
>>
>>     gres.conf would look something like...
>>
>>     NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k80
>>     File=/dev/nvidia0 Cores=0
>>     NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k80
>>     File=/dev/nvidia1 Cores=1
>>     NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx
>>     File=/dev/nvidia0 Cores=0
>>     NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx
>>     File=/dev/nvidia1 Cores=1
>>
>>     which can be distributed to all nodes.
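After distributing the file and restarting the daemons, you can confirm the controller actually picked the GRES up. A small helper, as a sketch (the node name and grep pattern are illustrative):

```shell
# Succeeds when `scontrol show node <name>` output (read from stdin)
# reports a GPU GRES rather than Gres=(null).
node_has_gres() {
  grep -q 'Gres=gpu'
}

# On a live cluster:
#   scontrol show node tiger11 | node_has_gres && echo "GRES visible"
```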
>>
>>     -b
>>
>>
>>     On 12/04/2018 09:55 AM, Lou Nicotra wrote:
>>>     Brian, the specific node does not show any gres...
>>>     root at panther02 slurm# scontrol show partition=tiger_1
>>>     PartitionName=tiger_1
>>>        AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>>>        AllocNodes=ALL Default=YES QoS=N/A
>>>        DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO
>>>     GraceTime=0 Hidden=NO
>>>        MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
>>>     MaxCPUsPerNode=UNLIMITED
>>>        Nodes=tiger[01-22]
>>>        PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
>>>     OverSubscribe=NO
>>>        OverTimeLimit=NONE PreemptMode=OFF
>>>        State=UP TotalCPUs=1056 TotalNodes=22 SelectTypeParameters=NONE
>>>        JobDefaults=(null)
>>>        DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>>>
>>>     root at panther02 slurm#  scontrol show node=tiger11
>>>     NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
>>>        CPUAlloc=0 CPUTot=48 CPULoad=11.50
>>>        AvailableFeatures=HyperThread
>>>        ActiveFeatures=HyperThread
>>>        Gres=(null)
>>>        NodeAddr=X.X.X.X NodeHostName=tiger11 Version=18.08
>>>        OS=Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
>>>        RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
>>>        State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
>>>     MCS_label=N/A
>>>        Partitions=tiger_1,compute_1
>>>        BootTime=2018-04-02T13:30:12 SlurmdStartTime=2018-12-03T16:13:22
>>>        CfgTRES=cpu=48,mem=1M,billing=48
>>>        AllocTRES=
>>>        CapWatts=n/a
>>>        CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>>        ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>
>>>     So, something is not set up correctly... Could it be an 18.x bug?
>>>
>>>     Thanks.
>>>
>>>
>>>     On Tue, Dec 4, 2018 at 9:31 AM Lou Nicotra
>>>     <lnicotra at interactions.com> wrote:
>>>
>>>         Thanks Michael. I will try 17.x as I also could not see
>>>         anything wrong with my settings... Will report back
>>>         afterwards...
>>>
>>>         Lou
>>>
>>>         On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico
>>>         <mdidomenico4 at gmail.com> wrote:
>>>
>>>             Unfortunately, someone smarter than me will have to help
>>>             further. I'm not sure I see anything specifically wrong.
>>>             The one thing I might try is backing the software down to
>>>             a 17.x release series. I recently tried 18.x and had some
>>>             issues. I can't say whether it'll be any different, but
>>>             you might be exposing an undiagnosed bug in the 18.x
>>>             branch.
>>>             On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra
>>>             <lnicotra at interactions.com> wrote:
>>>             >
>>>             > Made the change in the gres.conf on local server file
>>>             and restarted slurmd and slurmctld on master....
>>>             Unfortunately same error...
>>>             >
>>>             > Distributed corrected gres.conf to all k20 servers,
>>>             restarted slurmd and slurmdctl...   Still has same error...
>>>             >
>>>             > On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson
>>>             <bjohanso at psc.edu> wrote:
>>>             >>
>>>             >> Is that a lowercase k in k20 specified in the batch
>>>             script and nodename, and an uppercase K specified in gres.conf?
>>>             >>
>>>             >> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
>>>             >>
>>>             >> Hi All, I have recently set up a slurm cluster with my
>>>             servers and I'm running into an issue while submitting
>>>             GPU jobs. It has something to do with gres
>>>             configurations, but I just can't seem to figure out what
>>>             is wrong. Non-GPU jobs run fine.
>>>             >>
>>>             >> The error is as follows:
>>>             >> sbatch: error: Batch job submission failed: Invalid
>>>             Trackable RESource (TRES) specification  after submitting
>>>             a batch job.
>>>             >>
>>>             >> My batch job is as follows:
>>>             >> #!/bin/bash
>>>             >> #SBATCH --partition=tiger_1   # partition name
>>>             >> #SBATCH --gres=gpu:k20:1
>>>             >> #SBATCH --gres-flags=enforce-binding
>>>             >> #SBATCH --time=0:20:00  # wall clock limit
>>>             >> #SBATCH --output=gpu-%J.txt
>>>             >> #SBATCH --account=lnicotra
>>>             >> module load cuda
>>>             >> python gpu1
>>>             >>
>>>             >> Where gpu1 is a GPU test script that runs correctly
>>>             while invoked via python. Tiger_1 partition has servers
>>>             with GPUs, with a mix of 1080GTX and K20 as specified in
>>>             slurm.conf
>>>             >>
>>>             >> I have defined GRES resources in the slurm.conf file:
>>>             >> # GPU GRES
>>>             >> GresTypes=gpu
>>>             >> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
>>>             >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22]
>>>             Gres=gpu:k20:2
>>>             >>
>>>             >> And have a local gres.conf on the servers containing
>>>             GPUs...
>>>             >> lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
>>>             >> # GPU Definitions
>>>             >> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22]
>>>             Name=gpu Type=K20 File=/dev/nvidia[0-1]
>>>             >> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
>>>             >>
>>>             >> and a similar one for the 1080GTX
>>>             >> # GPU Definitions
>>>             >> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX
>>>             File=/dev/nvidia[0-1]
>>>             >> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
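Note the case difference between these gres.conf files (Type=K20, Type=1080GTX) and the slurm.conf lines above (gpu:k20:2, gpu:1080gtx:2); as Brian asks about elsewhere in the thread, these strings appear to be compared exactly, so the case must match. A trivial comparison sketch:

```shell
# The thread suggests GRES type names are compared as exact strings, so
# "K20" and "k20" would be treated as different types; compare the
# slurm.conf and gres.conf values directly.
gres_type_matches() {
  [ "$1" = "$2" ]
}

gres_type_matches 'k20' 'K20' || echo 'type mismatch: k20 vs K20'
```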
>>>             >>
>>>             >> The account manager seems to know about the GPUs...
>>>             >> lnicotra at tiger11 ~# sacctmgr show tres
>>>             >>     Type            Name     ID
>>>             >> -------- --------------- ------
>>>             >>      cpu                      1
>>>             >>      mem                      2
>>>             >>   energy                      3
>>>             >>     node                      4
>>>             >>  billing                      5
>>>             >>       fs            disk      6
>>>             >>     vmem                      7
>>>             >>    pages                      8
>>>             >>     gres             gpu   1001
>>>             >>     gres         gpu:k20   1002
>>>             >>     gres     gpu:1080gtx   1003
>>>             >>
>>>             >> Can anyone point out what am I missing?
>>>             >>
>>>             >> Thanks!
>>>             >> Lou
>>>             >>
>>>             >>
>>>             >> --
>>>             >>
>>>             >> Lou Nicotra
>>>             >>
>>>             >> IT Systems Engineer - SLT
>>>             >>
>>>             >> Interactions LLC
>>>             >>
>>>             >> o:  908-673-1833
>>>             >>
>>>             >> m: 908-451-6983
>>>             >>
>>>             >> lnicotra at interactions.com
>>>             >>
>>>             >> www.interactions.com
>>>             >>
>>>             >>
>>>             *******************************************************************************
>>>             >>
>>>             >> This e-mail and any of its attachments may contain
>>>             Interactions LLC proprietary information, which is
>>>             privileged, confidential, or subject to copyright
>>>             belonging to the Interactions LLC. This e-mail is
>>>             intended solely for the use of the individual or entity
>>>             to which it is addressed. If you are not the intended
>>>             recipient of this e-mail, you are hereby notified that
>>>             any dissemination, distribution, copying, or action taken
>>>             in relation to the contents of and attachments to this
>>>             e-mail is strictly prohibited and may be unlawful. If you
>>>             have received this e-mail in error, please notify the
>>>             sender immediately and permanently delete the original
>>>             and any copy of this e-mail and any printout. Thank You.
>>>             >>
>>>             >>
>>>             *******************************************************************************
>>>             >>
>>>             >>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
> 

