[slurm-users] GRES GPU issues
Tina Friedrich
tina.friedrich at it.ox.ac.uk
Wed Dec 5 07:38:22 MST 2018
Hello,
Don't mind sharing the config at all. Not sure it helps, though; it's
pretty basic.
Picking an example node, I have:
[ ~]$ scontrol show node arcus-htc-gpu011
NodeName=arcus-htc-gpu011 Arch=x86_64 CoresPerSocket=8
CPUAlloc=16 CPUTot=16 CPULoad=20.43
AvailableFeatures=cpu_gen:Haswell,cpu_sku:E5-2640v3,cpu_frq:2.60GHz,cpu_mem:64GB,gpu,gpu_mem:12GB,gpu_gen:Kepler,gpu_sku:K40,gpu_cc:3.5,
ActiveFeatures=cpu_gen:Haswell,cpu_sku:E5-2640v3,cpu_frq:2.60GHz,cpu_mem:64GB,gpu,gpu_mem:12GB,gpu_gen:Kepler,gpu_sku:K40,gpu_cc:3.5,
Gres=gpu:k40m:2
NodeAddr=arcus-htc-gpu011 NodeHostName=arcus-htc-gpu011
OS=Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
RealMemory=63000 AllocMem=0 FreeMem=56295 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=96 Owner=N/A
MCS_label=N/A
Partitions=htc
BootTime=2018-11-28T15:12:29 SlurmdStartTime=2018-11-28T17:58:55
CfgTRES=cpu=16,mem=63000M,billing=16
AllocTRES=cpu=16
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
gres.conf on arcus-htc-gpu011 is
[ ~]$ cat /etc/slurm/gres.conf
Name=gpu Type=k40m File=/dev/nvidia0
Name=gpu Type=k40m File=/dev/nvidia1
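As a rough sanity check (flags from memory, so double-check them): running
slurmd in the foreground with extra verbosity on a node should log the GPU
device files it picks up from gres.conf, and scontrol should then report
the gres for that node, i.e. something like

# on the compute node, with the regular slurmd stopped
slurmd -D -vvv
# from anywhere with the client tools installed
scontrol show node arcus-htc-gpu011 | grep -i gres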
The relevant bits of slurm.conf are, I believe:
GresTypes=hbm,gpu
(DebugFlags=Priority,Backfill,NodeFeatures,Gres,Protocol,TraceJobs)
NodeName=arcus-htc-gpu009,arcus-htc-gpu[011-018] Weight=96 Sockets=2
CoresPerSocket=8 ThreadsPerCore=1 RealMemory=63000 Gres=gpu:k40m:2
Feature=cpu_gen:Haswell,cpu_sku:E5-2640v3,cpu_frq:2.60GHz,cpu_mem:64GB,gpu,gpu_mem:12GB,gpu_gen:Kepler,gpu_sku:K40,gpu_cc:3.5,
Don't think I did anything else.
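Jobs then just request the gres in the usual way - something along these
lines (not copied from a real submission script, but it's the general
shape; partition and type names are obviously specific to this cluster):

#SBATCH --partition=htc
#SBATCH --gres=gpu:k40m:1

or interactively, something like

srun -p htc --gres=gpu:k40m:1 nvidia-smi

and that lands on the k40m nodes as expected.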
I have other types of nodes - couple of P100s, couple of V100s, couple
of K80s and one or two odd things (M40, P4).
I used to run with a gres.conf that simply had 'Name=gpu
File=/dev/nvidia[0-2]' (or [0-4], depending) and that also worked; I
introduced the Type= field when I gained a node that has two different
NVIDIA cards, so which card was on which device became important, not
because the 'range' configuration caused problems.
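So for that mixed node the gres.conf ends up listing each card with its
own type - a hypothetical sketch rather than my actual file (I don't have
the exact card names to hand):

# node with, say, one K40 and one P100
Name=gpu Type=k40m File=/dev/nvidia0
Name=gpu Type=p100 File=/dev/nvidia1

with the matching slurm.conf node entry carrying both types, e.g.
Gres=gpu:k40m:1,gpu:p100:1.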
This wasn't a fresh install of 18.x - it was a 17.x installation that I
upgraded to 18.x. Not sure if that makes a difference. I made no changes
to anything (slurm.conf, gres.conf) with the update, though; I just
installed the new RPMs.
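If it's useful, a quick way to confirm after an in-place upgrade that
every daemon really is running the new version (commands from memory, so
verify):

sinfo --version                                # client side
slurmd -V                                      # on each compute node
scontrol show config | grep -i SLURM_VERSION   # what the controller reports

An old slurmd still running from before the upgrade is probably worth
ruling out too.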
Tina
On 05/12/2018 13:20, Lou Nicotra wrote:
> Tina, thanks for confirming that GPU GRES resources work with 18.08... I
> might just upgrade to 18.08.03 as I am running 18.08.0
>
> The nvidia devices exist on all servers and persistence is set. They
> have been in there for a number of years and our users make use of them
> daily. I can actually see that slurmd knows about them while restarting
> the daemon:
> [2018-12-05T08:03:35.989] Slurmd shutdown completing
> [2018-12-05T08:03:36.015] Message aggregation disabled
> [2018-12-05T08:03:36.016] gpu device number 0(/dev/nvidia0):c 195:0 rwm
> [2018-12-05T08:03:36.017] gpu device number 1(/dev/nvidia1):c 195:1 rwm
> [2018-12-05T08:03:36.059] slurmd version 18.08.0 started
> [2018-12-05T08:03:36.059] slurmd started on Wed, 05 Dec 2018 08:03:36 -0500
> [2018-12-05T08:03:36.059] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2
> Memory=386757 TmpDisk=4758 Uptime=21324804 CPUSpecList=(null)
> FeaturesAvail=(null) FeaturesActive=(null)
>
> Would you mind sharing the portions of the slurm.conf and corresponding
> GRES definitions that you are using? Do you have individual GRES files
> for each server based on GPU type? I tried both; neither works.
>
> My slurm.conf file has entries for GPUs as follows:
> GresTypes=gpu
> #AccountingStorageTRES=gres/gpu,gres/gpu:k20,gres/gpu:1080gtx
> (currently commented out)
>
> gres.conf is as follows (I had tried different configs; no change with
> either one...)
> # GPU Definitions
> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia0
> Cores=0
> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia1
> Cores=1
> #NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx
> File=/dev/nvidia[0-1] Cores=0,1
>
> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20
> File=/dev/nvidia0 Cores=0
> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20
> File=/dev/nvidia1 Cores=1
> #NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20
> File=/dev/nvidia[0-1] Cores=0,1
>
> What am I missing?
>
> Thanks...
>
>
>
>
> On Wed, Dec 5, 2018 at 4:59 AM Tina Friedrich
> <tina.friedrich at it.ox.ac.uk> wrote:
>
> I'm running 18.08.3, and I have a fair number of GPU GRES resources -
> recently upgraded to 18.08.03 from a 17.x release. It's definitely not
> as if they don't work in an 18.x release. (I do not distribute the same
> gres.conf file everywhere though, never tried that.)
>
> Just a really stupid question - the /dev/nvidiaX devices do exist, I
> assume? You are running nvidia-persistenced (or something similar) to
> ensure the cards are up & the device files initialised etc?
>
> Tina
>
> On 04/12/2018 23:36, Brian W. Johanson wrote:
> > The only thing to suggest once again is increasing the logging of
> > both slurmctld and slurmd.
> > As for downgrading, I wouldn't suggest running a 17.x slurmdbd
> > against a db built with 18.x. I imagine there are enough changes
> > there to cause trouble.
> > I don't imagine downgrading will fix your issue; if you are running
> > 18.08.0, the most recent release is 18.08.3. NEWS packed in the
> > tarballs gives the fixes in the versions. I don't see any that
> > would fit your case.
> >
> >
> > On 12/04/2018 02:11 PM, Lou Nicotra wrote:
> >> Brian, I used a single gres.conf file and distributed it to all
> >> nodes... Restarted all daemons; unfortunately scontrol still does
> >> not show any Gres resources for GPU nodes...
> >>
> >> Will try to roll back to 17.X release. Is it basically a matter of
> >> removing 18.x rpms and installing 17's? Does the DB need to be
> >> downgraded also?
> >>
> >> Thanks...
> >> Lou
> >>
> >> On Tue, Dec 4, 2018 at 10:25 AM Brian W. Johanson
> >> <bjohanso at psc.edu> wrote:
> >>
> >>
> >> Do one more pass through making sure
> >> s/1080GTX/1080gtx and s/K20/k20
> >>
> >> shut down all slurmd and slurmctld, start slurmctld, start slurmd
> >>
> >>
> >> I find it less confusing to have a global gres.conf file. I
> >> haven't used a list (nvidia[0-1]), mainly because I want to
> >> specify the cores to use for each gpu.
> >>
> >> gres.conf would look something like...
> >>
> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k80
> >> File=/dev/nvidia0 Cores=0
> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k80
> >> File=/dev/nvidia1 Cores=1
> >> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx
> >> File=/dev/nvidia0 Cores=0
> >> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx
> >> File=/dev/nvidia1 Cores=1
> >>
> >> which can be distributed to all nodes.
> >>
> >> -b
> >>
> >>
> >> On 12/04/2018 09:55 AM, Lou Nicotra wrote:
> >>> Brian, the specific node does not show any gres...
> >>> root at panther02 slurm# scontrol show partition=tiger_1
> >>> PartitionName=tiger_1
> >>> AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
> >>> AllocNodes=ALL Default=YES QoS=N/A
> >>> DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO
> >>> GraceTime=0 Hidden=NO
> >>> MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
> >>> MaxCPUsPerNode=UNLIMITED
> >>> Nodes=tiger[01-22]
> >>> PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
> >>> OverSubscribe=NO
> >>> OverTimeLimit=NONE PreemptMode=OFF
> >>> State=UP TotalCPUs=1056 TotalNodes=22 SelectTypeParameters=NONE
> >>> JobDefaults=(null)
> >>> DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
> >>>
> >>> root at panther02 slurm# scontrol show node=tiger11
> >>> NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
> >>> CPUAlloc=0 CPUTot=48 CPULoad=11.50
> >>> AvailableFeatures=HyperThread
> >>> ActiveFeatures=HyperThread
> >>> Gres=(null)
> >>> NodeAddr=X.X.X.X NodeHostName=tiger11 Version=18.08
> >>> OS=Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
> >>> RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
> >>> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
> >>> MCS_label=N/A
> >>> Partitions=tiger_1,compute_1
> >>> BootTime=2018-04-02T13:30:12 SlurmdStartTime=2018-12-03T16:13:22
> >>> CfgTRES=cpu=48,mem=1M,billing=48
> >>> AllocTRES=
> >>> CapWatts=n/a
> >>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> >>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >>>
> >>> So, something is not set up correctly... Could it be an 18.x bug?
> >>>
> >>> Thanks.
> >>>
> >>>
> >>> On Tue, Dec 4, 2018 at 9:31 AM Lou Nicotra
> >>> <lnicotra at interactions.com> wrote:
> >>>
> >>> Thanks Michael. I will try 17.x as I also could not see
> >>> anything wrong with my settings... Will report back
> >>> afterwards...
> >>>
> >>> Lou
> >>>
> >>> On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico
> >>> <mdidomenico4 at gmail.com <mailto:mdidomenico4 at gmail.com>
> <mailto:mdidomenico4 at gmail.com <mailto:mdidomenico4 at gmail.com>>> wrote:
> >>>
> >>> unfortunately, someone smarter than me will have to help further.
> >>> I'm not sure I see anything specifically wrong. The one thing I
> >>> might try is backing the software down to a 17.x release series. I
> >>> recently tried 18.x and had some issues. I can't say whether it'll
> >>> be any different, but you might be exposing an undiagnosed bug in
> >>> the 18.x branch.
> >>> On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra
> >>> <lnicotra at interactions.com> wrote:
> >>> >
> >>> > Made the change in the gres.conf file on the local server and
> >>> > restarted slurmd and slurmctld on the master... Unfortunately,
> >>> > same error...
> >>> >
> >>> > Distributed the corrected gres.conf to all k20 servers and
> >>> > restarted slurmd and slurmctld... Still the same error...
> >>> >
> >>> > On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson
> >>> > <bjohanso at psc.edu> wrote:
> >>> >>
> >>> >> Is that a lowercase k in k20 specified in the batch script and
> >>> >> nodename, and an uppercase K specified in gres.conf?
> >>> >>
> >>> >> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
> >>> >>
> >>> >> Hi All, I have recently set up a slurm cluster with my servers
> >>> >> and I'm running into an issue while submitting GPU jobs. It has
> >>> >> something to do with gres configurations, but I just can't seem
> >>> >> to figure out what is wrong. Non-GPU jobs run fine.
> >>> >>
> >>> >> The error is as follows:
> >>> >> sbatch: error: Batch job submission failed: Invalid Trackable
> >>> >> RESource (TRES) specification after submitting a batch job.
> >>> >>
> >>> >> My batch job is as follows:
> >>> >> #!/bin/bash
> >>> >> #SBATCH --partition=tiger_1 # partition name
> >>> >> #SBATCH --gres=gpu:k20:1
> >>> >> #SBATCH --gres-flags=enforce-binding
> >>> >> #SBATCH --time=0:20:00 # wall clock limit
> >>> >> #SBATCH --output=gpu-%J.txt
> >>> >> #SBATCH --account=lnicotra
> >>> >> module load cuda
> >>> >> python gpu1
> >>> >>
> >>> >> Where gpu1 is a GPU test script that runs correctly when invoked
> >>> >> via python. The tiger_1 partition has servers with GPUs, with a
> >>> >> mix of 1080GTX and K20, as specified in slurm.conf.
> >>> >>
> >>> >> I have defined GRES resources in the slurm.conf file:
> >>> >> # GPU GRES
> >>> >> GresTypes=gpu
> >>> >> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
> >>> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
> >>> >>
> >>> >> And have a local gres.conf on the servers containing
> >>> GPUs...
> >>> >> lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
> >>> >> # GPU Definitions
> >>> >> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22]
> >>> Name=gpu Type=K20 File=/dev/nvidia[0-1]
> >>> >> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
> >>> >>
> >>> >> and a similar one for the 1080GTX
> >>> >> # GPU Definitions
> >>> >> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
> >>> >> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
> >>> >>
> >>> >> The account manager seems to know about the GPUs...
> >>> >> lnicotra at tiger11 ~# sacctmgr show tres
> >>> >> Type Name ID
> >>> >> -------- --------------- ------
> >>> >> cpu 1
> >>> >> mem 2
> >>> >> energy 3
> >>> >> node 4
> >>> >> billing 5
> >>> >> fs disk 6
> >>> >> vmem 7
> >>> >> pages 8
> >>> >> gres gpu 1001
> >>> >> gres gpu:k20 1002
> >>> >> gres gpu:1080gtx 1003
> >>> >>
> >>> >> Can anyone point out what am I missing?
> >>> >>
> >>> >> Thanks!
> >>> >> Lou
> >>> >>
> --
> Lou Nicotra
> IT Systems Engineer - SLT
> Interactions LLC
> o: 908-673-1833
> m: 908-451-6983
> lnicotra at interactions.com
> www.interactions.com