[slurm-users] GRES GPU issues
Lou Nicotra
lnicotra at interactions.com
Wed Dec 5 06:20:53 MST 2018
Tina, thanks for confirming that GPU GRES resources work with 18.08... I
might just upgrade to 18.08.3, as I am currently running 18.08.0.
The nvidia devices exist on all servers and persistence is set. They have
been in there for a number of years and our users make use of them daily. I
can actually see that slurmd knows about them when restarting the daemon:
[2018-12-05T08:03:35.989] Slurmd shutdown completing
[2018-12-05T08:03:36.015] Message aggregation disabled
[2018-12-05T08:03:36.016] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2018-12-05T08:03:36.017] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2018-12-05T08:03:36.059] slurmd version 18.08.0 started
[2018-12-05T08:03:36.059] slurmd started on Wed, 05 Dec 2018 08:03:36 -0500
[2018-12-05T08:03:36.059] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2
Memory=386757 TmpDisk=4758 Uptime=21324804 CPUSpecList=(null)
FeaturesAvail=(null) FeaturesActive=(null)
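As a quick way to confirm which devices slurmd reports on each node, the "gpu device number" lines can be pulled out of the log. This is only a minimal Python sketch: the regular expression is written against the two log lines shown above, and the function name detected_gpus is made up for illustration.

```python
import re

# Matches lines such as:
# [2018-12-05T08:03:36.016] gpu device number 0(/dev/nvidia0):c 195:0 rwm
# The pattern is an assumption based only on the log excerpt in this message.
GPU_LINE = re.compile(r"gpu device number (\d+)\((/dev/[^)]+)\):c (\d+):(\d+)")


def detected_gpus(log_text):
    """Return (index, device_path, major, minor) for each GPU line in the log."""
    found = []
    for line in log_text.splitlines():
        m = GPU_LINE.search(line)
        if m:
            found.append((int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4))))
    return found


SAMPLE_LOG = """\
[2018-12-05T08:03:36.015] Message aggregation disabled
[2018-12-05T08:03:36.016] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2018-12-05T08:03:36.017] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2018-12-05T08:03:36.059] slurmd version 18.08.0 started
"""

if __name__ == "__main__":
    for index, path, major, minor in detected_gpus(SAMPLE_LOG):
        print(f"GPU {index}: {path} (char device {major}:{minor})")
```

Running the same function over the slurmd log of every GPU node would show whether any node is missing a device, independently of what slurmctld reports.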
Would you mind sharing the portions of slurm.conf and the corresponding
GRES definitions that you are using? Do you have individual GRES files for
each server based on GPU type? I tried both approaches, and neither works.
My slurm.conf file has entries for GPUs as follows:
GresTypes=gpu
#AccountingStorageTRES=gres/gpu,gres/gpu:k20,gres/gpu:1080gtx (currently
commented out)
gres.conf is as follows (I have tried different configurations, with no
change from either one):
# GPU Definitions
NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=0
NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia1 Cores=1
#NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia[0-1] Cores=0,1
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia0 Cores=0
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia1 Cores=1
#NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia[0-1] Cores=0,1
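
One way to rule out a simple mismatch between the two files is to compare the GPU types and counts each one declares per node. The following is a minimal Python sketch, not a Slurm tool. It assumes node lines of the form NodeName=... Gres=gpu:<type>:<count> in slurm.conf (as in the original post quoted further down), handles only a single bracket range per name, and compares type names case-sensitively, which is an assumption prompted by the K20/k20 question earlier in the thread. All function names are hypothetical.

```python
import re
from collections import Counter, defaultdict


def expand_brackets(expr):
    """Expand one bracketed range, e.g. 'tiger[01,05-07]' or '/dev/nvidia[0-1]'."""
    m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]", expr)
    if not m:
        return [expr]  # no bracket expression: a single name
    prefix, body = m.groups()
    names = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo
        width = len(lo)  # preserve zero padding such as "01"
        names.extend(f"{prefix}{n:0{width}d}" for n in range(int(lo), int(hi) + 1))
    return names


def fields(line):
    """Parse 'Key=Value' tokens into a dict with lower-cased keys."""
    return {k.lower(): v for k, v in (t.split("=", 1) for t in line.split() if "=" in t)}


def active_lines(text):
    """Yield non-empty lines that are not comments."""
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            yield line


def gpus_in_gres_conf(text):
    """Map node -> Counter(GPU type -> number of device files) from gres.conf."""
    seen = defaultdict(Counter)
    for line in active_lines(text):
        f = fields(line)
        if f.get("name") != "gpu" or "nodename" not in f or "file" not in f:
            continue
        n_files = len(expand_brackets(f["file"]))
        for node in expand_brackets(f["nodename"]):
            seen[node][f.get("type", "")] += n_files
    return seen


def gpus_in_slurm_conf(text):
    """Map node -> Counter(GPU type -> count) from 'Gres=gpu:<type>:<count>'."""
    declared = defaultdict(Counter)
    for line in active_lines(text):
        f = fields(line)
        if "nodename" not in f or "gres" not in f:
            continue
        for item in f["gres"].split(","):
            parts = item.split(":")
            if len(parts) == 3 and parts[0] == "gpu":
                for node in expand_brackets(f["nodename"]):
                    declared[node][parts[1]] += int(parts[2])
    return declared


def mismatches(slurm_conf, gres_conf):
    """Return the sorted node names whose GPU declarations differ between the files."""
    declared = gpus_in_slurm_conf(slurm_conf)
    seen = gpus_in_gres_conf(gres_conf)
    return sorted(n for n in set(declared) | set(seen) if declared.get(n) != seen.get(n))


SLURM_CONF = """\
GresTypes=gpu
NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
"""

GRES_CONF = """\
NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=0
NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia1 Cores=1
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia0 Cores=0
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia1 Cores=1
"""

if __name__ == "__main__":
    print("nodes with mismatched GPU declarations:", mismatches(SLURM_CONF, GRES_CONF))
```

A node that appears in only one of the two files is also reported, so a slurm.conf whose node lines carry no Gres= entry at all would list every GPU node.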
What am I missing?
Thanks...
On Wed, Dec 5, 2018 at 4:59 AM Tina Friedrich <tina.friedrich at it.ox.ac.uk>
wrote:
> I'm running 18.08.3, and I have a fair number of GPU GRES resources -
> recently upgraded to 18.08.3 from a 17.x release. It's definitely not
> as if they don't work in an 18.x release. (I do not distribute the same
> gres.conf file everywhere though, never tried that.)
>
> Just a really stupid question - the /dev/nvidiaX devices do exist, I
> assume? You are running nvidia-persistenced (or something similar) to
> ensure the cards are up & the device files initialised etc?
>
> Tina
>
> On 04/12/2018 23:36, Brian W. Johanson wrote:
> > The only thing to suggest once again is increasing the logging of both
> > slurmctld and slurmd.
> > As for downgrading, I wouldn't suggest running a 17.x slurmdbd against a
> > db built with 18.x. I imagine there are enough changes there to cause
> > trouble.
> > I don't imagine downgrading will fix your issue. If you are running
> > 18.08.0, the most recent release is 18.08.3. The NEWS file packed in the
> > tarballs lists the fixes in each version. I don't see any that would
> > fit your case.
> >
> >
> > On 12/04/2018 02:11 PM, Lou Nicotra wrote:
> >> Brian, I used a single gres.conf file and distributed to all nodes...
> >> Restarted all daemons, unfortunately scontrol still does not show any
> >> Gres resources for GPU nodes...
> >>
> >> Will try to roll back to 17.X release. Is it basically a matter of
> >> removing 18.x rpms and installing 17's? Does the DB need to be
> >> downgraded also?
> >>
> >> Thanks...
> >> Lou
> >>
> >> On Tue, Dec 4, 2018 at 10:25 AM Brian W. Johanson <bjohanso at psc.edu
> >> <mailto:bjohanso at psc.edu>> wrote:
> >>
> >>
> >> Do one more pass through, making sure of
> >> s/1080GTX/1080gtx and s/K20/k20.
> >>
> >> Shut down all slurmd and slurmctld daemons, then start slurmctld, then
> >> start slurmd.
> >>
> >>
> >> I find it less confusing to have a global gres.conf file. I
> >> haven't used a list (nvidia[0-1]), mainly because I want to specify
> >> the cores to use for each GPU.
> >>
> >> gres.conf would look something like...
> >>
> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k80
> >> File=/dev/nvidia0 Cores=0
> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k80
> >> File=/dev/nvidia1 Cores=1
> >> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx
> >> File=/dev/nvidia0 Cores=0
> >> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx
> >> File=/dev/nvidia1 Cores=1
> >>
> >> which can be distributed to all nodes.
> >>
> >> -b
> >>
> >>
> >> On 12/04/2018 09:55 AM, Lou Nicotra wrote:
> >>> Brian, the specific node does not show any gres...
> >>> root at panther02 slurm# scontrol show partition=tiger_1
> >>> PartitionName=tiger_1
> >>> AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
> >>> AllocNodes=ALL Default=YES QoS=N/A
> >>> DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO
> >>> GraceTime=0 Hidden=NO
> >>> MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
> >>> MaxCPUsPerNode=UNLIMITED
> >>> Nodes=tiger[01-22]
> >>> PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
> >>> OverSubscribe=NO
> >>> OverTimeLimit=NONE PreemptMode=OFF
> >>> State=UP TotalCPUs=1056 TotalNodes=22 SelectTypeParameters=NONE
> >>> JobDefaults=(null)
> >>> DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
> >>>
> >>> root at panther02 slurm# scontrol show node=tiger11
> >>> NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
> >>> CPUAlloc=0 CPUTot=48 CPULoad=11.50
> >>> AvailableFeatures=HyperThread
> >>> ActiveFeatures=HyperThread
> >>> Gres=(null)
> >>> NodeAddr=X.X.X.X NodeHostName=tiger11 Version=18.08
> >>> OS=Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
> >>> RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
> >>> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
> >>> MCS_label=N/A
> >>> Partitions=tiger_1,compute_1
> >>> BootTime=2018-04-02T13:30:12 SlurmdStartTime=2018-12-03T16:13:22
> >>> CfgTRES=cpu=48,mem=1M,billing=48
> >>> AllocTRES=
> >>> CapWatts=n/a
> >>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> >>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >>>
> >>> So, something is not set up correctly... Could it be an 18.x bug?
> >>>
> >>> Thanks.
> >>>
> >>>
> >>> On Tue, Dec 4, 2018 at 9:31 AM Lou Nicotra
> >>> <lnicotra at interactions.com <mailto:lnicotra at interactions.com>> wrote:
> >>>
> >>> Thanks Michael. I will try 17.x as I also could not see
> >>> anything wrong with my settings... Will report back
> >>> afterwards...
> >>>
> >>> Lou
> >>>
> >>> On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico
> >>> <mdidomenico4 at gmail.com <mailto:mdidomenico4 at gmail.com>> wrote:
> >>>
> >>> Unfortunately, someone smarter than me will have to help
> >>> further. I'm not sure I see anything specifically wrong. The one
> >>> thing I might try is backing the software down to a 17.x release
> >>> series. I recently tried 18.x and had some issues. I can't say
> >>> whether it'll be any different, but you might be exposing an
> >>> undiagnosed bug in the 18.x branch.
> >>> On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra
> >>> <lnicotra at interactions.com
> >>> <mailto:lnicotra at interactions.com>> wrote:
> >>> >
> >>> > Made the change in the gres.conf file on the local server and
> >>> > restarted slurmd and slurmctld on the master... Unfortunately,
> >>> > same error...
> >>> >
> >>> > Distributed the corrected gres.conf to all k20 servers and
> >>> > restarted slurmd and slurmctld... Still the same error...
> >>> >
> >>> > On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson
> >>> <bjohanso at psc.edu <mailto:bjohanso at psc.edu>> wrote:
> >>> >>
> >>> >> Is that a lowercase k in k20 specified in the batch
> >>> >> script and nodename, and an uppercase K specified in gres.conf?
> >>> >>
> >>> >> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
> >>> >>
> >>> >> Hi All, I have recently set up a Slurm cluster with my
> >>> >> servers and I'm running into an issue while submitting
> >>> >> GPU jobs. It has something to do with the gres
> >>> >> configuration, but I just can't seem to figure out what
> >>> >> is wrong. Non-GPU jobs run fine.
> >>> >>
> >>> >> After submitting a batch job, the error is as follows:
> >>> >> sbatch: error: Batch job submission failed: Invalid
> >>> >> Trackable RESource (TRES) specification
> >>> >>
> >>> >> My batch job is as follows:
> >>> >> #!/bin/bash
> >>> >> #SBATCH --partition=tiger_1 # partition name
> >>> >> #SBATCH --gres=gpu:k20:1
> >>> >> #SBATCH --gres-flags=enforce-binding
> >>> >> #SBATCH --time=0:20:00 # wall clock limit
> >>> >> #SBATCH --output=gpu-%J.txt
> >>> >> #SBATCH --account=lnicotra
> >>> >> module load cuda
> >>> >> python gpu1
> >>> >>
> >>> >> Where gpu1 is a GPU test script that runs correctly
> >>> >> when invoked via python. The tiger_1 partition has servers
> >>> >> with GPUs, a mix of 1080GTX and K20, as specified in
> >>> >> slurm.conf.
> >>> >>
> >>> >> I have defined GRES resources in the slurm.conf file:
> >>> >> # GPU GRES
> >>> >> GresTypes=gpu
> >>> >> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
> >>> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22]
> >>> Gres=gpu:k20:2
> >>> >>
> >>> >> And have a local gres.conf on the servers containing
> >>> GPUs...
> >>> >> lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
> >>> >> # GPU Definitions
> >>> >> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22]
> >>> Name=gpu Type=K20 File=/dev/nvidia[0-1]
> >>> >> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
> >>> >>
> >>> >> and a similar one for the 1080GTX
> >>> >> # GPU Definitions
> >>> >> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX
> >>> File=/dev/nvidia[0-1]
> >>> >> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
> >>> >>
> >>> >> The account manager seems to know about the GPUs...
> >>> >> lnicotra at tiger11 ~# sacctmgr show tres
> >>> >> Type Name ID
> >>> >> -------- --------------- ------
> >>> >> cpu 1
> >>> >> mem 2
> >>> >> energy 3
> >>> >> node 4
> >>> >> billing 5
> >>> >> fs disk 6
> >>> >> vmem 7
> >>> >> pages 8
> >>> >> gres gpu 1001
> >>> >> gres gpu:k20 1002
> >>> >> gres gpu:1080gtx 1003
> >>> >>
> >>> >> Can anyone point out what I am missing?
> >>> >>
> >>> >> Thanks!
> >>> >> Lou
> >>> >>
> >>> >>
> >>> >> --
> >>> >>
> >>> >> Lou Nicotra
> >>> >>
> >>> >> IT Systems Engineer - SLT
> >>> >>
> >>> >> Interactions LLC
> >>> >>
> >>> >> o: 908-673-1833
> >>> >>
> >>> >> m: 908-451-6983
> >>> >>
> >>> >> lnicotra at interactions.com
> >>> <mailto:lnicotra at interactions.com>
> >>> >>
> >>> >> www.interactions.com <http://www.interactions.com>
> >>> >>
> >>> >>
> >>>
> *******************************************************************************
> >>> >>
> >>> >> This e-mail and any of its attachments may contain
> >>> Interactions LLC proprietary information, which is
> >>> privileged, confidential, or subject to copyright
> >>> belonging to the Interactions LLC. This e-mail is
> >>> intended solely for the use of the individual or entity
> >>> to which it is addressed. If you are not the intended
> >>> recipient of this e-mail, you are hereby notified that
> >>> any dissemination, distribution, copying, or action taken
> >>> in relation to the contents of and attachments to this
> >>> e-mail is strictly prohibited and may be unlawful. If you
> >>> have received this e-mail in error, please notify the
> >>> sender immediately and permanently delete the original
> >>> and any copy of this e-mail and any printout. Thank You.
> >>> >>
> >>> >>
> >>>
> *******************************************************************************
> >>> >>
> >>> >>
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >>
> >
>
--
*Lou Nicotra*
IT Systems Engineer - SLT
Interactions LLC
o: 908-673-1833 <781-405-5114>
m: 908-451-6983 <781-405-5114>
*lnicotra at interactions.com <lnicotra at interactions.com>*
www.interactions.com