[slurm-users] GRES GPU issues

Lou Nicotra lnicotra at interactions.com
Wed Dec 5 06:20:53 MST 2018


Tina, thanks for confirming that GPU GRES resources work with 18.08... I
might just upgrade to 18.08.3, as I am running 18.08.0.

The nvidia devices exist on all servers and persistence is set. They have
been in place for a number of years and our users make use of them daily. I
can actually see that slurmd knows about them while restarting the daemon:
[2018-12-05T08:03:35.989] Slurmd shutdown completing
[2018-12-05T08:03:36.015] Message aggregation disabled
[2018-12-05T08:03:36.016] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2018-12-05T08:03:36.017] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2018-12-05T08:03:36.059] slurmd version 18.08.0 started
[2018-12-05T08:03:36.059] slurmd started on Wed, 05 Dec 2018 08:03:36 -0500
[2018-12-05T08:03:36.059] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2
Memory=386757 TmpDisk=4758 Uptime=21324804 CPUSpecList=(null)
FeaturesAvail=(null) FeaturesActive=(null)
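
As a quick sanity check (assuming the standard slurmd flags), running slurmd
in the foreground with extra verbosity on one of the GPU nodes shows how
gres.conf is parsed at startup:

slurmd -D -vvv    # run on a GPU node, e.g. tiger11; watch for gres/gpu lines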

Would you mind sharing the portions of slurm.conf and the corresponding
GRES definitions that you are using? Do you have individual gres.conf files
for each server based on GPU type? I tried both approaches; neither works.

My slurm.conf file has entries for GPUs as follows:
GresTypes=gpu
#AccountingStorageTRES=gres/gpu,gres/gpu:k20,gres/gpu:1080gtx
(currently commented out)
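
For completeness, the matching node definitions in slurm.conf (quoted
further down in this thread) are:
NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2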

gres.conf is as follows (I tried different configs; neither made a
difference):
# GPU Definitions
NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=0
NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia1 Cores=1
#NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia[0-1] Cores=0,1

NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia0 Cores=0
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia1 Cores=1
#NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia[0-1] Cores=0,1
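
After distributing gres.conf and restarting the daemons, something like the
following should show whether the GRES registered (tiger01 here just as an
example):
scontrol show node tiger01 | grep -i gres
sinfo -o "%N %G"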

What am I missing?

Thanks...




On Wed, Dec 5, 2018 at 4:59 AM Tina Friedrich <tina.friedrich at it.ox.ac.uk>
wrote:

> I'm running 18.08.3, and I have a fair number of GPU GRES resources -
> recently upgraded to 18.08.3 from a 17.x release. It's definitely not
> as if they don't work in an 18.x release. (I do not distribute the same
> gres.conf file everywhere, though; I never tried that.)
>
> Just a really stupid question - the /dev/nvidiaX devices do exist, I
> assume? You are running nvidia-persistenced (or something similar) to
> ensure the cards are up & the device files initialised etc?
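>
> Something along these lines (assuming the usual NVIDIA tooling and the
> standard systemd service name) would confirm both:
>
> ls -l /dev/nvidia*                    # device files present?
> nvidia-smi                            # cards up and visible?
> systemctl status nvidia-persistenced  # persistence daemon running?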
>
> Tina
>
> On 04/12/2018 23:36, Brian W. Johanson wrote:
> > The only thing to suggest, once again, is increasing the logging of both
> > slurmctld and slurmd.
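> >
> > Something like the following bumps both on the fly (scontrol's setdebug
> > takes the usual level names; setdebugflags with Gres, if I remember
> > right, adds GRES-specific detail):
> >
> > scontrol setdebug debug2
> > scontrol setdebugflags +Gres
> >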
> > As for downgrading, I wouldn't suggest running a 17.x slurmdbd against a
> > db built with 18.x.  I imagine there are enough changes there to cause
> > trouble.
> > I don't imagine downgrading will fix your issue; if you are running
> > 18.08.0, the most recent release is 18.08.3.  The NEWS file packed in the
> > tarballs lists the fixes in each version.  I don't see any that would
> > fit your case.
> >
> >
> > On 12/04/2018 02:11 PM, Lou Nicotra wrote:
> >> Brian, I used a single gres.conf file and distributed it to all nodes...
> >> Restarted all daemons; unfortunately scontrol still does not show any
> >> GRES resources for the GPU nodes...
> >>
> >> I will try to roll back to a 17.x release. Is it basically a matter of
> >> removing the 18.x RPMs and installing the 17.x ones? Does the DB need
> >> to be downgraded also?
> >>
> >> Thanks...
> >> Lou
> >>
> >> On Tue, Dec 4, 2018 at 10:25 AM Brian W. Johanson <bjohanso at psc.edu> wrote:
> >>
> >>
> >>     Do one more pass through, making sure
> >>     s/1080GTX/1080gtx/ and s/K20/k20/
> >>
> >>     Shut down all slurmd and slurmctld daemons, then start slurmctld,
> >>     then start the slurmds.
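> >>
> >>     Assuming systemd units, that sequence is roughly (fan out to the
> >>     nodes however you normally do):
> >>
> >>     systemctl stop slurmd        # on every compute node
> >>     systemctl stop slurmctld     # on the controller
> >>     systemctl start slurmctld    # on the controller
> >>     systemctl start slurmd       # on every compute node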
> >>
> >>
> >>     I find it less confusing to have a global gres.conf file.  I
> >>     haven't used a list (nvidia[0-1]), mainly because I want to specify
> >>     the cores to use for each GPU.
> >>
> >>     gres.conf would look something like...
> >>
> >>     NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k80 File=/dev/nvidia0 Cores=0
> >>     NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k80 File=/dev/nvidia1 Cores=1
> >>     NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=0
> >>     NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia1 Cores=1
> >>
> >>     which can be distributed to all nodes.
> >>
> >>     -b
> >>
> >>
> >>     On 12/04/2018 09:55 AM, Lou Nicotra wrote:
> >>>     Brian, the specific node does not show any gres...
> >>>     root at panther02 slurm# scontrol show partition=tiger_1
> >>>     PartitionName=tiger_1
> >>>        AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
> >>>        AllocNodes=ALL Default=YES QoS=N/A
> >>>        DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO
> >>>     GraceTime=0 Hidden=NO
> >>>        MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
> >>>     MaxCPUsPerNode=UNLIMITED
> >>>        Nodes=tiger[01-22]
> >>>        PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
> >>>     OverSubscribe=NO
> >>>        OverTimeLimit=NONE PreemptMode=OFF
> >>>        State=UP TotalCPUs=1056 TotalNodes=22 SelectTypeParameters=NONE
> >>>        JobDefaults=(null)
> >>>        DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
> >>>
> >>>     root at panther02 slurm#  scontrol show node=tiger11
> >>>     NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
> >>>        CPUAlloc=0 CPUTot=48 CPULoad=11.50
> >>>        AvailableFeatures=HyperThread
> >>>        ActiveFeatures=HyperThread
> >>>        Gres=(null)
> >>>        NodeAddr=X.X.X.X NodeHostName=tiger11 Version=18.08
> >>>        OS=Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
> >>>        RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
> >>>        State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
> >>>     MCS_label=N/A
> >>>        Partitions=tiger_1,compute_1
> >>>        BootTime=2018-04-02T13:30:12 SlurmdStartTime=2018-12-03T16:13:22
> >>>        CfgTRES=cpu=48,mem=1M,billing=48
> >>>        AllocTRES=
> >>>        CapWatts=n/a
> >>>        CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> >>>        ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> >>>
> >>>     So, something is not set up correctly... Could it be an 18.x bug?
> >>>
> >>>     Thanks.
> >>>
> >>>
> >>>     On Tue, Dec 4, 2018 at 9:31 AM Lou Nicotra <lnicotra at interactions.com> wrote:
> >>>
> >>>         Thanks Michael. I will try 17.x as I also could not see
> >>>         anything wrong with my settings... Will report back
> >>>         afterwards...
> >>>
> >>>         Lou
> >>>
> >>>         On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico <mdidomenico4 at gmail.com> wrote:
> >>>
> >>>             Unfortunately, someone smarter than me will have to help
> >>>             further.  I'm not sure I see anything specifically wrong.
> >>>             The one thing I might try is backing the software down to
> >>>             a 17.x release series.  I recently tried 18.x and had some
> >>>             issues.  I can't say whether it'll be any different, but
> >>>             you might be exposing an undiagnosed bug in the 18.x
> >>>             branch.
> >>>             On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra <lnicotra at interactions.com> wrote:
> >>>             >
> >>>             > Made the change in the local gres.conf file on the server
> >>>             > and restarted slurmd, plus slurmctld on the master....
> >>>             > Unfortunately, same error...
> >>>             >
> >>>             > Distributed the corrected gres.conf to all k20 servers and
> >>>             > restarted slurmd and slurmctld...  Still the same error...
> >>>             >
> >>>             > On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson <bjohanso at psc.edu> wrote:
> >>>             >>
> >>>             >> Is that a lowercase k in the k20 specified in the batch
> >>>             >> script and NodeName, and an uppercase K specified in
> >>>             >> gres.conf?
> >>>             >>
> >>>             >> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
> >>>             >>
> >>>             >> Hi All, I have recently set up a Slurm cluster with my
> >>>             >> servers and I'm running into an issue while submitting
> >>>             >> GPU jobs. It has something to do with the GRES
> >>>             >> configuration, but I just can't seem to figure out what
> >>>             >> is wrong. Non-GPU jobs run fine.
> >>>             >>
> >>>             >> The error after submitting a batch job is as follows:
> >>>             >> sbatch: error: Batch job submission failed: Invalid
> >>>             >> Trackable RESource (TRES) specification
> >>>             >>
> >>>             >> My batch job is as follows:
> >>>             >> #!/bin/bash
> >>>             >> #SBATCH --partition=tiger_1   # partition name
> >>>             >> #SBATCH --gres=gpu:k20:1
> >>>             >> #SBATCH --gres-flags=enforce-binding
> >>>             >> #SBATCH --time=0:20:00  # wall clock limit
> >>>             >> #SBATCH --output=gpu-%J.txt
> >>>             >> #SBATCH --account=lnicotra
> >>>             >> module load cuda
> >>>             >> python gpu1
> >>>             >>
> >>>             >> Where gpu1 is a GPU test script that runs correctly
> >>>             >> when invoked via python. The tiger_1 partition has
> >>>             >> servers with GPUs, a mix of 1080GTX and K20, as
> >>>             >> specified in slurm.conf.
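> >>>             >>
> >>>             >> (The same request can be tested without the batch
> >>>             >> script; a minimal interactive check would be something
> >>>             >> like:
> >>>             >>
> >>>             >> srun --partition=tiger_1 --gres=gpu:k20:1 nvidia-smi
> >>>             >> )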
> >>>             >>
> >>>             >> I have defined GRES resources in the slurm.conf file:
> >>>             >> # GPU GRES
> >>>             >> GresTypes=gpu
> >>>             >> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
> >>>             >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
> >>>             >>
> >>>             >> And I have a local gres.conf on the servers containing
> >>>             >> GPUs...
> >>>             >> lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
> >>>             >> # GPU Definitions
> >>>             >> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
> >>>             >> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
> >>>             >>
> >>>             >> and a similar one for the 1080GTX
> >>>             >> # GPU Definitions
> >>>             >> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
> >>>             >> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
> >>>             >>
> >>>             >> The account manager seems to know about the GPUs...
> >>>             >> lnicotra at tiger11 ~# sacctmgr show tres
> >>>             >>     Type            Name     ID
> >>>             >> -------- --------------- ------
> >>>             >>      cpu                      1
> >>>             >>      mem                      2
> >>>             >>   energy                      3
> >>>             >>     node                      4
> >>>             >>  billing                      5
> >>>             >>       fs            disk      6
> >>>             >>     vmem                      7
> >>>             >>    pages                      8
> >>>             >>     gres             gpu   1001
> >>>             >>     gres         gpu:k20   1002
> >>>             >>     gres     gpu:1080gtx   1003
> >>>             >>
> >>>             >> Can anyone point out what I am missing?
> >>>             >>
> >>>             >> Thanks!
> >>>             >> Lou
> >>>             >
> >>>             >
> >
>


-- 

*Lou Nicotra*

IT Systems Engineer - SLT

Interactions LLC

o: 908-673-1833

m: 908-451-6983

*lnicotra at interactions.com*
www.interactions.com

-- 

*******************************************************************************

This e-mail and any of its attachments may contain Interactions LLC
proprietary information, which is privileged, confidential, or subject to
copyright belonging to the Interactions LLC. This e-mail is intended solely
for the use of the individual or entity to which it is addressed. If you are
not the intended recipient of this e-mail, you are hereby notified that any
dissemination, distribution, copying, or action taken in relation to the
contents of and attachments to this e-mail is strictly prohibited and may be
unlawful. If you have received this e-mail in error, please notify the
sender immediately and permanently delete the original and any copy of this
e-mail and any printout. Thank You.

*******************************************************************************