[slurm-users] GRES GPU issues

Brian W. Johanson bjohanso at psc.edu
Tue Dec 4 16:36:12 MST 2018


The only thing to suggest, once again, is increasing the logging of both 
slurmctld and slurmd.
As for downgrading, I wouldn't suggest running a 17.x slurmdbd against a db 
built with 18.x.  I imagine there are enough changes there to cause trouble.
I don't imagine downgrading will fix your issue; if you are running 18.08.0, 
the most recent release is 18.08.3.  The NEWS file packed in the tarballs 
lists the fixes in each version.  I don't see any that would fit your case.
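
On the logging: for slurmctld the extra verbosity can be flipped on at 
runtime, something like this (a sketch; pick whatever debug level suits you):

    scontrol setdebug debug2
    scontrol setdebugflags +Gres

For slurmd, SlurmdDebug=debug2 in slurm.conf plus a daemon restart does the 
equivalent.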


On 12/04/2018 02:11 PM, Lou Nicotra wrote:
> Brian, I used a single gres.conf file and distributed it to all nodes... 
> Restarted all daemons; unfortunately, scontrol still does not show any Gres 
> resources for the GPU nodes...
>
> Will try to roll back to a 17.x release. Is it basically a matter of removing 
> the 18.x RPMs and installing the 17.x ones? Does the DB need to be downgraded also?
>
> Thanks...
> Lou
>
> On Tue, Dec 4, 2018 at 10:25 AM Brian W. Johanson <bjohanso at psc.edu> wrote:
>
>
>     Do one more pass through, making sure
>     s/1080GTX/1080gtx/ and s/K20/k20/
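>
>     e.g. in one pass with sed (a sketch; assumes the stock /etc/slurm 
>     paths, so back the files up first):
>
>         sed -i 's/1080GTX/1080gtx/g; s/K20/k20/g' \
>             /etc/slurm/slurm.conf /etc/slurm/gres.conf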
>
>     Shut down all slurmd daemons and slurmctld, then start slurmctld, then 
>     start slurmd.
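>
>     With systemd, for instance, that ordering would look something like 
>     this (a sketch; assumes the usual service unit names):
>
>         systemctl stop slurmd        # on every compute node
>         systemctl stop slurmctld     # on the controller
>         systemctl start slurmctld    # controller first
>         systemctl start slurmd       # then the compute nodes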
>
>
>     I find it less confusing to have a global gres.conf file.  I haven't used
>     a device list (nvidia[0-1]), mainly because I want to specify the cores
>     to use for each GPU.
>
>     gres.conf would look something like...
>
>     NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia0 Cores=0
>     NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia1 Cores=1
>     NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=0
>     NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia1 Cores=1
>
>     which can be distributed to all nodes.
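>
>     Once the daemons come back up, a quick sanity check on one node might 
>     be (a sketch, using tiger11 as the example):
>
>         scontrol show node tiger11 | grep Gres
>
>     which should now report Gres=gpu:k20:2 instead of Gres=(null).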
>
>     -b
>
>
>     On 12/04/2018 09:55 AM, Lou Nicotra wrote:
>>     Brian, the specific node does not show any gres...
>>     root at panther02 slurm# scontrol show partition=tiger_1
>>     PartitionName=tiger_1
>>        AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>>        AllocNodes=ALL Default=YES QoS=N/A
>>        DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>>        MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>>        Nodes=tiger[01-22]
>>        PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
>>        OverTimeLimit=NONE PreemptMode=OFF
>>        State=UP TotalCPUs=1056 TotalNodes=22 SelectTypeParameters=NONE
>>        JobDefaults=(null)
>>        DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>>
>>     root at panther02 slurm#  scontrol show node=tiger11
>>     NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
>>        CPUAlloc=0 CPUTot=48 CPULoad=11.50
>>        AvailableFeatures=HyperThread
>>        ActiveFeatures=HyperThread
>>        Gres=(null)
>>        NodeAddr=X.X.X.X NodeHostName=tiger11 Version=18.08
>>        OS=Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
>>        RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
>>        State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>        Partitions=tiger_1,compute_1
>>        BootTime=2018-04-02T13:30:12 SlurmdStartTime=2018-12-03T16:13:22
>>        CfgTRES=cpu=48,mem=1M,billing=48
>>        AllocTRES=
>>        CapWatts=n/a
>>        CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>        ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>     So, something is not set up correctly... Could it be an 18.x bug?
>>
>>     Thanks.
>>
>>
>>     On Tue, Dec 4, 2018 at 9:31 AM Lou Nicotra <lnicotra at interactions.com> wrote:
>>
>>         Thanks Michael. I will try 17.x as I also could not see anything
>>         wrong with my settings... Will report back afterwards...
>>
>>         Lou
>>
>>         On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico
>>         <mdidomenico4 at gmail.com> wrote:
>>
>>             Unfortunately, someone smarter than me will have to help 
>>             further.  I'm not sure I see anything specifically wrong.  The 
>>             one thing I might try is backing the software down to a 17.x 
>>             release series.  I recently tried 18.x and had some issues.  I 
>>             can't say whether it'll be any different, but you might be 
>>             exposing an undiagnosed bug in the 18.x branch.
>>             On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra 
>>             <lnicotra at interactions.com> wrote:
>>             >
>>             > Made the change in the gres.conf file on the local server 
>>             and restarted slurmd, and slurmctld on the master... 
>>             Unfortunately, same error...
>>             >
>>             > Distributed the corrected gres.conf to all k20 servers, 
>>             restarted slurmd and slurmctld...  Still the same error...
>>             >
>>             > On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson 
>>             <bjohanso at psc.edu> wrote:
>>             >>
>>             >> Is that a lowercase k in k20 specified in the batch script and 
>>             nodename, and an uppercase K specified in gres.conf?
>>             >>
>>             >> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
>>             >>
>>             >> Hi All, I have recently set up a Slurm cluster with my servers 
>>             and I'm running into an issue while submitting GPU jobs. It has 
>>             something to do with gres configurations, but I just can't seem 
>>             to figure out what is wrong. Non-GPU jobs run fine.
>>             >>
>>             >> The error after submitting a batch job is as follows:
>>             >> sbatch: error: Batch job submission failed: Invalid Trackable 
>>             RESource (TRES) specification
>>             >>
>>             >> My batch job is as follows:
>>             >> #!/bin/bash
>>             >> #SBATCH --partition=tiger_1   # partition name
>>             >> #SBATCH --gres=gpu:k20:1
>>             >> #SBATCH --gres-flags=enforce-binding
>>             >> #SBATCH --time=0:20:00  # wall clock limit
>>             >> #SBATCH --output=gpu-%J.txt
>>             >> #SBATCH --account=lnicotra
>>             >> module load cuda
>>             >> python gpu1
>>             >>
>>             >> Where gpu1 is a GPU test script that runs correctly when 
>>             invoked directly via python. The tiger_1 partition has servers 
>>             with GPUs, a mix of 1080GTX and K20, as specified in slurm.conf.
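>>             >>
>>             >> The same request can presumably be reproduced outside of 
>>             >> sbatch with an interactive job (a sketch):
>>             >>
>>             >>     srun --partition=tiger_1 --gres=gpu:k20:1 nvidia-smi
>>             >>
>>             >> which should fail the same way if the GRES setup is at fault.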
>>             >>
>>             >> I have defined GRES resources in the slurm.conf file:
>>             >> # GPU GRES
>>             >> GresTypes=gpu
>>             >> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
>>             >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
>>             >>
>>             >> And have a local gres.conf on the servers containing GPUs...
>>             >> lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
>>             >> # GPU Definitions
>>             >> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
>>             >> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
>>             >>
>>             >> and a similar one for the 1080GTX
>>             >> # GPU Definitions
>>             >> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
>>             >> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
>>             >>
>>             >> The account manager seems to know about the GPUs...
>>             >> lnicotra at tiger11 ~# sacctmgr show tres
>>             >>     Type            Name     ID
>>             >> -------- --------------- ------
>>             >>      cpu                      1
>>             >>      mem                      2
>>             >>   energy                      3
>>             >>     node                      4
>>             >>  billing                      5
>>             >>       fs            disk      6
>>             >>     vmem                      7
>>             >>    pages                      8
>>             >>     gres             gpu   1001
>>             >>     gres         gpu:k20   1002
>>             >>     gres     gpu:1080gtx   1003
>>             >>
>>             >> Can anyone point out what I am missing?
>>             >>
>>             >> Thanks!
>>             >> Lou
