[slurm-users] GRES GPU issues

Brian W. Johanson bjohanso at psc.edu
Tue Dec 4 08:20:59 MST 2018


Do one more pass through your config files, making sure the GPU type names use consistent case everywhere:
s/1080GTX/1080gtx/ and s/K20/k20/

Then shut down all slurmd daemons and slurmctld, start slurmctld, and start slurmd on each node.
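For that pass, a small sketch like this (the /etc/slurm paths are the usual defaults, an assumption; adjust for your install) will flag any type names that still contain uppercase letters:

```shell
# check_gres_case: print any Type=/Gres= values that still contain an
# uppercase letter, so they can be normalized (e.g. K20 -> k20).
check_gres_case() {
    grep -nE '(Type|Gres)=[A-Za-z0-9:]*[A-Z]' "$@"
}

# Example, run on the controller and each node:
#   check_gres_case /etc/slurm/slurm.conf /etc/slurm/gres.conf
```

Anything it prints is a line to lowercase before restarting the daemons.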


I find it less confusing to have a global gres.conf file. I haven't used a device list 
(nvidia[0-1]), mainly because I want to specify the cores to use for each GPU.

gres.conf would look something like...

NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia0 Cores=0
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia1 Cores=1
NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=0
NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia1 Cores=1

which can be distributed to all nodes.
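For distributing that single file, a minimal sketch along these lines would do (the tiger[01-22] hostnames come from this thread; password-less ssh is an assumption, and a tool like pdsh/pdcp or your config management would do the same job):

```shell
# Expand the tiger[01-22] hostlist into individual names.
nodes() { seq -f 'tiger%02g' 1 22; }

# Copy the global gres.conf to every node and restart slurmd there.
# Guarded so nothing is pushed unless DISTRIBUTE=1 is set explicitly.
if [ "${DISTRIBUTE:-0}" = "1" ]; then
    for n in $(nodes); do
        scp /etc/slurm/gres.conf "$n:/etc/slurm/gres.conf"
        ssh "$n" systemctl restart slurmd
    done
fi
```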

-b


On 12/04/2018 09:55 AM, Lou Nicotra wrote:
> Brian, the specific node does not show any gres...
> root at panther02 slurm# scontrol show partition=tiger_1
> PartitionName=tiger_1
>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>    AllocNodes=ALL Default=YES QoS=N/A
>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>    MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>    Nodes=tiger[01-22]
>    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
>    OverTimeLimit=NONE PreemptMode=OFF
>    State=UP TotalCPUs=1056 TotalNodes=22 SelectTypeParameters=NONE
>    JobDefaults=(null)
>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
> root at panther02 slurm#  scontrol show node=tiger11
> NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
>    CPUAlloc=0 CPUTot=48 CPULoad=11.50
>    AvailableFeatures=HyperThread
>    ActiveFeatures=HyperThread
>    Gres=(null)
>    NodeAddr=X.X.X.X NodeHostName=tiger11 Version=18.08
>    OS=Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
>    RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
>    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=tiger_1,compute_1
>    BootTime=2018-04-02T13:30:12 SlurmdStartTime=2018-12-03T16:13:22
>    CfgTRES=cpu=48,mem=1M,billing=48
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> So, something is not set up correctly... Could it be an 18.x bug?
>
> Thanks.
>
>
> On Tue, Dec 4, 2018 at 9:31 AM Lou Nicotra <lnicotra at interactions.com 
> <mailto:lnicotra at interactions.com>> wrote:
>
>     Thanks Michael. I will try 17.x as I also could not see anything wrong
>     with my settings... Will report back afterwards...
>
>     Lou
>
>     On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico <mdidomenico4 at gmail.com
>     <mailto:mdidomenico4 at gmail.com>> wrote:
>
>         unfortunately, someone smarter than me will have to help further.  I'm
>         not sure I see anything specifically wrong.  The one thing I might try
>         is backing the software down to a 17.x release series.  I recently
>         tried 18.x and had some issues.  I can't say whether it'll be any
>         different, but you might be exposing an undiagnosed bug in the 18.x
>         branch.
>         On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra <lnicotra at interactions.com
>         <mailto:lnicotra at interactions.com>> wrote:
>         >
>         > Made the change in gres.conf on the local server and restarted
>         slurmd and slurmctld on the master... Unfortunately, same error...
>         >
>         > Distributed the corrected gres.conf to all k20 servers and restarted
>         slurmd and slurmctld...   Still the same error...
>         >
>         > On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson <bjohanso at psc.edu
>         <mailto:bjohanso at psc.edu>> wrote:
>         >>
>         >> Is that a lowercase k in k20 specified in the batch script and
>         NodeName, and an uppercase K specified in gres.conf?
>         >>
>         >> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
>         >>
>         >> Hi All, I have recently set up a Slurm cluster with my servers and
>         I'm running into an issue while submitting GPU jobs. It has something
>         to do with gres configurations, but I just can't seem to figure out
>         what is wrong. Non-GPU jobs run fine.
>         >>
>         >> The error after submitting a batch job is as follows:
>         >> sbatch: error: Batch job submission failed: Invalid Trackable
>         RESource (TRES) specification
>         >>
>         >> My batch job is as follows:
>         >> #!/bin/bash
>         >> #SBATCH --partition=tiger_1   # partition name
>         >> #SBATCH --gres=gpu:k20:1
>         >> #SBATCH --gres-flags=enforce-binding
>         >> #SBATCH --time=0:20:00  # wall clock limit
>         >> #SBATCH --output=gpu-%J.txt
>         >> #SBATCH --account=lnicotra
>         >> module load cuda
>         >> python gpu1
>         >>
>         >> Where gpu1 is a GPU test script that runs correctly when invoked
>         directly via python. The tiger_1 partition has servers with GPUs, a
>         mix of 1080GTX and K20 cards, as specified in slurm.conf.
>         >>
>         >> I have defined GRES resources in the slurm.conf file:
>         >> # GPU GRES
>         >> GresTypes=gpu
>         >> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
>         >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
>         >>
>         >> And have a local gres.conf on the servers containing GPUs...
>         >> lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
>         >> # GPU Definitions
>         >> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
>         >> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
>         >>
>         >> and a similar one for the 1080GTX
>         >> # GPU Definitions
>         >> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
>         >> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
>         >>
>         >> The account manager seems to know about the GPUs...
>         >> lnicotra at tiger11 ~# sacctmgr show tres
>         >>     Type            Name     ID
>         >> -------- --------------- ------
>         >>      cpu                      1
>         >>      mem                      2
>         >>   energy                      3
>         >>     node                      4
>         >>  billing                      5
>         >>       fs            disk      6
>         >>     vmem                      7
>         >>    pages                      8
>         >>     gres             gpu   1001
>         >>     gres         gpu:k20   1002
>         >>     gres     gpu:1080gtx   1003
>         >>
>         >> Can anyone point out what I am missing?
>         >>
>         >> Thanks!
>         >> Lou
>         >>
>         >>
>         >> --
>         >>
>         >> Lou Nicotra
>         >>
>         >> IT Systems Engineer - SLT
>         >>
>         >> Interactions LLC
>         >>
>         >> o:  908-673-1833
>         >>
>         >> m: 908-451-6983
>         >>
>         >> lnicotra at interactions.com <mailto:lnicotra at interactions.com>
>         >>
>         >> www.interactions.com <http://www.interactions.com>
>         >>
>         >>
>         *******************************************************************************
>         >>
>         >> This e-mail and any of its attachments may contain Interactions LLC
>         proprietary information, which is privileged, confidential, or subject
>         to copyright belonging to the Interactions LLC. This e-mail is
>         intended solely for the use of the individual or entity to which it is
>         addressed. If you are not the intended recipient of this e-mail, you
>         are hereby notified that any dissemination, distribution, copying, or
>         action taken in relation to the contents of and attachments to this
>         e-mail is strictly prohibited and may be unlawful. If you have
>         received this e-mail in error, please notify the sender immediately
>         and permanently delete the original and any copy of this e-mail and
>         any printout. Thank You.
>         >>
>         >>
>         *******************************************************************************
>         >>
>         >>
>         >
>         >
>
>
>
>
>
>
>
