[slurm-users] GRES GPU issues
Brian W. Johanson
bjohanso at psc.edu
Tue Dec 4 08:20:59 MST 2018
Do one more pass through, making sure s/1080GTX/1080gtx/ and s/K20/k20/.
Then shut down all slurmd and slurmctld, start slurmctld, and start slurmd.
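For example (just a sketch; it assumes the daemons run as systemd units and that pdsh is available for the compute nodes, so substitute however you normally manage them):

# on the compute nodes: stop slurmd
pdsh -w tiger[01-22] 'systemctl stop slurmd'
# on the controller: stop, then restart slurmctld so it rereads slurm.conf/gres.conf
systemctl stop slurmctld
systemctl start slurmctld
# bring slurmd back up on the nodes
pdsh -w tiger[01-22] 'systemctl start slurmd'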
I find it less confusing to have a global gres.conf file. I haven't used a device list
(nvidia[0-1]), mainly because I want to specify the cores to use for each GPU.
gres.conf would look something like...
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k80
File=/dev/nvidia0 Cores=0
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k80
File=/dev/nvidia1 Cores=1
NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=0
NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia1 Cores=1
which can be distributed to all nodes.
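After the restart, you can sanity-check that the controller actually picked the GRES up; for example (node and partition names taken from your output, adjust as needed):

scontrol show node tiger11 | grep -i gres
# should now show something like Gres=gpu:k20:2 rather than Gres=(null)
sinfo -p tiger_1 -o "%N %G"
# prints the GRES sinfo knows about for each node in the partition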
-b
On 12/04/2018 09:55 AM, Lou Nicotra wrote:
> Brian, the specific node does not show any gres...
> root at panther02 slurm# scontrol show partition=tiger_1
> PartitionName=tiger_1
> AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
> AllocNodes=ALL Default=YES QoS=N/A
> DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
> MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
> Nodes=tiger[01-22]
> PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
> OverTimeLimit=NONE PreemptMode=OFF
> State=UP TotalCPUs=1056 TotalNodes=22 SelectTypeParameters=NONE
> JobDefaults=(null)
> DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
> root at panther02 slurm# scontrol show node=tiger11
> NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
> CPUAlloc=0 CPUTot=48 CPULoad=11.50
> AvailableFeatures=HyperThread
> ActiveFeatures=HyperThread
> Gres=(null)
> NodeAddr=X.X.X.X NodeHostName=tiger11 Version=18.08
> OS=Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
> RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=tiger_1,compute_1
> BootTime=2018-04-02T13:30:12 SlurmdStartTime=2018-12-03T16:13:22
> CfgTRES=cpu=48,mem=1M,billing=48
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> So, something is not set up correctly... Could it be an 18.x bug?
>
> Thanks.
>
>
> On Tue, Dec 4, 2018 at 9:31 AM Lou Nicotra <lnicotra at interactions.com
> <mailto:lnicotra at interactions.com>> wrote:
>
> Thanks Michael. I will try 17.x as I also could not see anything wrong
> with my settings... Will report back afterwards...
>
> Lou
>
> On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico <mdidomenico4 at gmail.com
> <mailto:mdidomenico4 at gmail.com>> wrote:
>
> Unfortunately, someone smarter than me will have to help further. I'm
> not sure I see anything specifically wrong. The one thing I might try
> is backing the software down to a 17.x release series. I recently
> tried 18.x and had some issues. I can't say whether it'll be any
> different, but you might be exposing an undiagnosed bug in the 18.x
> branch.
> On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra <lnicotra at interactions.com
> <mailto:lnicotra at interactions.com>> wrote:
> >
> > Made the change in the gres.conf file on the local server and restarted
> > slurmd, and slurmctld on the master... Unfortunately, same error...
> >
> > Distributed the corrected gres.conf to all k20 servers and restarted slurmd
> > and slurmctld... Still the same error...
> >
> > On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson <bjohanso at psc.edu
> <mailto:bjohanso at psc.edu>> wrote:
> >>
> >> Is that a lowercase k in k20 specified in the batch script and
> >> NodeName, and an uppercase K specified in gres.conf?
> >>
> >> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
> >>
> >> Hi All, I have recently set up a Slurm cluster with my servers and
> >> I'm running into an issue while submitting GPU jobs. It has something
> >> to do with the GRES configuration, but I just can't seem to figure out
> >> what is wrong. Non-GPU jobs run fine.
> >>
> >> The error is as follows:
> >> sbatch: error: Batch job submission failed: Invalid Trackable
> RESource (TRES) specification after submitting a batch job.
> >>
> >> My batch job is as follows:
> >> #!/bin/bash
> >> #SBATCH --partition=tiger_1 # partition name
> >> #SBATCH --gres=gpu:k20:1
> >> #SBATCH --gres-flags=enforce-binding
> >> #SBATCH --time=0:20:00 # wall clock limit
> >> #SBATCH --output=gpu-%J.txt
> >> #SBATCH --account=lnicotra
> >> module load cuda
> >> python gpu1
> >>
> >> Where gpu1 is a GPU test script that runs correctly when invoked
> >> directly via python. The tiger_1 partition has servers with GPUs, a mix
> >> of 1080GTX and K20, as specified in slurm.conf.
> >>
> >> I have defined GRES resources in the slurm.conf file:
> >> # GPU GRES
> >> GresTypes=gpu
> >> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
> >>
> >> And have a local gres.conf on the servers containing GPUs...
> >> lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
> >> # GPU Definitions
> >> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20
> File=/dev/nvidia[0-1]
> >> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
> >>
> >> and a similar one for the 1080GTX
> >> # GPU Definitions
> >> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX
> File=/dev/nvidia[0-1]
> >> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
> >>
> >> The account manager seems to know about the GPUs...
> >> lnicotra at tiger11 ~# sacctmgr show tres
> >> Type Name ID
> >> -------- --------------- ------
> >> cpu 1
> >> mem 2
> >> energy 3
> >> node 4
> >> billing 5
> >> fs disk 6
> >> vmem 7
> >> pages 8
> >> gres gpu 1001
> >> gres gpu:k20 1002
> >> gres gpu:1080gtx 1003
> >>
> >> Can anyone point out what am I missing?
> >>
> >> Thanks!
> >> Lou
> >>
> >>
> >
> >
>
>
>
> --
>
> Lou Nicotra
>
> IT Systems Engineer - SLT
>
> Interactions LLC
>
> o: 908-673-1833
>
> m: 908-451-6983
>
> lnicotra at interactions.com
>
> www.interactions.com