[slurm-users] GRES GPU issues

Lou Nicotra lnicotra at interactions.com
Tue Dec 4 07:55:13 MST 2018


Brian, the specific node does not show any gres...
root at panther02 slurm# scontrol show partition=tiger_1
PartitionName=tiger_1
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
MaxCPUsPerNode=UNLIMITED
   Nodes=tiger[01-22]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=1056 TotalNodes=22 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

root at panther02 slurm#  scontrol show node=tiger11
NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUTot=48 CPULoad=11.50
   AvailableFeatures=HyperThread
   ActiveFeatures=HyperThread
   Gres=(null)
   NodeAddr=X.X.X.X NodeHostName=tiger11 Version=18.08
   OS=Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
   RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=tiger_1,compute_1
   BootTime=2018-04-02T13:30:12 SlurmdStartTime=2018-12-03T16:13:22
   CfgTRES=cpu=48,mem=1M,billing=48
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

So, something is not set up correctly... Could it be an 18.x bug?
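For reference, a correctly configured node should report something like Gres=gpu:k20:2 in that scontrol output rather than Gres=(null), and RealMemory=1 also suggests the NodeName definition from slurm.conf is not being applied on this node at all. As a sketch of what I believe the relevant entries should look like (the RealMemory value below is a placeholder, not our actual hardware, and the node list is taken from my earlier post):

```
# slurm.conf -- must be identical on the controller and on every node
GresTypes=gpu
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] CPUs=48 RealMemory=257000 Gres=gpu:k20:2

# gres.conf on each k20 node -- Type should match the case used in slurm.conf
Name=gpu Type=k20 File=/dev/nvidia[0-1]
```

followed by restarting slurmd on the nodes and running `scontrol reconfigure` (or restarting slurmctld) so the controller picks up the definitions.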

Thanks.


On Tue, Dec 4, 2018 at 9:31 AM Lou Nicotra <lnicotra at interactions.com>
wrote:

> Thanks Michael. I will try 17.x as I also could not see anything wrong
> with my settings... Will report back afterwards...
>
> Lou
>
> On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico <mdidomenico4 at gmail.com>
> wrote:
>
>> Unfortunately, someone smarter than I am will have to help further.  I'm
>> not sure I see anything specifically wrong.  The one thing I might try
>> is backing the software down to a 17.x release series.  I recently
>> tried 18.x and had some issues.  I can't say whether it'll be any
>> different, but you might be exposing an undiagnosed bug in the 18.x
>> branch.
>> On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra <lnicotra at interactions.com>
>> wrote:
>> >
>> > Made the change in the gres.conf file on the local server and restarted
>> slurmd and slurmctld on the master... Unfortunately, same error...
>> >
>> > Distributed the corrected gres.conf to all k20 servers, restarted slurmd
>> and slurmctld...   Still the same error...
>> >
>> > On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson <bjohanso at psc.edu>
>> wrote:
>> >>
>> >> Is that a lowercase k in k20 specified in the batch script and
>> nodename, and an uppercase K specified in gres.conf?
>> >>
>> >> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
>> >>
>> >> Hi All, I have recently set up a Slurm cluster with my servers and I'm
>> running into an issue while submitting GPU jobs. It has something to do
>> with gres configurations, but I just can't seem to figure out what is
>> wrong. Non-GPU jobs run fine.
>> >>
>> >> The error is as follows:
>> >> sbatch: error: Batch job submission failed: Invalid Trackable RESource
>> (TRES) specification, immediately after submitting a batch job.
>> >>
>> >> My batch job is as follows:
>> >> #!/bin/bash
>> >> #SBATCH --partition=tiger_1   # partition name
>> >> #SBATCH --gres=gpu:k20:1
>> >> #SBATCH --gres-flags=enforce-binding
>> >> #SBATCH --time=0:20:00  # wall clock limit
>> >> #SBATCH --output=gpu-%J.txt
>> >> #SBATCH --account=lnicotra
>> >> module load cuda
>> >> python gpu1
>> >>
>> >> Where gpu1 is a GPU test script that runs correctly when invoked via
>> python directly. The tiger_1 partition has servers with GPUs, a mix of
>> 1080GTX and K20, as specified in slurm.conf.
>> >>
>> >> I have defined GRES resources in the slurm.conf file:
>> >> # GPU GRES
>> >> GresTypes=gpu
>> >> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
>> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
>> >>
>> >> And have a local gres.conf on the servers containing GPUs...
>> >> lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
>> >> # GPU Definitions
>> >> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20
>> File=/dev/nvidia[0-1]
>> >> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
>> >>
>> >> and a similar one for the 1080GTX
>> >> # GPU Definitions
>> >> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX
>> File=/dev/nvidia[0-1]
>> >> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
>> >>
>> >> The account manager seems to know about the GPUs...
>> >> lnicotra at tiger11 ~# sacctmgr show tres
>> >>     Type            Name     ID
>> >> -------- --------------- ------
>> >>      cpu                      1
>> >>      mem                      2
>> >>   energy                      3
>> >>     node                      4
>> >>  billing                      5
>> >>       fs            disk      6
>> >>     vmem                      7
>> >>    pages                      8
>> >>     gres             gpu   1001
>> >>     gres         gpu:k20   1002
>> >>     gres     gpu:1080gtx   1003
>> >>
>> >> Can anyone point out what I am missing?
>> >>
>> >> Thanks!
>> >> Lou
>> >>
>> >>
>> >> --
>> >>
>> >> Lou Nicotra
>> >>
>> >> IT Systems Engineer - SLT
>> >>
>> >> Interactions LLC
>> >>
>> >> o:  908-673-1833
>> >>
>> >> m: 908-451-6983
>> >>
>> >> lnicotra at interactions.com
>> >>
>> >> www.interactions.com
>> >>
>> >>
>> *******************************************************************************
>> >>
>> >> This e-mail and any of its attachments may contain Interactions LLC
>> proprietary information, which is privileged, confidential, or subject to
>> copyright belonging to the Interactions LLC. This e-mail is intended solely
>> for the use of the individual or entity to which it is addressed. If you
>> are not the intended recipient of this e-mail, you are hereby notified that
>> any dissemination, distribution, copying, or action taken in relation to
>> the contents of and attachments to this e-mail is strictly prohibited and
>> may be unlawful. If you have received this e-mail in error, please notify
>> the sender immediately and permanently delete the original and any copy of
>> this e-mail and any printout. Thank You.
>> >>
>> >>
>> *******************************************************************************
>> >>
>> >>
>> >
>> >
>>
>>
>




