[slurm-users] GRES GPU issues
Lou Nicotra
lnicotra at interactions.com
Tue Dec 4 12:11:26 MST 2018
Brian, I used a single gres.conf file and distributed to all nodes...
Restarted all daemons, unfortunately scontrol still does not show any Gres
resources for GPU nodes...
Will try to roll back to a 17.x release. Is it basically a matter of removing
the 18.x rpms and installing the 17.x ones? Does the DB need to be downgraded also?
Thanks...
Lou
On Tue, Dec 4, 2018 at 10:25 AM Brian W. Johanson <bjohanso at psc.edu> wrote:
>
> Do one more pass through making sure
> s/1080GTX/1080gtx and s/K20/k20
>
> Shut down all slurmd and slurmctld daemons, then start slurmctld, then start slurmd.
>
>
> I find it less confusing to have a global gres.conf file. I haven't used
> a list (nvidia[0-1]), mainly because I want to specify the cores to
> use for each gpu.
>
> gres.conf would look something like...
>
> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia0 Cores=0
> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia1 Cores=1
> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=0
> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia1 Cores=1
>
> which can be distributed to all nodes.
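
A global gres.conf of the shape suggested above could be generated rather than typed by hand. The following is only a sketch: the node groups and type names are copied from the slurm.conf quoted later in this thread, while the helper name `gres_lines` and the one-core-per-GPU mapping (Cores equal to the device index, mirroring the example) are assumptions for illustration. Real core bindings depend on the node's topology.

```python
# Hypothetical helper, not part of Slurm. Node groups and GPU type names
# are taken from the slurm.conf quoted in this thread.
NODE_GROUPS = {
    "k20": "tiger[02-04,06-09,11-14,16-19,21-22]",
    "1080gtx": "tiger[01,05,10,15,20]",
}


def gres_lines(node_groups, gpus_per_node=2):
    """Build one gres.conf line per (node group, GPU device)."""
    lines = []
    for gpu_type, nodes in node_groups.items():
        for idx in range(gpus_per_node):
            # Assumption: device N is bound to core N, as in the example above.
            lines.append(
                f"NodeName={nodes} Name=gpu Type={gpu_type} "
                f"File=/dev/nvidia{idx} Cores={idx}"
            )
    return lines


if __name__ == "__main__":
    print("\n".join(gres_lines(NODE_GROUPS)))
```

Generating the file from one table keeps the type names in a single place, which makes a case mismatch between node groups less likely.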
>
> -b
>
>
> On 12/04/2018 09:55 AM, Lou Nicotra wrote:
>
> Brian, the specific node does not show any gres...
> root@panther02 slurm# scontrol show partition=tiger_1
> PartitionName=tiger_1
> AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
> AllocNodes=ALL Default=YES QoS=N/A
> DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
> Hidden=NO
> MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
> MaxCPUsPerNode=UNLIMITED
> Nodes=tiger[01-22]
> PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
> OverSubscribe=NO
> OverTimeLimit=NONE PreemptMode=OFF
> State=UP TotalCPUs=1056 TotalNodes=22 SelectTypeParameters=NONE
> JobDefaults=(null)
> DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
> root@panther02 slurm# scontrol show node=tiger11
> NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
> CPUAlloc=0 CPUTot=48 CPULoad=11.50
> AvailableFeatures=HyperThread
> ActiveFeatures=HyperThread
> Gres=(null)
> NodeAddr=X.X.X.X NodeHostName=tiger11 Version=18.08
> OS=Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
> RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=tiger_1,compute_1
> BootTime=2018-04-02T13:30:12 SlurmdStartTime=2018-12-03T16:13:22
> CfgTRES=cpu=48,mem=1M,billing=48
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> So, something is not set up correctly... Could it be an 18.x bug?
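
The check done by eye above (looking for a Gres= value in the node record) can be scripted when many nodes need checking. This is a minimal sketch that parses text in the `Key=Value` layout shown in the output above; the function names and the trimmed sample are illustrative assumptions, and in practice the text would come from running `scontrol show node` for each node.

```python
import re

# Trimmed copy of the node record quoted above.
SAMPLE = """NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUTot=48 CPULoad=11.50
   Gres=(null)
   RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
   CfgTRES=cpu=48,mem=1M,billing=48"""


def node_fields(text):
    """Split whitespace-separated Key=Value tokens into a dict.

    Values that contain spaces (such as the OS= field) are truncated at
    the first space; that is acceptable for the fields checked here.
    """
    return dict(re.findall(r"(\w+)=(\S*)", text))


def has_gpu_gres(text):
    """Return True if the node record reports a gpu entry in Gres."""
    gres = node_fields(text).get("Gres", "(null)")
    return gres != "(null)" and "gpu" in gres


if __name__ == "__main__":
    print("tiger11 reports GPU gres:", has_gpu_gres(SAMPLE))
```

For the record quoted above this reports no GPU gres, matching the observation that neither Gres nor CfgTRES mentions a gpu.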
>
> Thanks.
>
>
> On Tue, Dec 4, 2018 at 9:31 AM Lou Nicotra <lnicotra at interactions.com>
> wrote:
>
>> Thanks Michael. I will try 17.x as I also could not see anything wrong
>> with my settings... Will report back afterwards...
>>
>> Lou
>>
>> On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico <
>> mdidomenico4 at gmail.com> wrote:
>>
>>> Unfortunately, someone smarter than me will have to help further. I'm
>>> not sure I see anything specifically wrong. The one thing I might try
>>> is backing the software down to a 17.x release series. I recently
>>> tried 18.x and had some issues. I can't say whether it'll be any
>>> different, but you might be exposing an undiagnosed bug in the 18.x
>>> branch.
>>> On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra <lnicotra at interactions.com>
>>> wrote:
>>> >
>>> > Made the change in the local gres.conf file on the server and restarted
>>> > slurmd and slurmctld on the master... Unfortunately, same error...
>>> >
>>> > Distributed the corrected gres.conf to all k20 servers, restarted slurmd
>>> > and slurmctld... Still has the same error...
>>> >
>>> > On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson <bjohanso at psc.edu>
>>> wrote:
>>> >>
>>> >> Is that a lowercase k in k20 specified in the batch script and
>>> >> nodename and an uppercase K specified in gres.conf?
>>> >>
>>> >> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
>>> >>
>>> >> Hi All, I have recently set up a slurm cluster with my servers and
>>> >> I'm running into an issue while submitting GPU jobs. It has something
>>> >> to do with gres configurations, but I just can't seem to figure out
>>> >> what is wrong. Non-GPU jobs run fine.
>>> >>
>>> >> After submitting a batch job, the error is as follows:
>>> >> sbatch: error: Batch job submission failed: Invalid Trackable RESource (TRES) specification
>>> >>
>>> >> My batch job is as follows:
>>> >> #!/bin/bash
>>> >> #SBATCH --partition=tiger_1 # partition name
>>> >> #SBATCH --gres=gpu:k20:1
>>> >> #SBATCH --gres-flags=enforce-binding
>>> >> #SBATCH --time=0:20:00 # wall clock limit
>>> >> #SBATCH --output=gpu-%J.txt
>>> >> #SBATCH --account=lnicotra
>>> >> module load cuda
>>> >> python gpu1
>>> >>
>>> >> Where gpu1 is a GPU test script that runs correctly when invoked via
>>> >> python. The tiger_1 partition has servers with GPUs, with a mix of
>>> >> 1080GTX and K20 as specified in slurm.conf.
>>> >>
>>> >> I have defined GRES resources in the slurm.conf file:
>>> >> # GPU GRES
>>> >> GresTypes=gpu
>>> >> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
>>> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
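
One thing worth ruling out with node definitions like these is a gap or overlap between the two bracketed groups. The sketch below expands the simple `prefix[a-b,c]` expressions used in this thread and compares them against the partition's `tiger[01-22]`. The helper names are hypothetical, and the parser only handles a single bracket group, which is narrower than what Slurm itself accepts.

```python
import re


def expand_hostlist(expr):
    """Expand 'prefix[a-b,c,...]' into host names (single bracket group only)."""
    match = re.fullmatch(r"(\w+)\[([\d,-]+)\]", expr)
    if match is None:
        return [expr]  # plain host name, nothing to expand
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        first, _, last = part.partition("-")
        last = last or first
        width = len(first)  # keep zero padding, e.g. 02 -> tiger02
        hosts.extend(
            f"{prefix}{n:0{width}d}" for n in range(int(first), int(last) + 1)
        )
    return hosts


def check_partition_coverage(partition_expr, group_exprs):
    """Return (hosts missing from every group, hosts in more than one group)."""
    partition = set(expand_hostlist(partition_expr))
    seen, overlap = set(), set()
    for expr in group_exprs:
        hosts = set(expand_hostlist(expr))
        overlap |= seen & hosts
        seen |= hosts
    return sorted(partition - seen), sorted(overlap)


if __name__ == "__main__":
    groups = [
        "tiger[01,05,10,15,20]",
        "tiger[02-04,06-09,11-14,16-19,21-22]",
    ]
    print(check_partition_coverage("tiger[01-22]", groups))
```

For the two groups in this thread the check comes back clean: 5 nodes plus 17 nodes cover tiger01 through tiger22 with no overlap.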
>>> >>
>>> >> And have a local gres.conf on the servers containing GPUs...
>>> >> lnicotra@tiger11 ~# cat /etc/slurm/gres.conf
>>> >> # GPU Definitions
>>> >> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
>>> >> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
>>> >>
>>> >> and a similar one for the 1080GTX
>>> >> # GPU Definitions
>>> >> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
>>> >> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
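
The slurm.conf above spells the types k20 and 1080gtx, while these gres.conf files spell them K20 and 1080GTX. Whether that difference matters is exactly what is asked elsewhere in this thread; assuming type names may be compared case-sensitively, a quick consistency check could look like the sketch below. The function names are hypothetical and the config text is pasted from the post.

```python
import re

SLURM_CONF = """GresTypes=gpu
NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2"""

GRES_CONF_K20 = """# GPU Definitions
Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1"""


def slurm_conf_types(text):
    """Collect GPU type names from Gres=gpu:<type>:<count> entries."""
    return set(re.findall(r"Gres=gpu:(\w+):\d+", text))


def gres_conf_types(text):
    """Collect Type= values from gres.conf, skipping comment lines."""
    types = set()
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue
        types.update(re.findall(r"\bType=(\S+)", line))
    return types


def case_mismatches(slurm_text, gres_text):
    """Return (gres.conf name, slurm.conf name) pairs differing only in case."""
    declared = slurm_conf_types(slurm_text)
    by_lower = {name.lower(): name for name in declared}
    return sorted(
        (name, by_lower[name.lower()])
        for name in gres_conf_types(gres_text)
        if name not in declared and name.lower() in by_lower
    )


if __name__ == "__main__":
    print(case_mismatches(SLURM_CONF, GRES_CONF_K20))
```

Run against the files as posted, this flags K20 versus k20; after lowercasing the Type= value in gres.conf the list is empty.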
>>> >>
>>> >> The account manager seems to know about the GPUs...
>>> >> lnicotra@tiger11 ~# sacctmgr show tres
>>> >>     Type            Name     ID
>>> >> -------- --------------- ------
>>> >>      cpu                      1
>>> >>      mem                      2
>>> >>   energy                      3
>>> >>     node                      4
>>> >>  billing                      5
>>> >>       fs            disk      6
>>> >>     vmem                      7
>>> >>    pages                      8
>>> >>     gres             gpu   1001
>>> >>     gres         gpu:k20   1002
>>> >>     gres     gpu:1080gtx   1003
>>> >>
>>> >> Can anyone point out what I am missing?
>>> >>
>>> >> Thanks!
>>> >> Lou
>>> >>
>>> >>
>>> >> --
>>> >>
>>> >> Lou Nicotra
>>> >>
>>> >> IT Systems Engineer - SLT
>>> >>
>>> >> Interactions LLC
>>> >>
>>> >> o: 908-673-1833
>>> >>
>>> >> m: 908-451-6983
>>> >>
>>> >> lnicotra at interactions.com
>>> >>
>>> >> www.interactions.com
>>> >>
>>> >>
>>> *******************************************************************************
>>> >>
>>> >> This e-mail and any of its attachments may contain Interactions LLC
>>> proprietary information, which is privileged, confidential, or subject to
>>> copyright belonging to the Interactions LLC. This e-mail is intended solely
>>> for the use of the individual or entity to which it is addressed. If you
>>> are not the intended recipient of this e-mail, you are hereby notified that
>>> any dissemination, distribution, copying, or action taken in relation to
>>> the contents of and attachments to this e-mail is strictly prohibited and
>>> may be unlawful. If you have received this e-mail in error, please notify
>>> the sender immediately and permanently delete the original and any copy of
>>> this e-mail and any printout. Thank You.
>>> >>
>>> >>
>>> *******************************************************************************
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>
>>
>
>
--
Lou Nicotra
IT Systems Engineer - SLT
Interactions LLC
o: 908-673-1833
m: 908-451-6983
lnicotra at interactions.com
www.interactions.com