[slurm-users] GRES GPU issues
Brian W. Johanson
bjohanso at psc.edu
Tue Dec 4 16:36:12 MST 2018
The only thing to suggest, once again, is increasing the logging of both slurmctld and
slurmd.
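A minimal sketch of what I mean, assuming the stock config file locations and
that you can restart the daemons (adjust names/paths for your install):

    # slurm.conf, on the controller and the GPU nodes
    SlurmctldDebug=debug2
    SlurmdDebug=debug2
    DebugFlags=Gres

    # or raise the controller's level at runtime without a restart
    scontrol setdebug debug2
    scontrol setdebugflags +Gres

With DebugFlags=Gres the gres.conf parsing shows up in slurmd's log when it
starts, which is usually where these mismatches become obvious.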
As for downgrading, I wouldn't suggest running a 17.x slurmdbd against a db
built with 18.x. I imagine there are enough changes there to cause trouble.
I don't imagine downgrading will fix your issue. If you are running 18.08.0, the
most recent release is 18.08.3; the NEWS file packed in the tarballs lists the fixes
in each version, and I don't see any that would fit your case.
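For reference, the per-release entries can be read straight out of the tarball
(filename assumed, adjust to whichever version you download):

    tar -xjOf slurm-18.08.3.tar.bz2 slurm-18.08.3/NEWS | less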
On 12/04/2018 02:11 PM, Lou Nicotra wrote:
> Brian, I used a single gres.conf file and distributed it to all nodes...
> Restarted all daemons; unfortunately, scontrol still does not show any Gres
> resources for the GPU nodes...
>
> Will try to roll back to a 17.x release. Is it basically a matter of removing
> the 18.x RPMs and installing the 17.x ones? Does the DB need to be downgraded also?
>
> Thanks...
> Lou
>
> On Tue, Dec 4, 2018 at 10:25 AM Brian W. Johanson <bjohanso at psc.edu> wrote:
>
>
> Do one more pass through making sure
> s/1080GTX/1080gtx and s/K20/k20
>
> Shut down all slurmd and slurmctld, then start slurmctld, then start slurmd.
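> Something like the following, assuming systemd units and a pdsh-style fanout
> to the compute nodes (substitute your own node list and remote-exec tool):
>
>     pdsh -w tiger[01-22] systemctl stop slurmd
>     systemctl stop slurmctld               # on the controller
>     systemctl start slurmctld
>     pdsh -w tiger[01-22] systemctl start slurmd
>
> so the controller has the new config loaded before the slurmds re-register.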
>
>
> I find it less confusing to have a global gres.conf file. I haven't used
> a list (nvidia[0-1]), mainly because I want to specify the cores to use
> for each GPU.
>
> gres.conf would look something like...
>
> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k80
> File=/dev/nvidia0 Cores=0
> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k80
> File=/dev/nvidia1 Cores=1
> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=0
> NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia1 Cores=1
>
> which can be distributed to all nodes.
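> After restarting, a couple of quick sanity checks on one of the GPU nodes
> (slurmd's -G option is assumed to be present in your version):
>
>     slurmd -G                                  # show the GRES slurmd parsed
>     scontrol show node tiger11 | grep -i gres
>
> If Gres= still comes back (null), the definition never made it into slurmctld
> and the logs should say why.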
>
> -b
>
>
> On 12/04/2018 09:55 AM, Lou Nicotra wrote:
>> Brian, the specific node does not show any gres...
>> root at panther02 slurm# scontrol show partition=tiger_1
>> PartitionName=tiger_1
>> AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>> AllocNodes=ALL Default=YES QoS=N/A
>> DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>> MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
>> MaxCPUsPerNode=UNLIMITED
>> Nodes=tiger[01-22]
>> PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
>> OverTimeLimit=NONE PreemptMode=OFF
>> State=UP TotalCPUs=1056 TotalNodes=22 SelectTypeParameters=NONE
>> JobDefaults=(null)
>> DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>>
>> root at panther02 slurm# scontrol show node=tiger11
>> NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
>> CPUAlloc=0 CPUTot=48 CPULoad=11.50
>> AvailableFeatures=HyperThread
>> ActiveFeatures=HyperThread
>> Gres=(null)
>> NodeAddr=X.X.X.X NodeHostName=tiger11 Version=18.08
>> OS=Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
>> RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
>> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=tiger_1,compute_1
>> BootTime=2018-04-02T13:30:12 SlurmdStartTime=2018-12-03T16:13:22
>> CfgTRES=cpu=48,mem=1M,billing=48
>> AllocTRES=
>> CapWatts=n/a
>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>> So, something is not set up correctly... Could it be an 18.x bug?
>>
>> Thanks.
>>
>>
>> On Tue, Dec 4, 2018 at 9:31 AM Lou Nicotra <lnicotra at interactions.com> wrote:
>>
>> Thanks Michael. I will try 17.x as I also could not see anything
>> wrong with my settings... Will report back afterwards...
>>
>> Lou
>>
>> On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico <mdidomenico4 at gmail.com> wrote:
>>
>> Unfortunately, someone smarter than me will have to help further. I'm
>> not sure I see anything specifically wrong. The one thing I might try
>> is backing the software down to a 17.x release series. I recently
>> tried 18.x and had some issues. I can't say whether it'll be any
>> different, but you might be exposing an undiagnosed bug in the 18.x
>> branch.
>> On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra <lnicotra at interactions.com> wrote:
>> >
>> > Made the change in the gres.conf file on the local server and
>> restarted slurmd, and slurmctld on the master... Unfortunately, same
>> error...
>> >
>> > Distributed the corrected gres.conf to all k20 servers and restarted
>> slurmd and slurmctld... Still getting the same error...
>> >
>> > On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson <bjohanso at psc.edu> wrote:
>> >>
>> >> Is that a lowercase k in k20 specified in the batch script and
>> NodeName, and an uppercase K specified in gres.conf?
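>> >> A quick way to compare them side by side (default config paths assumed):
>> >>
>> >>     grep -inE 'k20|1080gtx' /etc/slurm/slurm.conf /etc/slurm/gres.conf
>> >>
>> >> The Type= strings have to line up with what the job requests, so a K20 in
>> >> one place and a k20 in another is enough to break the match.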
>> >>
>> >> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
>> >>
>> >> Hi All, I have recently set up a Slurm cluster with my servers
>> and I'm running into an issue while submitting GPU jobs. It has
>> something to do with gres configurations, but I just can't seem
>> to figure out what is wrong. Non-GPU jobs run fine.
>> >>
>> >> The error after submitting a batch job is as follows:
>> >> sbatch: error: Batch job submission failed: Invalid Trackable
>> RESource (TRES) specification
>> >>
>> >> My batch job is as follows:
>> >> #!/bin/bash
>> >> #SBATCH --partition=tiger_1 # partition name
>> >> #SBATCH --gres=gpu:k20:1
>> >> #SBATCH --gres-flags=enforce-binding
>> >> #SBATCH --time=0:20:00 # wall clock limit
>> >> #SBATCH --output=gpu-%J.txt
>> >> #SBATCH --account=lnicotra
>> >> module load cuda
>> >> python gpu1
>> >>
>> >> Where gpu1 is a GPU test script that runs correctly when
>> invoked via python. The tiger_1 partition has servers with GPUs, with
>> a mix of 1080GTX and K20 cards, as specified in slurm.conf.
>> >>
>> >> I have defined GRES resources in the slurm.conf file:
>> >> # GPU GRES
>> >> GresTypes=gpu
>> >> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
>> >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
>> >>
>> >> And have a local gres.conf on the servers containing GPUs...
>> >> lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
>> >> # GPU Definitions
>> >> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu
>> Type=K20 File=/dev/nvidia[0-1]
>> >> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
>> >>
>> >> and a similar one for the 1080GTX
>> >> # GPU Definitions
>> >> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX
>> File=/dev/nvidia[0-1]
>> >> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
>> >>
>> >> The account manager seems to know about the GPUs...
>> >> lnicotra at tiger11 ~# sacctmgr show tres
>> >>     Type            Name     ID
>> >> -------- --------------- ------
>> >>      cpu                      1
>> >>      mem                      2
>> >>   energy                      3
>> >>     node                      4
>> >>  billing                      5
>> >>       fs            disk      6
>> >>     vmem                      7
>> >>    pages                      8
>> >>     gres             gpu   1001
>> >>     gres         gpu:k20   1002
>> >>     gres     gpu:1080gtx   1003
>> >>
>> >> Can anyone point out what I am missing?
>> >>
>> >> Thanks!
>> >> Lou
>> >>
>> >>