[slurm-users] GRES GPU issues

Michael Di Domenico mdidomenico4 at gmail.com
Mon Dec 3 12:41:14 MST 2018


Are you willing to paste an `scontrol show config` from the machine
having trouble?
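
For reference, a quick way to pull just the GRES/TRES-related lines out of that output (assuming a standard shell on the node) would be:

    scontrol show config | grep -iE 'gres|tres'
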
On Mon, Dec 3, 2018 at 12:10 PM Lou Nicotra <lnicotra at interactions.com> wrote:
>
> I'm running  slurmd version 18.08.0...
>
> It seems that the system recognizes the GPUs after a slurmd restart. I turned debug up to 5, restarted, and then submitted a job. Nothing gets logged to the log file on the local server...
> [2018-12-03T11:55:18.442] Slurmd shutdown completing
> [2018-12-03T11:55:18.484] debug:  Log file re-opened
> [2018-12-03T11:55:18.485] debug:  CPUs:48 Boards:1 Sockets:2 CoresPerSocket:12 ThreadsPerCore:2
> [2018-12-03T11:55:18.485] Message aggregation disabled
> [2018-12-03T11:55:18.486] debug:  CPUs:48 Boards:1 Sockets:2 CoresPerSocket:12 ThreadsPerCore:2
> [2018-12-03T11:55:18.486] debug:  init: Gres GPU plugin loaded
> [2018-12-03T11:55:18.486] Gres Name=gpu Type=K20 Count=2
> [2018-12-03T11:55:18.487] gpu device number 0(/dev/nvidia0):c 195:0 rwm
> [2018-12-03T11:55:18.487] gpu device number 1(/dev/nvidia1):c 195:1 rwm
> [2018-12-03T11:55:18.487] topology NONE plugin loaded
> [2018-12-03T11:55:18.487] route default plugin loaded
> [2018-12-03T11:55:18.530] debug:  Resource spec: No specialized cores configured by default on this node
> [2018-12-03T11:55:18.530] debug:  Resource spec: Reserved system memory limit not configured for this node
> [2018-12-03T11:55:18.530] debug:  task NONE plugin loaded
> [2018-12-03T11:55:18.530] debug:  Munge authentication plugin loaded
> [2018-12-03T11:55:18.530] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
> [2018-12-03T11:55:18.530] Munge cryptographic signature plugin loaded
> [2018-12-03T11:55:18.532] slurmd version 18.08.0 started
> [2018-12-03T11:55:18.532] debug:  Job accounting gather LINUX plugin loaded
> [2018-12-03T11:55:18.532] debug:  job_container none plugin loaded
> [2018-12-03T11:55:18.532] debug:  switch NONE plugin loaded
> [2018-12-03T11:55:18.532] slurmd started on Mon, 03 Dec 2018 11:55:18 -0500
> [2018-12-03T11:55:18.533] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2 Memory=386757 TmpDisk=4758 Uptime=21165906 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
> [2018-12-03T11:55:18.533] debug:  AcctGatherEnergy NONE plugin loaded
> [2018-12-03T11:55:18.533] debug:  AcctGatherProfile NONE plugin loaded
> [2018-12-03T11:55:18.533] debug:  AcctGatherInterconnect NONE plugin loaded
> [2018-12-03T11:55:18.533] debug:  AcctGatherFilesystem NONE plugin loaded
> root at tiger11 slurm#
>
> So, I turned debug up to 5 in slurmctld on the master server, and after I submitted my job, it shows...
> [2018-12-03T12:02:10.355] _job_create: account 'lnicotra' has no association for user 1498 using default account 'slt'
> [2018-12-03T12:02:10.356] _slurm_rpc_submit_batch_job: Invalid Trackable RESource (TRES) specification
>
> So, we use LDAP for authentication and my UID is 1498, but I created a user in Slurm using my login name. The default account for all users is "slt". Is this the cause of my problems?
> root at panther02 slurm# getent passwd lnicotra
> lnicotra:*:1498:1152:Lou Nicotra:/home/lnicotra:/bin/bash
>
> If so, how is this resolved, given that we use multiple servers and there are no local accounts on them?
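>
> For reference, a sketch of the sacctmgr commands I believe would inspect and create such an association, using the names from above (whether this is actually the right fix is exactly my question):
>
>   # show the existing associations for my user
>   sacctmgr show assoc where user=lnicotra format=cluster,account,user
>   # create the account and associate my user with it
>   sacctmgr add account lnicotra
>   sacctmgr add user lnicotra account=lnicotra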
>
> Thanks!
> Lou
>
>
>
> On Mon, Dec 3, 2018 at 11:36 AM Michael Di Domenico <mdidomenico4 at gmail.com> wrote:
>>
>> Do you get anything additional in the slurm logs? Have you tried
>> adding gres to the debugflags? What version of slurm are you running?
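>>
>> For the debug flags, something along these lines should do it at runtime (the slurm.conf equivalent would be DebugFlags=Gres plus an scontrol reconfigure):
>>
>>   scontrol setdebugflags +gres
>>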
>> On Mon, Dec 3, 2018 at 9:18 AM Lou Nicotra <lnicotra at interactions.com> wrote:
>> >
>> > Hi All, I have recently set up a Slurm cluster with my servers and I'm running into an issue while submitting GPU jobs. It has something to do with gres configurations, but I just can't seem to figure out what is wrong. Non-GPU jobs run fine.
>> >
>> > The error, after submitting the batch job, is as follows:
>> > sbatch: error: Batch job submission failed: Invalid Trackable RESource (TRES) specification
>> >
>> > My batch job is as follows:
>> > #!/bin/bash
>> > #SBATCH --partition=tiger_1   # partition name
>> > #SBATCH --gres=gpu:k20:1
>> > #SBATCH --gres-flags=enforce-binding
>> > #SBATCH --time=0:20:00  # wall clock limit
>> > #SBATCH --output=gpu-%J.txt
>> > #SBATCH --account=lnicotra
>> > module load cuda
>> > python gpu1
>> >
>> > Where gpu1 is a GPU test script that runs correctly when invoked directly via python. The tiger_1 partition has servers with GPUs, a mix of 1080GTX and K20, as specified in slurm.conf.
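>> >
>> > For reference, the same gres request expressed as a one-off interactive test would be something like the following (nvidia-smi here is just a stand-in for the real workload):
>> >
>> >   srun --partition=tiger_1 --gres=gpu:k20:1 nvidia-smi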
>> >
>> > I have defined GRES resources in the slurm.conf file:
>> > # GPU GRES
>> > GresTypes=gpu
>> > NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
>> > NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
>> >
>> > And I have a local gres.conf on the servers containing GPUs...
>> > lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
>> > # GPU Definitions
>> > # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
>> > Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
>> >
>> > and a similar one for the 1080GTX
>> > # GPU Definitions
>> > # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
>> > Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
>> >
>> > The account manager seems to know about the GPUs...
>> > lnicotra at tiger11 ~# sacctmgr show tres
>> >     Type            Name     ID
>> > -------- --------------- ------
>> >      cpu                      1
>> >      mem                      2
>> >   energy                      3
>> >     node                      4
>> >  billing                      5
>> >       fs            disk      6
>> >     vmem                      7
>> >    pages                      8
>> >     gres             gpu   1001
>> >     gres         gpu:k20   1002
>> >     gres     gpu:1080gtx   1003
>> >
>> > Can anyone point out what I am missing?
>> >
>> > Thanks!
>> > Lou
>> >
>> >
>> > --
>> >
>> > Lou Nicotra
>> >
>> > IT Systems Engineer - SLT
>> >
>> > Interactions LLC
>> >
>> > o:  908-673-1833
>> >
>> > m: 908-451-6983
>> >
>> > lnicotra at interactions.com
>> >
>> > www.interactions.com
>> >
>>
>
>


