[slurm-users] GRES GPU issues
Lou Nicotra
lnicotra at interactions.com
Mon Dec 3 10:07:42 MST 2018
I'm running slurmd version 18.08.0...
It seems that the system recognizes the GPUs after a slurmd restart. I
turned debug up to 5, restarted, and then submitted the job. Nothing gets
logged to the log file on the local server...
[2018-12-03T11:55:18.442] Slurmd shutdown completing
[2018-12-03T11:55:18.484] debug: Log file re-opened
[2018-12-03T11:55:18.485] debug: CPUs:48 Boards:1 Sockets:2 CoresPerSocket:12 ThreadsPerCore:2
[2018-12-03T11:55:18.485] Message aggregation disabled
[2018-12-03T11:55:18.486] debug: CPUs:48 Boards:1 Sockets:2 CoresPerSocket:12 ThreadsPerCore:2
[2018-12-03T11:55:18.486] debug: init: Gres GPU plugin loaded
[2018-12-03T11:55:18.486] Gres Name=gpu Type=K20 Count=2
[2018-12-03T11:55:18.487] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2018-12-03T11:55:18.487] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2018-12-03T11:55:18.487] topology NONE plugin loaded
[2018-12-03T11:55:18.487] route default plugin loaded
[2018-12-03T11:55:18.530] debug: Resource spec: No specialized cores configured by default on this node
[2018-12-03T11:55:18.530] debug: Resource spec: Reserved system memory limit not configured for this node
[2018-12-03T11:55:18.530] debug: task NONE plugin loaded
[2018-12-03T11:55:18.530] debug: Munge authentication plugin loaded
[2018-12-03T11:55:18.530] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2018-12-03T11:55:18.530] Munge cryptographic signature plugin loaded
[2018-12-03T11:55:18.532] slurmd version 18.08.0 started
[2018-12-03T11:55:18.532] debug: Job accounting gather LINUX plugin loaded
[2018-12-03T11:55:18.532] debug: job_container none plugin loaded
[2018-12-03T11:55:18.532] debug: switch NONE plugin loaded
[2018-12-03T11:55:18.532] slurmd started on Mon, 03 Dec 2018 11:55:18 -0500
[2018-12-03T11:55:18.533] CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2 Memory=386757 TmpDisk=4758 Uptime=21165906 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2018-12-03T11:55:18.533] debug: AcctGatherEnergy NONE plugin loaded
[2018-12-03T11:55:18.533] debug: AcctGatherProfile NONE plugin loaded
[2018-12-03T11:55:18.533] debug: AcctGatherInterconnect NONE plugin loaded
[2018-12-03T11:55:18.533] debug: AcctGatherFilesystem NONE plugin loaded
root at tiger11 slurm#
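In case it helps: I have not yet tried adding gres to the debug flags as
Michael suggested below. If I'm reading the docs right it would look roughly
like this -- treat it as a sketch, since I have not verified the flag name on
this cluster:

    DebugFlags=Gres                  # in slurm.conf, then scontrol reconfigure
    scontrol setdebugflags +Gres     # or toggle at runtime on the controller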
So I turned debug up to 5 in slurmctld on the master server, and after I
submitted my job, it shows...
[2018-12-03T12:02:10.355] _job_create: account 'lnicotra' has no association for user 1498 using default account 'slt'
[2018-12-03T12:02:10.356] _slurm_rpc_submit_batch_job: Invalid Trackable RESource (TRES) specification
We use LDAP for authentication and my UID is 1498, but I created the user
in Slurm using my login name. The default account for all users is "slt".
Is this the cause of my problems?
root at panther02 slurm# getent passwd lnicotra
lnicotra:*:1498:1152:Lou Nicotra:/home/lnicotra:/bin/bash
If so, how is this resolved, given that we use multiple servers and there
are no local accounts on them?
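For reference, this is roughly how I would check for and, if needed, add the
association -- just a sketch on my part, since I'm not sure yet that a missing
association is actually what triggers the TRES error:

    sacctmgr show assoc where user=lnicotra format=cluster,account,user,partition
    sacctmgr add user lnicotra account=slt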
Thanks!
Lou
On Mon, Dec 3, 2018 at 11:36 AM Michael Di Domenico <mdidomenico4 at gmail.com>
wrote:
> do you get anything additional in the slurm logs? have you tried
> adding gres to the debugflags? what version of slurm are you running?
> On Mon, Dec 3, 2018 at 9:18 AM Lou Nicotra <lnicotra at interactions.com>
> wrote:
> >
> > Hi All, I have recently set up a Slurm cluster with my servers and I'm
> running into an issue while submitting GPU jobs. It has something to do
> with the gres configuration, but I just can't seem to figure out what is
> wrong. Non-GPU jobs run fine.
> >
> > The error after submitting a batch job is as follows:
> > sbatch: error: Batch job submission failed: Invalid Trackable RESource
> (TRES) specification
> >
> > My batch job is as follows:
> > #!/bin/bash
> > #SBATCH --partition=tiger_1 # partition name
> > #SBATCH --gres=gpu:k20:1
> > #SBATCH --gres-flags=enforce-binding
> > #SBATCH --time=0:20:00 # wall clock limit
> > #SBATCH --output=gpu-%J.txt
> > #SBATCH --account=lnicotra
> > module load cuda
> > python gpu1
> >
> > Where gpu1 is a GPU test script that runs correctly when invoked directly
> via python. The tiger_1 partition has servers with GPUs, a mix of 1080GTX
> and K20, as specified in slurm.conf.
> >
> > I have defined GRES resources in the slurm.conf file:
> > # GPU GRES
> > GresTypes=gpu
> > NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
> > NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
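> >
> > As a sanity check I have also been comparing what the controller thinks a
> > node has against these slurm.conf lines, e.g. (my own check, not something
> > from the docs for this particular error), expecting to see Gres=gpu:k20:2:
> >
> > scontrol show node tiger11 | grep -i gres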
> >
> > And have a local gres.conf on the servers containing GPUs...
> > lnicotra at tiger11 ~# cat /etc/slurm/gres.conf
> > # GPU Definitions
> > # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=K20 File=/dev/nvidia[0-1]
> > Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
> >
> > and a similar one for the 1080GTX
> > # GPU Definitions
> > # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX File=/dev/nvidia[0-1]
> > Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
> >
> > The account manager seems to know about the GPUs...
> > lnicotra at tiger11 ~# sacctmgr show tres
> >     Type            Name     ID
> > -------- --------------- ------
> >      cpu                      1
> >      mem                      2
> >   energy                      3
> >     node                      4
> >  billing                      5
> >       fs            disk      6
> >     vmem                      7
> >    pages                      8
> >     gres             gpu   1001
> >     gres         gpu:k20   1002
> >     gres     gpu:1080gtx   1003
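> >
> > I can also try the same request interactively to take the batch script out
> > of the picture (assuming srun accepts the same gres syntax; nvidia-smi here
> > is just a stand-in test command):
> >
> > srun --partition=tiger_1 --gres=gpu:k20:1 nvidia-smi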
> >
> > Can anyone point out what I am missing?
> >
> > Thanks!
> > Lou
> >
> >
> > --
> > Lou Nicotra
> > IT Systems Engineer - SLT
> > Interactions LLC
> > o: 908-673-1833
> > m: 908-451-6983
> > lnicotra at interactions.com
> > www.interactions.com
>
>
--
Lou Nicotra
IT Systems Engineer - SLT
Interactions LLC
o: 908-673-1833
m: 908-451-6983
lnicotra at interactions.com
www.interactions.com