[slurm-users] GPU gres error for 1 of 3 GPU types
Sean McGrath
smcgrat at tchpc.tcd.ie
Mon Jan 14 15:19:56 UTC 2019
We managed to resolve this as follows:
gres.conf changes:
-NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia0
-NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia1
+NodeName=boole-n024 Name=gpu Type=rtx2080ti File=/dev/nvidia0
+NodeName=boole-n024 Name=gpu Type=rtx2080ti File=/dev/nvidia1
slurm.conf changes:
-PartitionName=long Nodes=boole-n[001-006],boole-n017,boole-n[018-020],boole-n[025] MaxTime=10-00:00:00 State=UP Sha
+PartitionName=long Nodes=boole-n[001-006],boole-n[016,017],boole-n[018-020],boole-n[025] MaxTime=10-00:00:00 State=
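For completeness, the matching NodeName=boole-n024 line in slurm.conf needs the
same rename (Gres=gpu:rtx2080ti:2), and the daemons have to pick the change up;
something along these lines should do it (commands illustrative, adjust for
your own setup):

systemctl restart slurmctld    # on the controller
systemctl restart slurmd      # on boole-n024

As far as I know, gres.conf changes need a slurmd restart; an scontrol
reconfigure alone isn't enough.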
So it seems the problem was that the gres type name started with a number
instead of a letter.
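For anyone hitting the same thing, a quick sanity check after the rename would
be (illustrative, run from the head node):

scontrol show node boole-n024 | grep -i gres
salloc --gres=gpu:rtx2080ti:1 -N 1

The first should report Gres=gpu:rtx2080ti:2 and the second should now
allocate instead of failing with the invalid gres error.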
Thanks
Sean
On Fri, Jan 11, 2019 at 04:51:38PM +0000, Sean McGrath wrote:
> I forgot to mention before that we are running slurm version 18.08.3.
>
> On Fri, Jan 11, 2019 at 10:35:09AM -0500, Paul Edmon wrote:
>
> > I'm pretty sure that gres.conf has to be on all the nodes as well
> > and not just the master.
>
> Thanks Paul. We deploy the same slurm configuration, including the gres.conf
> file, cluster wide. I've double checked the node in question and it has the
> correct gres.conf.
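> (Checked along these lines, comparing against the controller's copy:
>
> ssh boole-n024 md5sum /etc/slurm/gres.conf
> md5sum /etc/slurm/gres.conf
>
> and the checksums matched.)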
>
> Best
>
> Sean
>
> >
> > -Paul Edmon-
> >
> > On 1/11/19 5:21 AM, Sean McGrath wrote:
> > >Hi everyone,
> > >
> > >Your help with this would be much appreciated.
> > >
> > >We have a cluster with 3 types of GPU configured in gres. Users can successfully
> > >request 2 of the GPU types, but requesting the third returns an error.
> > >
> > >Here is the successful salloc behaviour:
> > >
> > >root at boole01:/etc/slurm # salloc --gres=gpu:tesla:1 -N 1
> > >salloc: Granted job allocation 271558
> > >[root at boole-n019:/etc/slurm]# exit
> > >salloc: Relinquishing job allocation 271558
> > >root at boole01:/etc/slurm # salloc --gres=gpu:volta:1 -N 1
> > >salloc: Pending job allocation 271559
> > >salloc: job 271559 queued and waiting for resources
> > >^Csalloc: Job allocation 271559 has been revoked.
> > >
> > >And the unsuccessful salloc behaviour:
> > >
> > >root at boole01:/etc/slurm # salloc --gres=gpu:2080ti:1 -N 1
> > >salloc: error: Job submit/allocate failed: Invalid generic resource (gres)
> > >specification
> > >
> > >Slurm.log output for the successful sallocs:
> > >
> > >[2019-01-11T10:13:36.434] sched: _slurm_rpc_allocate_resources JobId=271558
> > >NodeList=boole-n019 usec=30495
> > >[2019-01-11T10:13:42.485] _job_complete: JobId=271558 WEXITSTATUS 0
> > >[2019-01-11T10:13:42.486] _job_complete: JobId=271558 done
> > >[2019-01-11T10:13:46.000] sched: _slurm_rpc_allocate_resources JobId=271559
> > >NodeList=(null) usec=15674
> > >[2019-01-11T10:13:48.778] _job_complete: JobId=271559 WTERMSIG 126
> > >[2019-01-11T10:13:48.778] _job_complete: JobId=271559 cancelled by interactive
> > >user
> > >[2019-01-11T10:13:48.778] _job_complete: JobId=271559 done
> > >
> > >Slurm.log output for the unsuccessful salloc:
> > >
> > >[2019-01-11T10:13:55.755] _get_next_job_gres: Invalid GRES job specification
> > >gpu:2080ti:1
> > >[2019-01-11T10:13:55.755] _slurm_rpc_allocate_resources: Invalid generic
> > >resource (gres) specification
> > >
> > >
> > >Slurm gres configuration:
> > >
> > >root at boole01:/etc/slurm # grep -i gres slurm.conf | grep -v ^#
> > >GresTypes=gpu,mic
> > >NodeName=boole-n[018-023] Gres=gpu:tesla:2 RealMemory=256000 Sockets=2
> > >CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=50
> > >NodeName=boole-n024 Gres=gpu:2080ti:2 RealMemory=256000 Sockets=2
> > >CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=100
> > >NodeName=boole-n016 Gres=gpu:volta:2 RealMemory=256000 Sockets=2
> > >CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=200
> > >
> > >gres.conf:
> > >
> > >root at boole01:/etc/slurm # cat gres.conf
> > >NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia0
> > >NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia1
> > >NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia0
> > >NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia1
> > >NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia0
> > >NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia1
> > >#NodeName=boole-n017 Name=mic File=/dev/mic0
> > >#NodeName=boole-n017 Name=mic File=/dev/mic1
> > >
> > >Please let me know if there is any more info that would be helpful.
> > >
> > >What am I missing or doing wrong?
> > >
> > >Many thanks in advance.
> > >
> > >Sean
> > >
> > >
> >
>
--
Sean McGrath M.Sc
Systems Administrator
Trinity Centre for High Performance and Research Computing
Trinity College Dublin
sean.mcgrath at tchpc.tcd.ie
https://www.tcd.ie/
https://www.tchpc.tcd.ie/
+353 (0) 1 896 3725