[slurm-users] GPU gres error for 1 of 3 GPU types

Sean McGrath smcgrat at tchpc.tcd.ie
Fri Jan 11 16:51:38 UTC 2019


I forgot to mention before that we are running Slurm version 18.08.3.

On Fri, Jan 11, 2019 at 10:35:09AM -0500, Paul Edmon wrote:

> I'm pretty sure that gres.conf has to be on all the nodes as well
> and not just the master.

Thanks Paul. We deploy the same Slurm configuration, including the gres.conf
file, cluster-wide. I've double-checked the node in question and it has the
correct gres.conf.
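
In case it is useful, my next step is to compare what the node-local slurmd and
the controller think is configured, roughly along these lines (a sketch only, I
haven't captured the output yet; prompts are just illustrative):

# on the GPU node, print the gres configuration slurmd itself has parsed
[root at boole-n024:/etc/slurm]# slurmd -G

# on the controller, what slurmctld has registered for that node
root at boole01:/etc/slurm # scontrol show node boole-n024 | grep -i gres
root at boole01:/etc/slurm # sinfo -N -o "%N %G" | grep boole-n024

If the node and the controller disagree about gpu:2080ti, that should at least
narrow down which side is failing to parse it.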

Best

Sean

> 
> -Paul Edmon-
> 
> On 1/11/19 5:21 AM, Sean McGrath wrote:
> >Hi everyone,
> >
> >Any help with this would be much appreciated.
> >
> >We have a cluster with 3 types of GPU configured in gres. Users can successfully
> >request 2 of the GPU types, but the third errors when requested.
> >
> >Here is the successful salloc behaviour (the volta request is accepted and just
> >queues waiting for resources, so I cancel it with Ctrl-C):
> >
> >root at boole01:/etc/slurm # salloc --gres=gpu:tesla:1 -N 1
> >salloc: Granted job allocation 271558
> >[root at boole-n019:/etc/slurm]# exit
> >salloc: Relinquishing job allocation 271558
> >root at boole01:/etc/slurm # salloc --gres=gpu:volta:1 -N 1
> >salloc: Pending job allocation 271559
> >salloc: job 271559 queued and waiting for resources
> >^Csalloc: Job allocation 271559 has been revoked.
> >salloc: Job aborted due to signal
> >
> >And the unsuccessful salloc behaviour:
> >
> >root at boole01:/etc/slurm # salloc --gres=gpu:2080ti:1 -N 1
> >salloc: error: Job submit/allocate failed: Invalid generic resource (gres) specification
> >
> >Slurm.log output for the successful sallocs:
> >
> >[2019-01-11T10:13:36.434] sched: _slurm_rpc_allocate_resources JobId=271558 NodeList=boole-n019 usec=30495
> >[2019-01-11T10:13:42.485] _job_complete: JobId=271558 WEXITSTATUS 0
> >[2019-01-11T10:13:42.486] _job_complete: JobId=271558 done
> >[2019-01-11T10:13:46.000] sched: _slurm_rpc_allocate_resources JobId=271559 NodeList=(null) usec=15674
> >[2019-01-11T10:13:48.778] _job_complete: JobId=271559 WTERMSIG 126
> >[2019-01-11T10:13:48.778] _job_complete: JobId=271559 cancelled by interactive user
> >[2019-01-11T10:13:48.778] _job_complete: JobId=271559 done
> >
> >Slurm.log output for the unsuccessful sallocs:
> >
> >[2019-01-11T10:13:55.755] _get_next_job_gres: Invalid GRES job specification gpu:2080ti:1
> >[2019-01-11T10:13:55.755] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification
> >
> >
> >Slurm gres configuration:
> >
> >root at boole01:/etc/slurm # grep -i gres slurm.conf | grep -v ^#
> >GresTypes=gpu,mic
> >NodeName=boole-n[018-023] Gres=gpu:tesla:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=50
> >NodeName=boole-n024 Gres=gpu:2080ti:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=100
> >NodeName=boole-n016 Gres=gpu:volta:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=200
> >
> >gres.conf:
> >
> >root at boole01:/etc/slurm # cat gres.conf
> >NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia0
> >NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia1
> >NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia0
> >NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia1
> >NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia0
> >NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia1
> >#NodeName=boole-n017 Name=mic File=/dev/mic0
> >#NodeName=boole-n017 Name=mic File=/dev/mic1
> >
> >Please let me know if there is any more info that would be helpful here.
> >
> >What am I missing or doing wrong?
> >
> >Many thanks in advance.
> >
> >Sean
> >
> >
> 

-- 
Sean McGrath M.Sc

Systems Administrator
Trinity Centre for High Performance and Research Computing
Trinity College Dublin

sean.mcgrath at tchpc.tcd.ie

https://www.tcd.ie/
https://www.tchpc.tcd.ie/

+353 (0) 1 896 3725



