[slurm-users] GPU gres error for 1 of 3 GPU types

Sean McGrath smcgrat at tchpc.tcd.ie
Fri Jan 11 10:21:10 UTC 2019


Hi everyone,

Your help with this would be much appreciated.

We have a cluster with three types of GPU configured as GRES. Users can
successfully request two of the GPU types, but requests for the third fail
with an error.

Here is the salloc behaviour for the two types that work (the volta request
was accepted and queued, and I cancelled it with Ctrl-C):

root@boole01:/etc/slurm # salloc --gres=gpu:tesla:1 -N 1
salloc: Granted job allocation 271558
[root@boole-n019:/etc/slurm]# exit
salloc: Relinquishing job allocation 271558
root@boole01:/etc/slurm # salloc --gres=gpu:volta:1 -N 1
salloc: Pending job allocation 271559
salloc: job 271559 queued and waiting for resources
^Csalloc: Job allocation 271559 has been revoked.
salloc: Job aborted due to signal

And the unsuccessful salloc behaviour:

root@boole01:/etc/slurm # salloc --gres=gpu:2080ti:1 -N 1
salloc: error: Job submit/allocate failed: Invalid generic resource (gres) specification

Slurm.log output for the successful salloc requests:

[2019-01-11T10:13:36.434] sched: _slurm_rpc_allocate_resources JobId=271558 NodeList=boole-n019 usec=30495
[2019-01-11T10:13:42.485] _job_complete: JobId=271558 WEXITSTATUS 0
[2019-01-11T10:13:42.486] _job_complete: JobId=271558 done
[2019-01-11T10:13:46.000] sched: _slurm_rpc_allocate_resources JobId=271559 NodeList=(null) usec=15674
[2019-01-11T10:13:48.778] _job_complete: JobId=271559 WTERMSIG 126
[2019-01-11T10:13:48.778] _job_complete: JobId=271559 cancelled by interactive user
[2019-01-11T10:13:48.778] _job_complete: JobId=271559 done

Slurm.log output for the unsuccessful salloc request:

[2019-01-11T10:13:55.755] _get_next_job_gres: Invalid GRES job specification gpu:2080ti:1
[2019-01-11T10:13:55.755] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification
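
The three requests differ only in their type field. As a minimal sketch, the Python below splits each request string and reports one visible difference: 2080ti is the only type name that begins with a digit. Whether that matters to Slurm's GRES parser is an assumption that nothing in the output above confirms. `split_gres` and `leading_digit_types` are hypothetical helpers written for illustration, not Slurm code.

```python
def split_gres(spec):
    """Split a 'name:type:count' request into its fields.

    Hypothetical helper for illustration only; this is not Slurm's parser.
    """
    name, gres_type, count = spec.split(":")
    return name, gres_type, int(count)


def leading_digit_types(specs):
    """Return the type names that begin with a digit."""
    return [t for _, t, _ in map(split_gres, specs) if t[0].isdigit()]


# The three requests made with salloc above.
REQUESTS = ["gpu:tesla:1", "gpu:volta:1", "gpu:2080ti:1"]

print(leading_digit_types(REQUESTS))  # ['2080ti']
```

If the leading digit is suspected, one way to test the idea would be to rename the type on a single node in both configuration files and retry the request.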


Slurm gres configuration:

root@boole01:/etc/slurm # grep -i gres slurm.conf | grep -v ^#
GresTypes=gpu,mic
NodeName=boole-n[018-023] Gres=gpu:tesla:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=50
NodeName=boole-n024 Gres=gpu:2080ti:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=100
NodeName=boole-n016 Gres=gpu:volta:2 RealMemory=256000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=200

gres.conf:

root@boole01:/etc/slurm # cat gres.conf
NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia0
NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia1
NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia0
NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia1
NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia0
NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia1
#NodeName=boole-n017 Name=mic File=/dev/mic0
#NodeName=boole-n017 Name=mic File=/dev/mic1
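
To rule out a plain mismatch between the two files, the Python sketch below compares the `Gres=` entries from slurm.conf with the `Type=`/`File=` lines from gres.conf, node by node. It makes several simplifying assumptions: every `Gres=` entry has the form name:type:count, each device has its own `File=` line, and NodeName strings are compared literally without expanding ranges such as `boole-n[018-023]`. The function names are hypothetical, and the sample text is copied from the listings above with unrelated fields dropped.

```python
# Gres-related fields copied from the slurm.conf and gres.conf listings above.
SLURM_CONF = """\
NodeName=boole-n[018-023] Gres=gpu:tesla:2
NodeName=boole-n024 Gres=gpu:2080ti:2
NodeName=boole-n016 Gres=gpu:volta:2
"""

GRES_CONF = """\
NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia0
NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia1
NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia0
NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia1
NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia0
NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia1
#NodeName=boole-n017 Name=mic File=/dev/mic0
#NodeName=boole-n017 Name=mic File=/dev/mic1
"""


def fields(line):
    """Split a 'Key=Value Key=Value' line into a dict."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def active_lines(text):
    """Yield non-empty lines that are not commented out."""
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            yield line


def slurm_conf_gres(text):
    """Map NodeName -> {(name, type): count} from Gres= in slurm.conf."""
    result = {}
    for line in active_lines(text):
        f = fields(line)
        if "NodeName" in f and "Gres" in f:
            per_node = result.setdefault(f["NodeName"], {})
            for spec in f["Gres"].split(","):
                name, gres_type, count = spec.split(":")  # assumes name:type:count
                per_node[(name, gres_type)] = int(count)
    return result


def gres_conf_gres(text):
    """Map NodeName -> {(name, type): number of File= lines} from gres.conf."""
    result = {}
    for line in active_lines(text):
        f = fields(line)
        if "NodeName" in f and "Name" in f:
            key = (f["Name"], f.get("Type", ""))
            per_node = result.setdefault(f["NodeName"], {})
            per_node[key] = per_node.get(key, 0) + 1
    return result


def mismatches(slurm_text, gres_text):
    """Return the sorted NodeName strings whose GRES differ between the files."""
    a, b = slurm_conf_gres(slurm_text), gres_conf_gres(gres_text)
    return sorted(node for node in set(a) | set(b) if a.get(node) != b.get(node))


print(mismatches(SLURM_CONF, GRES_CONF))  # []
```

For the configuration shown here the list is empty, so under these assumptions the type names and device counts agree between the two files for all three node definitions, including boole-n024.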

Please let me know if there is any more information that would be helpful.

What am I missing or doing wrong?

Many thanks in advance.

Sean


-- 
Sean McGrath M.Sc

Systems Administrator
Trinity Centre for High Performance and Research Computing
Trinity College Dublin

sean.mcgrath at tchpc.tcd.ie

https://www.tcd.ie/
https://www.tchpc.tcd.ie/

+353 (0) 1 896 3725



