[slurm-users] GPU gres error for 1 of 3 GPU types

Paul Edmon pedmon at cfa.harvard.edu
Fri Jan 11 15:35:09 UTC 2019


I'm pretty sure that gres.conf has to be on all the nodes as well and 
not just the master.
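
A quick way to confirm is to compare the file across the GPU nodes. A 
rough sketch, assuming passwordless ssh, the hostnames from the 
slurm.conf below, and that /etc/slurm/gres.conf is the path on the 
nodes as well:

    for n in boole-n016 boole-n024 boole-n0{18..23}; do
        ssh "$n" md5sum /etc/slurm/gres.conf
    done

If boole-n024 turns out to be missing the file or carrying a stale 
copy, syncing it out and restarting slurmd on that node would be the 
next thing I'd try.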

-Paul Edmon-

On 1/11/19 5:21 AM, Sean McGrath wrote:
> Hi everyone,
>
> Any help with this would be much appreciated.
>
> We have a cluster with 3 types of GPU configured in GRES. Users can successfully
> request 2 of the GPU types, but the third gives an error when it is requested.
>
> Here is the successful salloc behaviour:
>
> root@boole01:/etc/slurm # salloc --gres=gpu:tesla:1 -N 1
> salloc: Granted job allocation 271558
> [root@boole-n019:/etc/slurm]# exit
> salloc: Relinquishing job allocation 271558
> root@boole01:/etc/slurm # salloc --gres=gpu:volta:1 -N 1
> salloc: Pending job allocation 271559
> salloc: job 271559 queued and waiting for resources
> ^Csalloc: Job allocation 271559 has been revoked.
> salloc: Job aborted due to signal
>
> And the unsuccessful salloc behaviour:
>
> root@boole01:/etc/slurm # salloc --gres=gpu:2080ti:1 -N 1
> salloc: error: Job submit/allocate failed: Invalid generic resource (gres)
> specification
>
> Slurm.log output for the successful sallocs:
>
> [2019-01-11T10:13:36.434] sched: _slurm_rpc_allocate_resources JobId=271558
> NodeList=boole-n019 usec=30495
> [2019-01-11T10:13:42.485] _job_complete: JobId=271558 WEXITSTATUS 0
> [2019-01-11T10:13:42.486] _job_complete: JobId=271558 done
> [2019-01-11T10:13:46.000] sched: _slurm_rpc_allocate_resources JobId=271559
> NodeList=(null) usec=15674
> [2019-01-11T10:13:48.778] _job_complete: JobId=271559 WTERMSIG 126
> [2019-01-11T10:13:48.778] _job_complete: JobId=271559 cancelled by interactive
> user
> [2019-01-11T10:13:48.778] _job_complete: JobId=271559 done
>
> Slurm.log output for the unsuccessful salloc:
>
> [2019-01-11T10:13:55.755] _get_next_job_gres: Invalid GRES job specification
> gpu:2080ti:1
> [2019-01-11T10:13:55.755] _slurm_rpc_allocate_resources: Invalid generic
> resource (gres) specification
>
>
> Slurm gres configuration:
>
> root@boole01:/etc/slurm # grep -i gres slurm.conf | grep -v ^#
> GresTypes=gpu,mic
> NodeName=boole-n[018-023] Gres=gpu:tesla:2 RealMemory=256000 Sockets=2
> CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=50
> NodeName=boole-n024 Gres=gpu:2080ti:2 RealMemory=256000 Sockets=2
> CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=100
> NodeName=boole-n016 Gres=gpu:volta:2 RealMemory=256000 Sockets=2
> CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN Weight=200
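>
> In case it helps the diagnosis, I believe the controller's registered GRES
> for the 2080ti node can be checked with something along the lines of:
>
> scontrol show node boole-n024 | grep -i gres
>
> I'm happy to post that output if it would be useful.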
>
> gres.conf:
>
> root@boole01:/etc/slurm # cat gres.conf
> NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia0
> NodeName=boole-n[018-023] Name=gpu Type=tesla File=/dev/nvidia1
> NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia0
> NodeName=boole-n024 Name=gpu Type=2080ti File=/dev/nvidia1
> NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia0
> NodeName=boole-n016 Name=gpu Type=volta File=/dev/nvidia1
> #NodeName=boole-n017 Name=mic File=/dev/mic0
> #NodeName=boole-n017 Name=mic File=/dev/mic1
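>
> I can also check the GPU device files and the slurmd state on boole-n024
> directly if that would help, e.g. (assuming slurmd runs under systemd on
> the nodes):
>
> ssh boole-n024 'ls -l /dev/nvidia*'
> ssh boole-n024 systemctl status slurmd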
>
> Please let me know if there is any more info that would be helpful.
>
> What am I missing or doing wrong?
>
> Many thanks in advance.
>
> Sean
>
>


