[slurm-users] GPU configuration not working
Paul Raines
raines at nmr.mgh.harvard.edu
Thu Jul 23 14:32:20 UTC 2020
I have two systems in my cluster with GPUs. Their setup in slurm.conf is
GresTypes=gpu
NodeName=mlscgpu1 Gres=gpu:quadro_rtx_6000:10 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1546557
NodeName=mlscgpu2 Gres=gpu:quadro_rtx_6000:5 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1546557
My gres.conf is simply
AutoDetect=nvml
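(In case the type string matters: my assumption is that slurmd derives the quadro_rtx_6000 type from the device name NVML reports, with spaces turned into underscores, so a quick sanity check on the node would be something like

$ nvidia-smi --query-gpu=name --format=csv,noheader

which should print "Quadro RTX 6000" once per card. Correct me if the type is derived some other way.)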
When I start slurmd on mlscgpu2, for example, the log shows:
[2020-07-23T10:05:10.619] 5 GPU system device(s) detected
[2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=0
ID=7696487 File=/dev/nvidia0 Cores=0-31 CoreCnt=64 Links=-1,0,2,0,0
[2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=1
ID=7696487 File=/dev/nvidia1 Cores=0-31 CoreCnt=64 Links=0,-1,0,0,0
[2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=2
ID=7696487 File=/dev/nvidia2 Cores=0-31 CoreCnt=64 Links=2,0,-1,0,0
[2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=3
ID=7696487 File=/dev/nvidia3 Cores=0-31 CoreCnt=64 Links=0,0,0,-1,2
[2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=4
ID=7696487 File=/dev/nvidia4 Cores=0-31 CoreCnt=64 Links=0,0,0,2,-1
[2020-07-23T10:05:10.626] slurmd version 20.02.3 started
[2020-07-23T10:05:10.627] slurmd started on Thu, 23 Jul 2020 10:05:10 -0400
[2020-07-23T10:05:10.627] CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2
Memory=1546557 TmpDisk=215198 Uptime=1723215 CPUSpecList=(null)
FeaturesAvail=(null) FeaturesActive=(null)
# scontrol show nodes | egrep '(^Node|Gres)'
NodeName=mlscgpu1 Arch=x86_64 CoresPerSocket=16
Gres=gpu:10(S:0)
NodeName=mlscgpu2 Arch=x86_64 CoresPerSocket=16
Gres=gpu:5(S:0)
Note how the Gres field above does not show "quadro_rtx_6000". Also, what does
the (S:0) mean?
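(For completeness, the per-node Gres strings can also be listed with something like

$ sinfo -N -o '%N %G'

though I assume that just echoes the same gpu:10 / gpu:5 shown above.)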
Doing a submit like this fails:
$ salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1
salloc: error: Job submit/allocate failed: Requested node configuration is not
available
salloc: Job allocation 9 has been revoked.
This works, but no CUDA device is allocated in the environment:
$ salloc -n1 -c3 -p batch --gres=gpu:1
salloc: Granted job allocation 10
$ printenv | grep -i cuda
$
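(My understanding, which may well be wrong, is that CUDA_VISIBLE_DEVICES gets set
for job steps rather than in the salloc shell itself, so the check I would need
inside the allocation is probably something like

$ srun printenv CUDA_VISIBLE_DEVICES

Treat that as a guess on my part.)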
I have also tried changing gres.conf to the following and running "scontrol reconfigure":
AutoDetect=nvml
Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia0 Cores=0-31
Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia1 Cores=0-31
Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia2 Cores=0-31
Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia3 Cores=0-31
Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia4 Cores=0-31
But this made no difference.
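If it would help with debugging, I can turn on GRES logging; my understanding from
the slurm.conf man page is that something like this (plus a restart/reconfigure)
should produce more detail in the slurmd log:

DebugFlags=Gres
SlurmdDebug=debug2

Correct me if those are not the right knobs.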