[slurm-users] GPU configuration not working
Paul Raines
raines at nmr.mgh.harvard.edu
Thu Jul 23 15:26:38 UTC 2020
After a complete shutdown and restart of all daemons, things have changed
somewhat.
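(In case it matters, "restart of all daemons" here was roughly the following
on the controller and on both GPU nodes, assuming the stock slurmctld/slurmd
systemd units:

    systemctl restart slurmctld    # on the controller
    systemctl restart slurmd       # on mlscgpu1 and mlscgpu2
)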
# scontrol show nodes | egrep '(^Node|Gres)'
NodeName=mlscgpu1 Arch=x86_64 CoresPerSocket=16
Gres=gpu:quadro_rtx_6000:10(S:0)
NodeName=mlscgpu2 Arch=x86_64 CoresPerSocket=16
Gres=gpu:quadro_rtx_6000:5(S:0)
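so the type string is reported now. I have only checked scontrol, but I assume
sinfo would show the same thing, e.g.

    sinfo -N -o '%N %G'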
Now I can submit like this:
mlscgpu1[0]:~$ salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1 --mem=1G
salloc: Granted job allocation 16
mlscgpu1[0]:~$ printenv | grep -i CUDA
mlscgpu1[0]:~$ printenv | grep -i slurm
SLURM_NODELIST=mlscgpu1
SLURM_JOB_NAME=bash
SLURM_NODE_ALIASES=(null)
SLURM_JOB_QOS=normal
SLURM_NNODES=1
SLURM_JOBID=16
SLURM_NTASKS=1
SLURM_TASKS_PER_NODE=1
SLURM_CONF=/etc/slurm/slurm.conf
SLURM_CPUS_PER_TASK=3
SLURM_JOB_ID=16
SLURM_SUBMIT_DIR=/autofs/homes/011/raines
SLURM_NPROCS=1
SLURM_JOB_NODELIST=mlscgpu1
SLURM_CLUSTER_NAME=mlsc
SLURM_JOB_CPUS_PER_NODE=4
SLURM_SUBMIT_HOST=mlscgpu1
SLURM_JOB_PARTITION=batch
SLURM_JOB_NUM_NODES=1
SLURM_MEM_PER_NODE=1024
mlscgpu1[0]:~$
But still, no CUDA_VISIBLE_DEVICES is being set.
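(Note I am only checking the salloc shell itself here. If I understand the
docs correctly, the GPU variables get exported by slurmstepd when a step
starts, so the next thing I will try is checking from inside a step, along
the lines of:

    salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1 --mem=1G
    srun printenv | grep -i CUDA

but in the salloc shell the variable is simply not there.)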
On Thu, 23 Jul 2020 10:32am, Paul Raines wrote:
>
> I have two systems in my cluster with GPUs. Their setup in slurm.conf is:
>
> GresTypes=gpu
> NodeName=mlscgpu1 Gres=gpu:quadro_rtx_6000:10 CPUs=64 Boards=1
> SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1546557
> NodeName=mlscgpu2 Gres=gpu:quadro_rtx_6000:5 CPUs=64 Boards=1
> SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1546557
>
> My gres.conf is simply
>
> AutoDetect=nvml
>
> When I start slurmd on mlscgpu2, for example, the log shows:
>
> [2020-07-23T10:05:10.619] 5 GPU system device(s) detected
> [2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=0
> ID=7696487 File=/dev/nvidia0 Cores=0-31 CoreCnt=64 Links=-1,0,2,0,0
> [2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=1
> ID=7696487 File=/dev/nvidia1 Cores=0-31 CoreCnt=64 Links=0,-1,0,0,0
> [2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=2
> ID=7696487 File=/dev/nvidia2 Cores=0-31 CoreCnt=64 Links=2,0,-1,0,0
> [2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=3
> ID=7696487 File=/dev/nvidia3 Cores=0-31 CoreCnt=64 Links=0,0,0,-1,2
> [2020-07-23T10:05:10.619] Gres Name=gpu Type=quadro_rtx_6000 Count=1 Index=4
> ID=7696487 File=/dev/nvidia4 Cores=0-31 CoreCnt=64 Links=0,0,0,2,-1
> [2020-07-23T10:05:10.626] slurmd version 20.02.3 started
> [2020-07-23T10:05:10.627] slurmd started on Thu, 23 Jul 2020 10:05:10 -0400
> [2020-07-23T10:05:10.627] CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2
> Memory=1546557 TmpDisk=215198 Uptime=1723215 CPUSpecList=(null)
> FeaturesAvail=(null) FeaturesActive=(null)
>
> # scontrol show nodes | egrep '(^Node|Gres)'
> NodeName=mlscgpu1 Arch=x86_64 CoresPerSocket=16
> Gres=gpu:10(S:0)
> NodeName=mlscgpu2 Arch=x86_64 CoresPerSocket=16
> Gres=gpu:5(S:0)
>
> Note how Gres above does not show "quadro_rtx_6000". Also, what does the
> (S:0) mean?
>
> Doing a submit like this fails:
>
> $ salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1
> salloc: error: Job submit/allocate failed: Requested node configuration is
> not available
> salloc: Job allocation 9 has been revoked.
>
> This works, but no CUDA device is allocated in the environment:
>
> $ salloc -n1 -c3 -p batch --gres=gpu:1
> salloc: Granted job allocation 10
> $ printenv | grep -i cuda
> $
>
>
> I have also tried changing gres.conf to the following and doing scontrol
> reconfigure:
>
> AutoDetect=nvml
> Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia0 Cores=0-31
> Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia1 Cores=0-31
> Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia2 Cores=0-31
> Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia3 Cores=0-31
> Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia4 Cores=0-31
>
> But this made no difference.
>
>