[slurm-users] GPU allocation problems

Yair Yarom irush at cs.huji.ac.il
Mon Mar 12 04:02:46 MDT 2018


Hi,

This is just a guess, but there's also a cgroup.conf file where you
might need to add:

ConstrainDevices=yes
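
For reference, a minimal cgroup.conf along those lines could look roughly
like this (the other Constrain* settings below are just common examples on
my side, not requirements, so adjust them to your site):

# cgroup.conf (sketch, not a drop-in file)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes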

see:
https://slurm.schedmd.com/cgroup.conf.html

for more details.
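
After changing cgroup.conf you'll need to restart slurmd on the nodes. A
quick sanity check, assuming nvidia-smi is installed on the compute nodes,
would be something like:

srun -n 1 -p cuda nvidia-smi -L
srun -n 1 -p cuda --gres=gpu:1 nvidia-smi -L

With ConstrainDevices=yes the first run should not be able to see any GPUs,
and the second should list only the one allocated card.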

HTH,
    Yair.

On Mon, Mar 12 2018, Sefa Arslan <sefa.arslan at tubitak.gov.tr> wrote:

> Dear all,
>
> We have upgraded our cluster from 13 to Slurm 17.11. We have a problem with
> the GPU configuration: although I request no GPUs, the system lets me use the GPU cards.
>
> Let me explain:
> Slurm.conf:
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory 
> TaskPlugin=task/cgroup
> PreemptType=preempt/none
>
> NodeName=cudanode[1-20] Procs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2
> RealMemory=384000 Gres=gpu:2
> PartitionName=cuda Nodes=cudanode[1-20] Default=no MaxTime=15-00:00:00
> defaulttime=00:02:00 State=UP DefMemPerCPU=8500 MaxMemPerNode=380000 Shared=NO
> Priority=1000
>
> Gres.conf:
> Name=gpu File=/dev/nvidia0
> CPUs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
> Name=gpu File=/dev/nvidia1
> CPUs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
>
> I am testing the configuration with the deviceQuery app that comes with the CUDA 9 package.
>
> When I submit a job with 2 GPUs, the system reserves the right number of GPUs:
> srun -n 1 -p cuda --nodelist=cudanode1 --gres=gpu:2 ./cuda.sh
> CUDA_VISIBLE_DEVICES: 0,1
> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime
> Version = 7.5, NumDevs = 2, Device0 = Tesla P100-PCIE-16GB, Device1 = Tesla
> P100-PCIE-16GB Result = PASS 
>
> When I submit a job with 1 GPU, the system reserves the right number of GPUs:
>
> srun -n 1 -p cuda --nodelist=cudanode1 --gres=gpu:1 ./cuda.sh 
> CUDA_VISIBLE_DEVICES: 0
> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime
> Version = 7.5, NumDevs = 1, Device0 = Tesla P100-PCIE-16GB
> Result = PASS
>
> But when I submit a job without any GPUs, the system also lets me use the GPUs,
> which I don't expect:
> srun -n 1 -p cuda --nodelist=cudanode1 ./cuda.sh 
> CUDA_VISIBLE_DEVICES: 
> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime
> Version = 7.5, NumDevs = 2, Device0 = Tesla P100-PCIE-16GB, Device1 = Tesla
> P100-PCIE-16GB
> Result = PASS
>
> This way, I am able to run 40 jobs that all use the GPUs on one server at
> the same time. Is this a bug, or did I miss something? With previous
> versions of Slurm, GPU allocation worked as I expected. I also tried with
> CUDA-enabled NAMD, which uses higher-level hardware access methods, and I get the
> same result.
>
> Another problem I hit: when I change the GPU configuration from Gres=gpu:2 to
> Gres=gpu:no_consume:2 so that the cards can be used simultaneously by many jobs, the
> system lets me use all cards regardless of how many cards I request.
>
> Regards,
> Sefa ARSLAN
