[slurm-users] GPU allocation problems

Sefa Arslan sefa.arslan at tubitak.gov.tr
Mon Mar 12 01:24:12 MDT 2018


Dear all,

We have upgraded our cluster from version 13 to Slurm 17.11 and we have a
problem with the GPU configuration: although I request no GPUs, the
system still lets me use the GPU cards.

Let me explain.

slurm.conf:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
TaskPlugin=task/cgroup
PreemptType=preempt/none

NodeName=cudanode[1-20]  Procs=40   Sockets=2  CoresPerSocket=20
ThreadsPerCore=2 RealMemory=384000   Gres=gpu:2
PartitionName=cuda       Nodes=cudanode[1-20]   Default=no   
MaxTime=15-00:00:00 defaulttime=00:02:00 State=UP DefMemPerCPU=8500
MaxMemPerNode=380000  Shared=NO Priority=1000

gres.conf:
Name=gpu File=/dev/nvidia0
CPUs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
Name=gpu File=/dev/nvidia1
CPUs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79


I am testing the configuration with the deviceQuery app that comes with the CUDA 9 package.
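
For reference, cuda.sh is essentially a small wrapper along the following lines (a sketch only, not the exact script; the deviceQuery binary is the one built from the CUDA samples):

#!/bin/bash
# Print the GPUs Slurm handed to this job, then run deviceQuery on them.
echo "CUDA_VISIBLE_DEVICES:  $CUDA_VISIBLE_DEVICES"
./deviceQuery   # binary built from the CUDA 9 samples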

When I send a job with 2 GPUs, the system reserves the right number of GPUs:

srun -n 1 -p cuda --nodelist=cudanode1 --gres=gpu:2 ./cuda.sh
CUDA_VISIBLE_DEVICES:  0,1
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA
Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla P100-PCIE-16GB,
Device1 = Tesla P100-PCIE-16GB Result = PASS

When I send a job with 1 GPU, the system also reserves the right number of GPUs:

srun -n 1 -p cuda --nodelist=cudanode1 --gres=gpu:1 ./cuda.sh
CUDA_VISIBLE_DEVICES:  0
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA
Runtime Version = 7.5, NumDevs = 1, Device0 = Tesla P100-PCIE-16GB
Result = PASS

But when I send a job without requesting any GPUs, the system still lets me use the GPUs, which I do not expect:

srun -n 1 -p cuda --nodelist=cudanode1 ./cuda.sh
CUDA_VISIBLE_DEVICES: 
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA
Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla P100-PCIE-16GB,
Device1 = Tesla P100-PCIE-16GB
Result = PASS

This way, I am able to run 40 jobs on one server at the same time, all of
them using the GPUs (see the sketch below). Is this a bug, or did I miss
something? With previous versions of Slurm, GPU allocation behaved as I
expected. I also tried with CUDA-enabled NAMD, which uses higher-level
hardware access methods, and I get the same result.
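
To reproduce it, I just launch many copies of the job that requests no GPUs, roughly like this (a sketch, not my exact submission script):

# launch 40 copies of the non-GPU job on the same node in parallel
for i in $(seq 1 40); do
    srun -n 1 -p cuda --nodelist=cudanode1 ./cuda.sh &
done
wait

All of them report NumDevs = 2.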

Another problem I hit: when I change the GPU configuration from
Gres=gpu:2 to Gres=gpu:no_consume:2, so that the cards can be used by
many jobs simultaneously, the system lets me use all the cards
regardless of how many cards I request.
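
Concretely, the only change is in the node definition (a sketch of that line; everything else is unchanged):

NodeName=cudanode[1-20]  Procs=40   Sockets=2  CoresPerSocket=20 ThreadsPerCore=2  RealMemory=384000   Gres=gpu:no_consume:2

After that, even a request such as "srun -n 1 -p cuda --nodelist=cudanode1 --gres=gpu:1 ./cuda.sh" still sees both cards.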


Regards,
Sefa ARSLAN