[slurm-users] GPU / cgroup challenges

R. Paul Wiegand rpwiegand at gmail.com
Tue May 1 15:24:42 MDT 2018


Greetings,

I am setting up our new GPU cluster, and I seem to have a problem
configuring things so that the devices are properly walled off via
cgroups.  Our nodes each have two GPUs; however, if --gres is unset or
set to --gres=gpu:0, I can access both GPUs from inside a job.
Moreover, if I ask for just 1 GPU and then unset the CUDA_VISIBLE_DEVICES
environment variable, I can access both GPUs.  From my
understanding, this suggests that the devices are *not* being protected
by cgroups.
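
The symptom described above can be checked without any CUDA tooling by
trying to open the device nodes directly from inside a job.  The sketch
below is only an illustration: the helper name probe_devices is made
up here, and it assumes the GPUs appear as /dev/nvidia0 and /dev/nvidia1
as in the gres.conf snippet.  When a devices cgroup is enforcing access,
opening a device the job was not granted typically fails with EPERM;
if both opens succeed in a job that asked for one GPU or none, the
devices are not being constrained.

```python
import errno
import os


def probe_devices(paths):
    """Try to open each device node and report the outcome per path.

    Returns a dict mapping each path to "OK" or to the symbolic errno
    name (for example "EPERM" or "ENOENT") raised by the open attempt.
    """
    results = {}
    for path in paths:
        try:
            fd = os.open(path, os.O_RDWR)
        except OSError as exc:
            # A devices-cgroup denial usually surfaces as EPERM;
            # ENOENT just means the device node does not exist here.
            results[path] = errno.errorcode.get(exc.errno, str(exc.errno))
        else:
            os.close(fd)
            results[path] = "OK"
    return results


if __name__ == "__main__":
    # Device paths taken from the gres.conf snippet in this message.
    for path, status in probe_devices(["/dev/nvidia0", "/dev/nvidia1"]).items():
        print(f"{path}: {status}")
```

Running this under srun with --gres=gpu:1, and again with --gres unset,
gives a quick before/after comparison whenever a configuration change is
tried.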

I've read the documentation, and I've read through a number of threads
where people have resolved similar issues.  I've tried a lot of
configurations, but to no avail.  Below I include some snippets of the
relevant (current) parameters; I am also attaching most of
our full conf files.

[slurm.conf]
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
JobAcctGatherType=jobacct_gather/linux
AccountingStorageTRES=gres/gpu
GresTypes=gpu

NodeName=evc1 CPUs=32 RealMemory=191917 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN NodeAddr=ivc1 Weight=1 Gres=gpu:2

[gres.conf]
NodeName=evc[1-10] Name=gpu File=/dev/nvidia0 COREs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NodeName=evc[1-10] Name=gpu File=/dev/nvidia1 COREs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31

[cgroup.conf]
ConstrainDevices=yes

[cgroup_allowed_devices_file.conf]
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
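
With ConstrainDevices=yes set, another thing worth checking is whether a
job step actually lands in a devices cgroup at all.  The sketch below
parses the contents of /proc/self/cgroup, whose lines have the form
hierarchy-ID:controller-list:path, and returns the path for the devices
controller.  The function name is made up for illustration, and the
sample path in the comment is an assumption about a typical cgroup v1
layout, not output taken from this cluster.

```python
def devices_cgroup_path(cgroup_text):
    """Return the cgroup path of the 'devices' controller, or None.

    cgroup_text is the content of /proc/<pid>/cgroup, one entry per
    line in the form hierarchy-ID:controller-list:cgroup-path.
    """
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _hierarchy_id, controllers, path = parts
        if "devices" in controllers.split(","):
            return path
    return None


if __name__ == "__main__":
    # Run inside a job step. A path such as /slurm/uid_N/job_N/step_N
    # (assumed layout) would suggest the step was placed in a
    # Slurm-managed devices cgroup; "/" or None would suggest it was not.
    with open("/proc/self/cgroup") as fh:
        print(devices_cgroup_path(fh.read()))
```

This only shows where the process sits in the hierarchy; whether the GPU
device nodes are actually denied still has to be confirmed by trying to
open them.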

Thanks,
Paul.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cgroup_allowed_devices_file.conf
Type: application/octet-stream
Size: 67 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20180501/9497813d/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cgroup.conf
Type: application/octet-stream
Size: 98 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20180501/9497813d/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gres.conf
Type: application/octet-stream
Size: 272 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20180501/9497813d/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm.conf
Type: application/octet-stream
Size: 6225 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20180501/9497813d/attachment-0003.obj>

