[slurm-users] GPU / cgroup challenges

Tue May 1 17:23:45 MDT 2018

Thanks Kevin!

Indeed, nvidia-smi in an interactive job tells me that I can get access to
the device when I should not be able to.
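
To take nvidia-smi and CUDA_VISIBLE_DEVICES out of the picture, here is a
minimal sketch of a more direct check: it just tries to open each
/dev/nvidiaN node from inside the job. The name probe_devices is made up
for illustration, and I am assuming that a device denied by the devices
cgroup fails at open() with a permission error; that is an assumption on my
part, not something confirmed on this cluster.

```python
import glob
import os
import re


def probe_devices(paths):
    """Try to open each path read/write and report what happened.

    Returns a dict mapping path -> "accessible" or the OS error text.
    """
    results = {}
    for path in paths:
        try:
            fd = os.open(path, os.O_RDWR)
        except OSError as err:
            # Assumption: a cgroup-denied device shows up here as a
            # permission error rather than as a missing file.
            results[path] = err.strerror
        else:
            os.close(fd)
            results[path] = "accessible"
    return results


if __name__ == "__main__":
    # Only the per-GPU nodes (/dev/nvidia0, /dev/nvidia1, ...), not
    # /dev/nvidiactl or /dev/nvidia-uvm.
    gpu_nodes = sorted(
        p for p in glob.glob("/dev/nvidia*") if re.fullmatch(r"/dev/nvidia\d+", p)
    )
    for path, status in probe_devices(gpu_nodes).items():
        print(path, status)
```

If the constraint were in effect, I would expect only the allocated GPU to
show up as accessible when run inside a --gres=gpu:1 job.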

I thought including the /dev/nvidia* would whitelist those devices ...
which seems to be the opposite of what I want, no?  Or do I misunderstand?
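
Rather than guess, one thing I could do is look at what the job's devices
cgroup actually allows. The sketch below assumes cgroup v1 with the devices
controller mounted at /sys/fs/cgroup/devices (an assumption about the node,
not something established in this thread); the helper names are
hypothetical.

```python
import os

# Assumed cgroup v1 mount point for the devices controller.
CGROUP_ROOT = "/sys/fs/cgroup/devices"


def devices_cgroup_path(proc_cgroup_text):
    """Return the devices-controller path from /proc/<pid>/cgroup text, or None."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and "devices" in parts[1].split(","):
            return parts[2]
    return None


def parse_devices_list(text):
    """Parse devices.list lines such as 'c 195:0 rwm' into tuples.

    Each tuple is (type, major, minor, access); wildcards stay as '*'.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        dev_type, numbers, access = fields
        major, _, minor = numbers.partition(":")
        entries.append((dev_type, major, minor, access))
    return entries


if __name__ == "__main__":
    try:
        with open("/proc/self/cgroup") as f:
            path = devices_cgroup_path(f.read())
        if path is None:
            print("no devices controller found for this process")
        else:
            list_file = os.path.join(CGROUP_ROOT, path.lstrip("/"), "devices.list")
            with open(list_file) as f:
                # NVIDIA GPU nodes typically appear with char major 195.
                for entry in parse_devices_list(f.read()):
                    print(entry)
    except OSError as err:
        print("could not read cgroup information:", err)
```

Run from inside the job step, an entry like ('a', '*', '*', 'rwm') would
suggest that nothing is being restricted at all.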

Thanks,
Paul

On Tue, May 1, 2018, 19:00 Kevin Manalo <kmanalo at jhu.edu> wrote:

> Paul,
>
> Having recently set this up, this was my test: when you request a single
> GPU for an interactive run (salloc ... --gres=gpu:1 srun --pty bash), you
> should only see the GPU assigned to you via 'nvidia-smi'.
>
> When gres is unset you should see
>
> nvidia-smi
> No devices were found
>
> Otherwise, if you ask for 1 of 2, you should only see 1 device.
>
> Also, I recall appending this to the bottom of
>
> [cgroup_allowed_devices_file.conf]
> ..
> Same as yours
> ...
> /dev/nvidia*
>
> There was a Slurm bug report that made this clear; the website docs do
> not spell it out.
>
> -Kevin
>
>
> On 5/1/18, 5:28 PM, "slurm-users on behalf of R. Paul Wiegand" <
> slurm-users-bounces at lists.schedmd.com on behalf of rpwiegand at gmail.com>
> wrote:
>
>     Greetings,
>
>     I am setting up our new GPU cluster, and I seem to have a problem
>     configuring things so that the devices are properly walled off via
>     cgroups.  Our nodes each have two GPUs; however, if --gres is unset, or
>     set to --gres=gpu:0, I can access both GPUs from inside a job.
>     Moreover, if I ask for just 1 GPU and then unset the CUDA_VISIBLE_DEVICES
>     environment variable, I can access both GPUs.  From my
>     understanding, this suggests that it is *not* being protected under
>     cgroups.
>
>     I've read the documentation, and I've read through a number of threads
>     where people have resolved similar issues.  I've tried a lot of
>     configurations, but to no avail. Below I include some snippets of
>     relevant (current) parameters; however, I also am attaching most of
>     our full conf files.
>
>     [slurm.conf]
>     ProctrackType=proctrack/cgroup
>     TaskPlugin=task/cgroup
>     SelectType=select/cons_res
>     SelectTypeParameters=CR_Core_Memory
>     JobAcctGatherType=jobacct_gather/linux
>     AccountingStorageTRES=gres/gpu
>     GresTypes=gpu
>
>     NodeName=evc1 CPUs=32 RealMemory=191917 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN NodeAddr=ivc1 Weight=1 Gres=gpu:2
>
>     [gres.conf]
>     NodeName=evc[1-10] Name=gpu File=/dev/nvidia0 COREs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
>     NodeName=evc[1-10] Name=gpu File=/dev/nvidia1 COREs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
>
>     [cgroup.conf]
>     ConstrainDevices=yes
>
>     [cgroup_allowed_devices_file.conf]
>     /dev/null
>     /dev/urandom
>     /dev/zero
>     /dev/sda*
>     /dev/cpu/*/*
>     /dev/pts/*
>
>     Thanks,
>     Paul.
>
>
>