[slurm-users] Using cgroups to hide GPUs on a shared controller/node

John Hearns hearnsj at googlemail.com
Tue May 21 11:39:25 UTC 2019


Sorry Dave, nothing handy. However, have a look at this writeup from You Know Who:
https://pbspro.atlassian.net/wiki/spaces/PD/pages/11599882/PP-325+Support+Cgroups
Look at the section on the devices subsystem.


You will need the major device number for the Nvidia devices, for example
on my system:
crw-rw-rw- 1 root root 195,   0 Mar  1 12:16 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Mar  1 12:16 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Mar  1 12:16 /dev/nvidiactl

Looking in /sys/fs/cgroup/devices there are two files:
--w-------   1 root root   0 May 21 12:28 devices.allow
--w-------   1 root root   0 May 21 12:28 devices.deny

These have the rather interesting property of being write-only ...
So, for a particular cgroup:
 echo "c 195:* rw"  > devices.deny
should deny read and write access to character devices with major number 195.
https://docs.oracle.com/cd/E37670_01/E41138/html/ol_devices_cgroups.html
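
For the GPU-hiding case, a rough sketch (the cgroup name 'login' is made up
for illustration; a cgroup v1 devices controller mounted at
/sys/fs/cgroup/devices is assumed):

  # create a cgroup for interactive logins and remove all NVIDIA character
  # devices (major 195) from its whitelist
  mkdir -p /sys/fs/cgroup/devices/login
  echo "c 195:* rwm" > /sys/fs/cgroup/devices/login/devices.deny
  # optionally allow the control device and a single GPU back
  echo "c 195:255 rw" > /sys/fs/cgroup/devices/login/devices.allow
  echo "c 195:0 rw"   > /sys/fs/cgroup/devices/login/devices.allow
  # move the current shell into the cgroup; its children inherit it
  echo $$ > /sys/fs/cgroup/devices/login/tasks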

On Tue, 21 May 2019 at 01:28, Dave Evans <rdevans at ece.ubc.ca> wrote:

> Do you have that resource handy? I looked into the cgroups documentation
> but I see very little on tutorials for modifying the permissions.
>
> On Mon, May 20, 2019 at 2:45 AM John Hearns <hearnsj at googlemail.com>
> wrote:
>
>> Two replies here.
>> First off, for normal user logins you can direct them into a cgroup - I
>> looked into this about a year ago and it was actually quite easy.
>> As I remember, there is a service or utility available which does just
>> that. Of course the user cgroup would not have access to the GPU devices.
>>
>> Expanding on my theme, it is probably a good idea to have all the system
>> processes contained in a 'boot cpuset' - that is, at system boot time
>> allocate a small number of cores to the system daemons, Slurm processes
>> and probably the user login sessions, thus freeing up the other CPUs
>> exclusively for batch jobs.
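>>
>> As a rough illustration of a boot cpuset (cgroup v1 cpuset controller
>> assumed; the core range and the PID are made-up values):
>>
>>   # create a 'boot' cpuset restricted to the first four cores
>>   mkdir -p /sys/fs/cgroup/cpuset/boot
>>   echo 0-3 > /sys/fs/cgroup/cpuset/boot/cpuset.cpus
>>   echo 0   > /sys/fs/cgroup/cpuset/boot/cpuset.mems
>>   # move a system daemon (PID 1234 here) into the boot cpuset
>>   echo 1234 > /sys/fs/cgroup/cpuset/boot/tasks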
>>
>> Also, you could try simply setting CUDA_VISIBLE_DEVICES to an empty value
>> in one of the system-wide login scripts.
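>>
>> A minimal sketch of that idea (the /etc/profile.d path and the
>> SLURM_JOB_ID guard are assumptions, not something from this thread):
>>
>>   # /etc/profile.d/hide-gpus.sh  (hypothetical file name)
>>   # Hide the GPUs from plain interactive shells; job steps launched by
>>   # Slurm get CUDA_VISIBLE_DEVICES set for them by the gres plugin.
>>   if [ -z "$SLURM_JOB_ID" ]; then
>>       export CUDA_VISIBLE_DEVICES=""
>>   fi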
>>
>> On Mon, 20 May 2019 at 08:38, Nathan Harper <nathan.harper at cfms.org.uk>
>> wrote:
>>
>>> This doesn't directly answer your question, but in Feb last year on the
>>> ML there was a discussion about limiting user resources on login nodes
>>> ("Stopping compute usage on login nodes"). Some of the suggestions
>>> included the use of cgroups to do so, and it's possible that those methods
>>> could be extended to limit access to GPUs, so it might be worth looking
>>> into.
>>>
>>> On Sat, 18 May 2019 at 00:28, Dave Evans <rdevans at ece.ubc.ca> wrote:
>>>
>>>>
>>>> We are using a single-system "cluster" and want some control of fair use
>>>> of the GPUs. Users are not supposed to be able to use the GPUs until they
>>>> have allocated the resources through Slurm. We have no head
>>>> node, so slurmctld, slurmdbd, and slurmd are all run on the same system.
>>>>
>>>> I have a configuration working now such that the GPUs can be scheduled
>>>> and allocated.
>>>> However, logging into the system before allocating GPUs gives full
>>>> access to all of them.
>>>>
>>>> I would like to configure Slurm's cgroup support to disable access to the
>>>> GPUs until they have been allocated.
>>>>
>>>> On first login, I get:
>>>> nvidia-smi -q | grep UUID
>>>>     GPU UUID                        : GPU-6076ce0a-bc03-a53c-6616-0fc727801c27
>>>>     GPU UUID                        : GPU-5620ec48-7d76-0398-9cc1-f1fa661274f3
>>>>     GPU UUID                        : GPU-176d0514-0cf0-df71-e298-72d15f6dcd7f
>>>>     GPU UUID                        : GPU-af03c80f-6834-cb8c-3133-2f645975f330
>>>>     GPU UUID                        : GPU-ef10d039-a432-1ac1-84cf-3bb79561c0d3
>>>>     GPU UUID                        : GPU-38168510-c356-33c9-7189-4e74b5a1d333
>>>>     GPU UUID                        : GPU-3428f78d-ae91-9a74-bcd6-8e301c108156
>>>>     GPU UUID                        : GPU-c0a831c0-78d6-44ec-30dd-9ef5874059a5
>>>>
>>>>
>>>> And running from the queue:
>>>> srun -N 1 --gres=gpu:2 nvidia-smi -q | grep UUID
>>>>     GPU UUID                        : GPU-6076ce0a-bc03-a53c-6616-0fc727801c27
>>>>     GPU UUID                        : GPU-5620ec48-7d76-0398-9cc1-f1fa661274f3
>>>>
>>>>
>>>> Pastes of my config files are:
>>>> ## slurm.conf ##
>>>> https://pastebin.com/UxP67cA8
>>>>
>>>>
>>>> ## cgroup.conf ##
>>>> CgroupAutomount=yes
>>>> CgroupReleaseAgentDir="/etc/slurm/cgroup"
>>>>
>>>> ConstrainCores=yes
>>>> ConstrainDevices=yes
>>>> ConstrainRAMSpace=yes
>>>> #TaskAffinity=yes
>>>>
>>>> ## cgroup_allowed_devices_file.conf ##
>>>> /dev/null
>>>> /dev/urandom
>>>> /dev/zero
>>>> /dev/sda*
>>>> /dev/cpu/*/*
>>>> /dev/pts/*
>>>> /dev/nvidia*
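>>>>
>>>> For ConstrainDevices=yes to actually hide unallocated GPUs, slurmd also
>>>> needs to know which device file belongs to each GPU, which comes from
>>>> gres.conf. A minimal sketch (an 8-GPU node is assumed here; adjust the
>>>> paths and count to the real hardware):
>>>>
>>>> ## gres.conf ##
>>>> Name=gpu File=/dev/nvidia[0-7]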
>>>>
>>>
>>>
>>> --
>>> Nathan Harper // IT Systems Lead
>>>
>>> e: nathan.harper at cfms.org.uk   t: 0117 906 1104   m: 0787 551 0891
>>> w: www.cfms.org.uk
>>> CFMS Services Ltd // Bristol & Bath Science Park // Dirac Crescent //
>>> Emersons Green // Bristol // BS16 7FR
>>>
>>> CFMS Services Ltd is registered in England and Wales No 05742022 - a
>>> subsidiary of CFMS Ltd
>>> CFMS Services Ltd registered office // 43 Queens Square // Bristol //
>>> BS1 4QP
>>>
>>