[slurm-users] Use gres to handle permissions of /dev/dri/card* and /dev/dri/renderD*?

Stephan Roth stephan.roth at ee.ethz.ch
Thu Jan 6 19:27:28 UTC 2022

Hi Martin,

My (quick and unrefined) thoughts about this:

This could only work if you don't have ConstrainDevices=yes in your 
cgroup.conf, which I don't think is a good idea, as jobs could then use 
GPUs allocated to other jobs.
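For reference, the setting in question lives in cgroup.conf; a minimal 
sketch:

```
# cgroup.conf (sketch)
CgroupAutomount=yes
ConstrainDevices=yes    # each job only sees the devices of its own gres allocation
```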

Let's assume you don't use ConstrainDevices=yes:
The GPUs allocated to a job can only safely be identified in the job's 
context (task prologue). I assume you're aware of this, as your script 
reads the GPU list from the job's environment.

On a side note: AFAIK, these environment variables are supposed to be 
identical to the minor PCI device numbers (for CUDA_VISIBLE_DEVICES, 
provided CUDA_DEVICE_ORDER=PCI_BUS_ID is set). These might change after 
a node is rebooted. For your use case this shouldn't matter, though.
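For what it's worth, the usual pattern in a task prologue is to split 
that comma-separated variable into loopable IDs; a small sketch (the 
fallback value "0,2" is made up for illustration):

```shell
# Split the comma-separated GPU list Slurm exports into loopable IDs.
# CUDA_VISIBLE_DEVICES is set by Slurm in the step context; "0,2" is a
# made-up fallback for illustration only.
gpus="${CUDA_VISIBLE_DEVICES:-0,2}"
for id in ${gpus//,/ }; do
  echo "GPU $id"
done
```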

Then my question is: how can you safely use cgset in a job's context 
(i.e. with its user's privileges) to modify access to /dev/dri/card* etc.?

Is your goal to enable VirtualGL for jobs? If it is, I tried a solution 
that packages it, its dependencies, a minimal X11 server and TurboVNC 
into a Singularity image which can be used in a job.
This worked as a proof of concept for glxgears, but not for the software 
users wanted to run.

Eventually this might work with Vulkan instead of OpenGL. The software 
in question would have to be updated, too, and the GPU drivers would 
have to support the needed Vulkan features as well.

Any more thoughts and insights about this topic are appreciated by me as 
well.


On 06.01.22 18:20, Martin Pecka wrote:
> Hello, I'm reviving a bit of an old thread, but I just noticed I don't 
> see my January 2021 message in the archives, so I'm sending it again 
> now that the issue has come up again on our side.
> To quickly recap, we want to add permissions not only to /dev/nvidia* 
> devices based on the requested gres, but also to the corresponding 
> /dev/dri/card* and /dev/dri/renderD* devices - they are all connected to 
> the same GPU, but the additional two allow using the card for rendering 
> instead of CUDA computations etc. I had some idea how to achieve that 
> without changing SLURM codebase, and I got something that could almost 
> work. It probably just needs some polishing. Could anybody please 
> comment whether the proposed solution is a good idea?
> The 15 Jan 2021 message:
> So I started thinking whether this couldn't somehow be handled by a 
> prologue script and direct cgroup manipulation? I'm no expert in 
> either, so please check my line of thought.
> #!/bin/bash
> PATH=/usr/bin/:/bin
> # which devices cgroup do we run inside?
> cgroup=$(cat /proc/self/cgroup | grep devices | cut -d: -f3)  # or something else?
> # blacklist all DRM devices (major 226)
> cgset -r devices.deny="a 226:* rwm" devices:${cgroup}
> for NVIDIA_SMI_ID in ${gpus//,/ }; do
>   # find on which PCI path this device sits
>   pci_id=$(nvidia-smi -i $NVIDIA_SMI_ID --query-gpu=pci.bus_id --format=csv,noheader | tail -c+5 | tr '[:upper:]' '[:lower:]')
>   # find the DRM devices sitting on the same PCI bus
>   card=$(ls /sys/bus/pci/devices/${pci_id}/drm/ | grep card | xargs basename)
>   render=$(ls /sys/bus/pci/devices/${pci_id}/drm/ | grep renderD | xargs basename)
>   # allow access to the DRM devices
>   [ -n "${card}" ] && cgset -r devices.allow="c $(cat /sys/class/drm/${card}/dev) rw" devices:${cgroup} && echo "Allowed /dev/dri/${card} DRI device access"
>   [ -n "${render}" ] && cgset -r devices.allow="c $(cat /sys/class/drm/${render}/dev) rw" devices:${cgroup} && echo "Allowed /dev/dri/${render} render node access"
> done
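The cgroup lookup in the first lines of that script can be exercised on 
its own; a sketch with a made-up /proc/self/cgroup line (cgroup v1 
layout assumed):

```shell
# Extract the devices-controller path from a /proc/self/cgroup-style line.
# The sample line is invented; on a real node you'd read /proc/self/cgroup.
line="4:devices:/slurm/uid_1000/job_123/step_0"
cgroup=$(echo "$line" | grep devices | cut -d: -f3)
echo "$cgroup"
```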
> Now I wonder whether this should be Prolog=, TaskProlog= or something 
> else (that would also change whether I look at CUDA_VISIBLE_DEVICES or 
> SLURM_STEP_GPUS, and how I figure out the cgroup name). I guess that 
> if this script were run as the invoking user, nothing would prevent 
> them from regaining access to all devices. So I'd be inclined to treat 
> it as a Prolog= script run by root. How would I get the cgroup ID then? 
> Compose it from parts as mentioned in the slurm cgroups docs? 
> (/cgroup/cpuset/slurm/uid_100/job_123/step_0/task_2) Or is there a more 
> reliable way?
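If run from a root Prolog=, composing the path from parts might look 
like this sketch (the uid/job/step values are made up; a real Prolog 
would take them from SLURM_JOB_UID, SLURM_JOB_ID and so on):

```shell
# Compose the devices cgroup path following the layout from the Slurm
# cgroup docs. All values below are hypothetical, for illustration only.
uid=1000; job=123; step=0
cgroup="/slurm/uid_${uid}/job_${job}/step_${step}"
echo "$cgroup"
```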
> A related but off-topic idea popped up in my head when thinking about 
> GPUs. Most of them are actually a consolidation of several devices like 
> stream processors, encoders, decoders, ray tracers, shaders, memory etc. 
> Could it be possible (in the future) to actually offer each of these 
> pieces as a different gres? The problem is most of them do not have any 
> special file which the user could lock to tell the others he's playing 
> there now. So it'd probably require support at the level of the cgroup 
> implementation, which, in turn, would require changing all GPU drivers. 
> And it would require being able to request just chunks of GPU memory 
> (not sure if that's possible right now, but I think I saw some pull 
> request about that).
> Thank you for hints!
> Martin
> Dne 21.10.2020 v 19:09 Martin Pecka napsal(a):
>> Or maybe could this be "emulated" by a set of 3 GRES per card that are 
>> "linked" together? I.e. rules like "if the user requests GRES 
>> /dev/dri/card0, he will also automatically need to claim 
>> /dev/dri/renderD128 and /dev/nvidia0"?
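If I remember correctly, newer Slurm releases (21.08 and later) added a 
MultipleFiles= option to gres.conf that does exactly this kind of 
linking; a sketch, with example device paths:

```
# gres.conf (sketch; device paths are examples)
Name=gpu Type=nvidia MultipleFiles=/dev/nvidia0,/dev/dri/card0,/dev/dri/renderD128
```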
>> Dne 21.10.2020 v 18:52 Daniel Letai napsal(a):
>>> Take a look at https://github.com/SchedMD/slurm/search?q=dri%2F
>>> If the ROCM-SMI API is present, using AutoDetect=rsmi in gres.conf 
>>> might be enough, if I'm reading this right.
>>> Of course, this assumes the cards in question are AMD and not NVIDIA.
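For the AMD case, the corresponding gres.conf line would simply be 
(sketch):

```
# gres.conf (sketch, AMD GPUs detected via the ROCm SMI library)
AutoDetect=rsmi
```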
>>> On 20/10/2020 23:58, Mgr. Martin Pecka wrote:
>>>> Pinging this topic again. Nobody has an idea how to define multiple 
>>>> files to be treated as a single gres?
>>>> Thank you for help,
>>>> Martin Pecka
>>>> Dne 4.9.2020 v 21:29 Martin Pecka napsal(a):
>>>>> Hello, we want to use EGL backend for accessing OpenGL without the 
>>>>> need for Xorg. This approach requires access to devices 
>>>>> /dev/dri/card* and /dev/dri/renderD* . Is there a way to give 
>>>>> access to these devices along with /dev/nvidia* which we use for 
>>>>> CUDA? Ideally as a single generic resource that would give 
>>>>> permissions to all three files at once.
>>>>> Thank you for any tips.

Stephan Roth | ISG.EE D-ITET ETH Zurich | http://www.isg.ee.ethz.ch
+4144 632 30 59  |  ETF D 104  |  Sternwartstrasse 7  | 8092 Zurich
