[slurm-users] Use gres to handle permissions of /dev/dri/card* and /dev/dri/renderD*?

Martin Pecka peci1 at seznam.cz
Fri Jan 7 16:34:04 UTC 2022


Okay, I verified the MultipleFiles approach on a test Slurm install with
one control machine and two nodes, and it works (with
ConstrainDevices=yes)!

     Name=gpu Type=3090 MultipleFiles=/dev/nvidia0,/dev/dri/card1,/dev/dri/renderD128
     Name=gpu Type=3090 MultipleFiles=/dev/nvidia1,/dev/dri/card2,/dev/dri/renderD129

I could attach with various --gres=gpu:3090:* configs - one card, two
cards - and I always got access only to the files belonging to the
acquired cards, in CUDA applications (nvidia*), in EGL (card*) and in
VAAPI-accelerated ffmpeg (renderD*; these should also be what Vulkan
uses, as I found out). With NVIDIA cards, it works flawlessly. I had a
problem with the i915 integrated GPU on a test notebook - I could set it
up as a gres (without the nvidia* device) and claim it or use the
renderD* device in ffmpeg, but VirtualGL did not run on the card*
device...
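
If anyone wants to reproduce the check, something like this should do
(a sketch, assuming the gres.conf above; with ConstrainDevices=yes the
device nodes typically still show up in /dev, it is the open() that the
devices cgroup denies):

     srun --gres=gpu:3090:1 bash -c '
         for d in /dev/nvidia[0-9]* /dev/dri/card* /dev/dri/renderD*; do
             # ( : < "$d" ) just tries to open the device for reading
             if ( : < "$d" ) 2>/dev/null; then
                 echo "open ok: $d"
             else
                 echo "denied:  $d"
             fi
         done'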

With Slurm 20.11, there is an unpleasant problem with the environment
variables, though: CUDA_VISIBLE_DEVICES and SLURM_STEP_GPUS contain
garbage. On a 2-GPU machine, CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 and
SLURM_STEP_GPUS=0,1,128,1,2,129. This bug was fixed in 21.08, and I
think a workaround for 20.11 would be to just pick every third value
from these lists in a prolog script (see the sketch below).
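
For completeness, the every-third-value workaround could look roughly
like this as a TaskProlog (an untested sketch; it assumes the variables
look exactly like the 20.11 example above and relies on TaskProlog's
"export NAME=value" stdout convention; whether the same filtering is
also right for CUDA_VISIBLE_DEVICES, or whether that one should rather
be rebuilt from the filtered SLURM_STEP_GPUS minors, would need a test
on an affected node):

     #!/bin/bash
     # Untested TaskProlog sketch for 20.11: keep fields 1, 4, 7, ... of
     # the comma-separated lists, i.e. every third value.
     every_third() {
         echo "$1" | awk -F, '{ out = "";
             for (i = 1; i <= NF; i += 3) out = out (out == "" ? "" : ",") $i;
             print out }'
     }
     # Lines of the form "export NAME=value" on stdout are applied to
     # the environment of the task being launched.
     [ -n "$SLURM_STEP_GPUS" ] &&
         echo "export SLURM_STEP_GPUS=$(every_third "$SLURM_STEP_GPUS")"
     [ -n "$CUDA_VISIBLE_DEVICES" ] &&
         echo "export CUDA_VISIBLE_DEVICES=$(every_third "$CUDA_VISIBLE_DEVICES")"
     exit 0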

One last thing to mention - for the card* and renderD* devices to work
in Slurm, you have to set them to mode 666 on the physical node
machines; cgroups will take care of blacklisting the non-claimed
devices. Also don't forget to add /dev/dri/card* and /dev/dri/renderD*
to /etc/slurm/cgroup_allowed_devices_file.conf.
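
For reference, the node-side bits could look like this (the udev rule
is just one possible way to make the 666 mode persistent, and its file
name is made up):

     # /etc/udev/rules.d/99-slurm-dri.rules (hypothetical file name):
     # keep DRM nodes world read/writable; the devices cgroup still
     # blocks access to the non-allocated cards inside jobs.
     SUBSYSTEM=="drm", KERNEL=="card[0-9]*", MODE="0666"
     SUBSYSTEM=="drm", KERNEL=="renderD[0-9]*", MODE="0666"

     # Lines appended to /etc/slurm/cgroup_allowed_devices_file.conf:
     /dev/dri/card*
     /dev/dri/renderD*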

Now, the only thing that remains to verify is that the mapping between
nvidiaX and cardX devices doesn't change across reboots. I wasn't able
to find any documentation about how either kind of device is
enumerated. On all machines I could access, the cardX and renderDY
devices follow the same order, and I'd bet that this is guaranteed (as
the render node is created by the same driver as the cardX device),
although you can't simply say Y=X+128 (see the example from my previous
email where card0 doesn't have any renderD). Experimentally, the order
is not the PCI Bus ID order (0000:01:00.0 has card2, while 0000:41:00.0
has card1 on one machine). On all machines I could access, it also
seemed to me that the relative order of the nvidiaX and cardX devices
stays the same. However, I know people say the ordering of nvidiaX
devices can change between reboots (or at least I think I saw something
like that written somewhere). Does anyone have a pointer to more
information?
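
Until that's clarified, one way to watch for drift would be to log the
PCI-bus-ID mapping on every boot, roughly like this (a sketch; it
assumes the proprietary driver's /proc/driver/nvidia/gpus/<bus id>/information
file with its "Device Minor" line):

     #!/bin/bash
     # Sketch: print the PCI bus ID behind each nvidiaX, cardX and
     # renderDX node, so the mapping can be logged at boot time and
     # compared across reboots.
     for info in /proc/driver/nvidia/gpus/*/information; do
         busid=$(basename "$(dirname "$info")")
         minor=$(awk -F: '/Device Minor/ { gsub(/[[:space:]]/, "", $2); print $2 }' "$info")
         echo "$busid  /dev/nvidia$minor"
     done
     for dev in /dev/dri/card* /dev/dri/renderD*; do
         name=$(basename "$dev")
         busid=$(basename "$(readlink -f "/sys/class/drm/$name/device")")
         echo "$busid  $dev"
     done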

Let me know if somebody else succeeds setting this up!

Martin



