[slurm-users] slurm and device cgroups

Ransom, Geoffrey M. Geoffrey.Ransom at jhuapl.edu
Thu Mar 4 21:20:16 UTC 2021


Hello,
   I am trying to debug an issue with EGL support: after updating the NVIDIA drivers, eglGetDisplay and eglQueryDevicesEXT fail inside slurm jobs when they cannot access all of the /dev/nvidia# devices. I am wondering how slurm uses device cgroups so that I can reproduce the same cgroup setup by hand and test our issue outside of slurm.
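
For concreteness, this is roughly the by-hand setup I have in mind (cgroup v1 on RHEL 7, using the cgroup-tools utilities). The cgroup name and the device numbers (major 195 for the /dev/nvidia# devices, 195:255 for /dev/nvidiactl, taken from ls -l /dev/nvidia* on our nodes) are my own guesses at what slurm might be doing, not anything I have confirmed:

    # run as root; deny the whole NVIDIA major, then re-allow one GPU
    cgcreate -g devices:/egltest
    echo 'c 195:* rwm'   > /sys/fs/cgroup/devices/egltest/devices.deny    # drop all /dev/nvidia# devices
    echo 'c 195:0 rwm'   > /sys/fs/cgroup/devices/egltest/devices.allow   # re-allow /dev/nvidia0 only
    echo 'c 195:255 rwm' > /sys/fs/cgroup/devices/egltest/devices.allow   # re-allow /dev/nvidiactl
    cgexec -g devices:/egltest ./egl_test_program                         # run the failing EGL test in that cgroup

If slurm is doing something materially different (extra allow entries for things like /dev/nvidia-uvm, restrictions applied at a different level of the hierarchy, etc.), that is exactly what I am trying to find out.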

We are running slurm 20.02.5 with cgroups configured so that cores, RAM space, and devices are constrained (ConstrainCores, ConstrainRAMSpace, and ConstrainDevices in cgroup.conf).

When I get an allocation with 1 of the 4 GPUs on a system, nvidia-smi only sees the GPU I was assigned, and I get "permission denied" when I try to access the other /dev/nvidia# devices.
When I look at the cgroup, either under /sys/fs/cgroup (slurm/uid_######/job_#########/step_0) or with the cgget command, I see memory.limit_in_bytes and cpuset.cpus set as expected, but devices.list is "a *:* rwm" even though I am blocked from the other devices. The device restriction does seem to be working, but the cgroup parameters for it are not set the way I would expect.
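
For reference, this is roughly how I am checking those values (the uid and job numbers are placeholders for the real IDs on our node):

    cat /sys/fs/cgroup/devices/slurm/uid_######/job_#########/step_0/devices.list
    cgget -r devices.list /slurm/uid_######/job_#########/step_0
    cgget -r memory.limit_in_bytes -r cpuset.cpus /slurm/uid_######/job_#########/step_0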

How does slurm manage the device cgroup settings on RHEL 7, and how can I check and mimic them?

Thanks.