[slurm-users] slurm and device cgroups
Ransom, Geoffrey M.
Geoffrey.Ransom at jhuapl.edu
Thu Mar 4 22:17:48 UTC 2021
Well, reading the source, it looks like xcgroup_set_params just writes to the devices.allow and devices.deny files. I haven't yet found what cg->path is set to, but presumably it is /sys/fs/cgroup/slurm/uid_######/job_#########/step_0 or the equivalent for the job in question.
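To make the mechanism concrete, here is a minimal Python sketch of what "writing to devices.allow and devices.deny" could look like when done by hand. Several things are assumptions on my part and not taken from the slurm source: the helper names, the use of character-device major 195 for the /dev/nvidia# nodes, the "rwm" access string, and the idea that each GPU gets exactly one allow or deny rule. The cgroup directory is left as a parameter because I have not confirmed what cg->path really is.

```python
import os


def build_device_rules(all_minors, allowed_minors, major=195, access="rwm"):
    """Return (filename, rule) pairs, one per GPU, in minor-number order.

    Assumption: the GPUs are character devices with the given major number,
    so /dev/nvidia1 becomes the rule "c 195:1 rwm".
    """
    rules = []
    for minor in sorted(all_minors):
        # GPUs in the allocation go to devices.allow, all others to devices.deny.
        target = "devices.allow" if minor in allowed_minors else "devices.deny"
        rules.append((target, f"c {major}:{minor} {access}"))
    return rules


def apply_device_rules(cgroup_dir, rules):
    """Write the rules into cgroup_dir, one rule per open/write/close cycle."""
    for filename, rule in rules:
        with open(os.path.join(cgroup_dir, filename), "w") as handle:
            handle.write(rule)


# Hypothetical usage: a job that was given GPU 1 out of GPUs 0-3.
#   rules = build_device_rules([0, 1, 2, 3], {1})
#   apply_device_rules(path_to_a_test_devices_cgroup, rules)  # needs root
```

Pointing apply_device_rules at a scratch cgroup created just for testing, not at a live job's cgroup, keeps the experiment from disturbing running jobs.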
I'm still not sure why "cgget -a /slurm/uid_######/job_#########/step_0" shows reasonable values for cores and memory but not for the devices.list entry.
If anyone has any insight into how I can see and reproduce the device cgroup setup of a slurm job by hand, it would be appreciated.
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Ransom, Geoffrey M.
Sent: Thursday, March 4, 2021 4:20 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: [slurm-users] slurm and device cgroups
Hello
I am trying to debug an issue with EGL support: after updating the NVIDIA drivers, eglGetDisplay and eglQueryDevicesEXT now fail inside slurm if they can't access all of the /dev/nvidia# devices. I am wondering how slurm uses device cgroups, so that I can implement the same cgroup setup by hand and test our issue outside of slurm.
We have slurm 20.02.05 with cgroups set up to constrain cores, RAM space, and devices.
When I get onto an allocation with 1 of the 4 GPUs on the system, nvidia-smi only sees the GPU I was assigned, and I get permission denied when I try to access the other /dev/nvidia# devices.
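The access check described above can be repeated in a scriptable way, inside and outside an allocation, with something like the following Python sketch. The function names are my own, and opening the node read/write is only an assumption about what counts as "access" here; the EGL calls may touch the devices differently.

```python
import errno
import os


def probe_device(path):
    """Try to open a device node read/write and report the outcome.

    Returns "ok" on success, otherwise the symbolic errno name
    (for example "EPERM", "EACCES" or "ENOENT").
    """
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "ok"


def probe_devices(paths):
    """Probe several device nodes and return a {path: outcome} mapping."""
    return {path: probe_device(path) for path in paths}


# Hypothetical usage on a 4-GPU node:
#   for path, result in probe_devices(f"/dev/nvidia{i}" for i in range(4)).items():
#       print(path, result)
```

Comparing the output from a login shell with the output from inside the job shows exactly which nodes the job is blocked from.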
When I look at the cgroup, either through /sys/fs/cgroup/slurm/uid_######/job_#########/step_0 or with the cgget command, I see the values for memory.limit_in_bytes and cpuset.cpus being set as expected, but the devices.list value is set to "a *:* rwm" even though I am being blocked from other devices. The device restriction does seem to be working, but the cgroup parameters for it are not set as I would expect to see.
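For comparing these values across jobs, a small Python sketch along these lines reads the same files directly. It assumes a cgroup v1 layout in which each controller has its own directory under /sys/fs/cgroup (memory, cpuset, devices), which is my assumption about the RHEL 7 setup and not something confirmed above; the function name and the controller-to-file pairing are likewise only illustrative.

```python
import os


def read_cgroup_values(relative_path, wanted, root="/sys/fs/cgroup"):
    """Read selected cgroup files for one job step.

    relative_path: the step's path below each controller directory,
                   e.g. the "/slurm/uid_.../job_.../step_0" string given to cgget.
    wanted:        iterable of (controller, filename) pairs.
    Returns {filename: contents}, with a marker string for unreadable files.
    """
    values = {}
    relative_path = relative_path.lstrip("/")  # keep os.path.join from discarding root
    for controller, filename in wanted:
        full_path = os.path.join(root, controller, relative_path, filename)
        try:
            with open(full_path) as handle:
                values[filename] = handle.read().strip()
        except OSError as exc:
            values[filename] = f"<unreadable: {exc.strerror}>"
    return values


# The three files discussed here, paired with the controllers assumed to own them.
WANTED = [
    ("memory", "memory.limit_in_bytes"),
    ("cpuset", "cpuset.cpus"),
    ("devices", "devices.list"),
]
```

Calling read_cgroup_values with the job step's path and WANTED returns the three values in one dictionary, which makes it easy to diff a slurm-created cgroup against one built by hand.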
How does slurm manage the device cgroup settings on RHEL 7 so I can check and mimic them?
Thanks.