[slurm-users] How does cgroups limit user access to GPUs?
Randall Radmer
radmer at gmail.com
Wed Apr 10 19:42:18 UTC 2019
We have a Slurm cluster with a number of nodes, some of which have more
than one GPU. Users select how many or which GPUs they want with srun's
"--gres" option. Nothing fancy here, and in general this works as
expected. But starting a few days ago we've had problems on one machine.
A specific user started a single-GPU session with srun, and nvidia-smi
reported one GPU, as expected. But about two hours later, he suddenly
could see all GPUs with nvidia-smi. To be clear, this is all from the
interactive session provided by Slurm. He did not ssh to the machine. He's
not running Docker. Nothing odd as far as we can tell.
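For reference, the kind of session I mean looks roughly like this (the
options shown are illustrative, not the user's exact command line):

$ srun --gres=gpu:1 --pty bash    # interactive shell with one GPU requested
$ nvidia-smi                      # initially reported just the one granted GPU
$ echo $CUDA_VISIBLE_DEVICES      # I believe Slurm also sets this for the granted GPU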
A big problem is that I've been unable to reproduce the issue. I'm confident
that what this user is telling me is correct, but I can't do much
until/unless I can reproduce it.
My general question for this group is: how do I debug this if/when I am able
to reproduce it?
My specific question is: how does Slurm limit user access to GPUs? I
understand it's with cgroups, and I think I can see how it works with CPUs
(i.e. /sys/fs/cgroup/cpuset/slurm/uid_*/job_*/cpuset.cpus). But I can't
figure out where or how GPUs are assigned to cgroups (if that's even the
correct phrase).
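My rough (and possibly wrong) understanding is that with ConstrainDevices=yes
in cgroup.conf, the task/cgroup plugin uses the devices cgroup subsystem to
deny a job access to any GPU device node it wasn't granted. Since
/dev/nvidia0, /dev/nvidia1, ... are character devices with major number 195,
I expected to see something like this:

$ ls -l /dev/nvidia[0-9]*   # the GPU device nodes; char devices, major 195
# With ConstrainDevices=yes I'd expect a job's devices.list to contain an
# explicit "c 195:<minor> rwm" entry for just the granted GPU(s),
# rather than a blanket allow-all entry.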
I did not find anything that looked interesting under
/sys/fs/cgroup/devices/slurm/uid_*/job_*
For example:
$ find /sys/fs/cgroup/devices/slurm/uid_*/job_* -name devices.list -exec cat {} \;
a *:* rwm
a *:* rwm
a *:* rwm
a *:* rwm
a *:* rwm
a *:* rwm
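If I'm reading the devices cgroup documentation correctly, "a *:* rwm" means
"allow all devices", i.e. the devices controller isn't restricting anything
for these jobs. Assuming that's right, these are the checks I plan to run
next (config file paths are guesses for our install):

$ grep -i ConstrainDevices /etc/slurm/cgroup.conf    # is device constraining enabled at all?
$ scontrol show config | grep -i TaskPlugin          # should include task/cgroup for cgroup.conf constraints to apply
$ find /sys/fs/cgroup/devices/slurm/uid_*/job_*/step_* -name devices.list -exec cat {} \;
# step-level cgroups, in case the constraint is applied there rather than at the job level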
Thanks much,
Randy