[slurm-users] How does cgroups limit user access to GPUs?

Randall Radmer radmer at gmail.com
Wed Apr 10 19:42:18 UTC 2019


We have a Slurm cluster with a number of nodes, some of which have more
than one GPU.  Users select how many (or which) GPUs they want with srun's
"--gres" option.  Nothing fancy here, and in general this works as
expected.  But starting a few days ago we've had problems on one machine.
A specific user started a single-GPU session with srun, and nvidia-smi
reported one GPU, as expected.  But about two hours later, he could
suddenly see all of the GPUs with nvidia-smi.  To be clear, this is all
from the interactive session provided by Slurm.  He did not ssh to the
machine.  He's not running Docker.  Nothing odd as far as we can tell.

A big problem is that I've been unable to reproduce the issue.  I'm
confident that what this user is telling me is correct, but I can't do
much until/unless I can reproduce it.

My general question for this group is: how do I debug this if/when I am
able to reproduce it?
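
For example, if/when it does recur, here is what I'm assuming I would
capture first, from inside the affected srun session itself (I haven't
verified this is the right set of things to look at):

$ nvidia-smi -L        # which GPUs the session can actually see
$ cat /proc/$$/cgroup  # is the shell still inside the job's slurm cgroups?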

My specific question is: how does Slurm limit user access to GPUs?  I
understand it's done with cgroups, and I think I can see how it works for
CPUs (i.e. /sys/fs/cgroup/cpuset/slurm/uid_*/job_*/cpuset.cpus).  But I
can't figure out where or how GPUs are assigned to cgroups (if that's even
the correct phrase).
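
For comparison, the CPU-side constraint is easy to read from inside a job
(SLURM_JOB_ID is set in the job environment; I'm assuming the uid_*/job_*
layout above is the one that applies to our jobs):

$ cat /sys/fs/cgroup/cpuset/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/cpuset.cpus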

I did not find anything that looked interesting under
/sys/fs/cgroup/devices/slurm/uid_*/job_*

For example:
$ find /sys/fs/cgroup/devices/slurm/uid_*/job_* -name devices.list -exec cat {} \;
a *:* rwm
a *:* rwm
a *:* rwm
a *:* rwm
a *:* rwm
a *:* rwm
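
For what it's worth, here is what I think I should be checking on the node
itself, though this is me guessing at the relevant knobs (ConstrainDevices
in cgroup.conf, task/cgroup in slurm.conf) rather than something I've
confirmed is how the GRES/cgroup plumbing actually works, and the config
file path may differ on other systems:

$ grep -i ConstrainDevices /etc/slurm/cgroup.conf   # is device constraining even enabled?
$ scontrol show config | grep -i TaskPlugin         # is task/cgroup in use?
$ ls -l /dev/nvidia*                                # the NVIDIA device nodes (char major 195)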

Thanks much,
Randy