[slurm-users] How does cgroups limit user access to GPUs?

Wed Apr 10 21:53:21 UTC 2019

Hi Randy!

> We have a slurm cluster with a number of nodes, some of which have more than one GPU.  Users select how many or which GPUs they want with srun's "--gres" option.  Nothing fancy here, and in general this works as expected.  But starting a few days ago we've had problems on one machine.  A specific user started a single-gpu session with srun, and nvidia-smi reported one GPU, as expected.  But about two hours later, he suddenly could see all GPUs with nvidia-smi.  To be clear, this is all from the interactive session provided by Slurm.  He did not ssh to the machine.  He's not running Docker.  Nothing odd as far as we can tell.
>
> A big problem is I've been unable to reproduce the problem.  I have confidence that what this user is telling me is correct, but I can't do much until/unless I can reproduce it.

I think this kind of behavior has already been reported a few times:
https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
https://bugs.schedmd.com/show_bug.cgi?id=5300

As far as I can tell, this is probably systemd messing with the
cgroups and deciding it's the king of cgroups on the host.

You'll find more context and details in
https://bugs.schedmd.com/show_bug.cgi?id=5292
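
If it helps with catching it in the act, here is a rough sketch of a check
the user could run from inside the session when the extra GPUs show up. It
only parses /proc/self/cgroup and reports which "devices" cgroup the process
currently sits in. The assumptions are mine and not confirmed for your
setup: cgroup v1 with a separate "devices" controller, and Slurm job cgroups
living under a path that contains "/job_<id>". The function names are made
up for illustration.

```python
def devices_cgroup(proc_cgroup_text):
    """Return the cgroup path of the v1 'devices' controller, or None.

    Each line of /proc/<pid>/cgroup has the form
    hierarchy-ID:controller-list:cgroup-path
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _hierarchy_id, controllers, path = parts
        if "devices" in controllers.split(","):
            return path
    return None


def looks_like_slurm_job(path):
    """Heuristic (assumption): Slurm job cgroups contain '/job_' in the path."""
    return path is not None and "/job_" in path


def report_current_process():
    """Print where the calling process currently sits in the devices hierarchy."""
    with open("/proc/self/cgroup") as f:
        path = devices_cgroup(f.read())
    if looks_like_slurm_job(path):
        print("devices cgroup still looks like a Slurm job:", path)
    else:
        print("devices cgroup is NOT a Slurm job cgroup:", path)
```

Calling report_current_process() at the start of the session and again once
nvidia-smi shows all GPUs should tell you whether the process is still in the
job's devices cgroup at that point, which is the thing to compare against the
bug reports above.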

Cheers,
-- 
Kilian