[slurm-users] How does cgroups limit user access to GPUs?
Randall Radmer
radmer at gmail.com
Thu Apr 11 11:22:52 UTC 2019
Thanks Kilian! I'll look at this today.
-Randy
On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti <
kilian.cavalotti.work at gmail.com> wrote:
> Hi Randy!
>
> > We have a Slurm cluster with a number of nodes, some of which have more
> > than one GPU. Users select how many or which GPUs they want with srun's
> > "--gres" option. Nothing fancy here, and in general this works as
> > expected. But starting a few days ago we've had problems on one machine.
> > A specific user started a single-GPU session with srun, and nvidia-smi
> > reported one GPU, as expected. But about two hours later, he suddenly
> > could see all GPUs with nvidia-smi. To be clear, this is all from the
> > interactive session provided by Slurm. He did not ssh to the machine. He's
> > not running Docker. Nothing odd as far as we can tell.
> >
> > A big problem is I've been unable to reproduce the problem. I have
> > confidence that what this user is telling me is correct, but I can't do
> > much until/unless I can reproduce it.
>
> I think this kind of behavior has already been reported a few times:
> https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
> https://bugs.schedmd.com/show_bug.cgi?id=5300
>
> As far as I can tell, it looks like this is probably systemd interfering
> with cgroups and deciding it's the king of cgroups on the host.
>
> You'll find more context and details in
> https://bugs.schedmd.com/show_bug.cgi?id=5292
>
> Cheers,
> --
> Kilian
>
>
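For anyone trying to catch this in the act: a rough way to see whether a shell is still inside the job's device cgroup is to check which "devices" cgroup the process belongs to, once right after the session starts and again when all GPUs become visible. The sketch below is only an illustration and rests on several assumptions: cgroup v1 mounted under /sys/fs/cgroup, ConstrainDevices=yes in cgroup.conf, and Slurm's usual /slurm/uid_*/job_*/step_* cgroup layout. The function names are made up for this example.

```python
from pathlib import Path


def devices_cgroup_path(proc_cgroup_text):
    """Return the cgroup path of the v1 'devices' controller, or None.

    Each line of /proc/<pid>/cgroup has the form
    hierarchy-ID:controller-list:cgroup-path
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        _hier_id, controllers, path = parts
        if "devices" in controllers.split(","):
            return path
    return None


def looks_like_slurm_step(path):
    """Heuristic (assumed layout): Slurm step cgroups contain 'slurm' and 'job_'."""
    return path is not None and "slurm" in path and "job_" in path


if __name__ == "__main__":
    proc_file = Path("/proc/self/cgroup")
    if proc_file.exists():
        path = devices_cgroup_path(proc_file.read_text())
        print("devices cgroup:", path)
        print("inside a Slurm job cgroup:", looks_like_slurm_step(path))
        # Assumed cgroup v1 mount point; devices.list shows the allowed devices.
        if path is not None:
            dev_list = Path("/sys/fs/cgroup/devices" + path) / "devices.list"
            if dev_list.exists():
                print(dev_list.read_text())
```

If the reported path changes from a job_*/step_* cgroup to something like a systemd slice while the session is still open, that would be consistent with the process having been moved out of the cgroup Slurm created for it.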