> We have a slurm cluster with a number of nodes, some of which have more than one GPU.  Users select how many or which GPUs they want with srun's "--gres" option.  Nothing fancy here, and in general this works as expected.  But starting a few days ago we've had problems on one machine.  A specific user started a single-gpu session with srun, and nvidia-smi reported one GPU, as expected.  But about two hours later, he suddenly could see all GPUs with nvidia-smi.  To be clear, this is all from the interactive session provided by Slurm.  He did not ssh to the machine.  He's not running Docker.  Nothing odd as far as we can tell.
> A big problem is I've been unable to reproduce the problem.  I have confidence that what this user is telling me is correct, but I can't do much until/unless I can reproduce it.

I think this kind of behavior has already been reported a few times:

As far as I can tell, this is probably systemd interfering with
cgroups: it behaves as if it were the sole owner of the cgroup
hierarchy on the host.
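
Since the problem is hard to reproduce, it may help to have the user
run a small check inside the job that notices the moment the GPU
restriction disappears. The sketch below is only an illustration, not
something taken from Slurm itself: it assumes that Slurm sets
CUDA_VISIBLE_DEVICES to one entry per allocated GPU, and that
"nvidia-smi -L" prints one line starting with "GPU " per device the
process can actually reach. The function names are made up for this
example.

```python
import os
import subprocess


def count_listed_gpus(nvidia_smi_output):
    """Count devices in `nvidia-smi -L` output (lines starting with 'GPU ')."""
    return sum(
        1 for line in nvidia_smi_output.splitlines() if line.startswith("GPU ")
    )


def count_allocated_gpus(env=None):
    """Count entries in CUDA_VISIBLE_DEVICES (assumed to be set by Slurm)."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "")
    return len([entry for entry in value.split(",") if entry.strip()])


def constraint_lost(nvidia_smi_output, env=None):
    """True when more GPUs are reachable than the job was allocated."""
    return count_listed_gpus(nvidia_smi_output) > count_allocated_gpus(env)


if __name__ == "__main__":
    # Run inside the srun session; requires nvidia-smi on the node.
    listing = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    ).stdout
    if constraint_lost(listing):
        print("more GPUs visible than allocated: device restriction seems lost")
    else:
        print("visible GPUs match the allocation")
```

Running this from a loop or a cron-like wrapper in the session and
logging a timestamp would at least tell you when the restriction went
away, which you can then compare against the systemd journal on that
node.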

You'll find more context and details in


