[slurm-users] How does cgroups limit user access to GPUs?

Marcus Wagner wagner at itc.rwth-aachen.de
Thu Apr 11 14:51:01 UTC 2019


Hi Randall,

could you please, as a test, add the following line to the [Service]
section of the slurmd.service file (or add it via an override file):

Delegate=yes
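
For reference, a minimal drop-in could look like this (the file name and
path are just an example; any .conf file under slurmd.service.d/ works):

   # /etc/systemd/system/slurmd.service.d/delegate.conf  (example path)
   [Service]
   Delegate=yes

then run "systemctl daemon-reload" and "systemctl restart slurmd" to pick
it up. Delegate=yes tells systemd that the unit manages its own cgroup
sub-hierarchy and that systemd should leave it alone, so a later
daemon-reload should no longer wipe the device restrictions Slurm sets up
for the job. Repeating your daemon-reload / restart sequence and running
nvidia-smi inside a job afterwards should show whether that helps.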


Best
Marcus



On 4/11/19 3:11 PM, Randall Radmer wrote:
> It's now distressingly simple to reproduce this, based on Kilian's 
> clue (off topic, "Kilian's Clue" sounds like a good title for a Hardy 
> Boys Mystery Story).
>
> After limited testing, it seems to me that running "systemctl 
> daemon-reload" followed by "systemctl restart slurmd" breaks it.  See 
> below:
>
> [computelab-305:~]$ sudo systemctl restart slurmd
> [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
> index, name
> 0, Tesla T4
> [computelab-305:~]$ sudo systemctl daemon-reload
> [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
> index, name
> 0, Tesla T4
> [computelab-305:~]$ sudo systemctl restart slurmd
> [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
> index, name
> 0, Tesla T4
> 1, Tesla T4
> 2, Tesla T4
> 3, Tesla T4
> 4, Tesla T4
> 5, Tesla T4
> 6, Tesla T4
> 7, Tesla T4
> [computelab-305:~]$ slurmd -V
> slurm 17.11.9-2
>
>
> On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti 
> <kilian.cavalotti.work at gmail.com> wrote:
>
>     Hi Randy!
>
>     > We have a slurm cluster with a number of nodes, some of which
>     have more than one GPU.  Users select how many or which GPUs they
>     want with srun's "--gres" option.  Nothing fancy here, and in
>     general this works as expected.  But starting a few days ago we've
>     had problems on one machine.  A specific user started a single-gpu
>     session with srun, and nvidia-smi reported one GPU, as expected. 
>     But about two hours later, he suddenly could see all GPUs with
>     nvidia-smi.  To be clear, this is all from the interactive session
>     provided by Slurm.  He did not ssh to the machine.  He's not
>     running Docker.  Nothing odd as far as we can tell.
>     >
>     > A big problem is I've been unable to reproduce the problem.  I
>     have confidence that what this user is telling me is correct, but
>     I can't do much until/unless I can reproduce it.
>
>     I think this kind of behavior has already been reported a few times:
>     https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
>     https://bugs.schedmd.com/show_bug.cgi?id=5300
>
>     As far as I can tell, it looks like this is probably systemd messing
>     with cgroups and deciding it's the king of cgroups on the host.
>
>     You'll find more context and details in
>     https://bugs.schedmd.com/show_bug.cgi?id=5292
>
>     Cheers,
>     -- 
>     Kilian
>

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de
