[slurm-users] How does cgroups limit user access to GPUs?

Marcus Wagner wagner at itc.rwth-aachen.de
Thu Apr 11 15:17:54 UTC 2019


I assume that without Delegate=yes this would also happen to regular 
jobs, which means nightly updates could "destroy" the cgroups created by 
Slurm and therefore let the jobs out "into the wild".
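
A minimal sketch of the override approach, assuming systemd and a unit 
named slurmd.service (the drop-in path is what "systemctl edit" creates 
by default):

  # Create a drop-in at /etc/systemd/system/slurmd.service.d/override.conf
  sudo systemctl edit slurmd

  # Contents of the override:
  [Service]
  Delegate=yes

  # Apply it:
  sudo systemctl daemon-reload
  sudo systemctl restart slurmd

With Delegate=yes, systemd should treat the cgroup subtree below slurmd 
as delegated and leave the per-job cgroups alone when the daemon 
configuration is reloaded.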

Best
Marcus

P.S.:
We had a similar problem with LSF

On 4/11/19 3:58 PM, Randall Radmer wrote:
> Yes, I was just testing that.  Adding "Delegate=yes" seems to fix the 
> problem (see below), but I wanted to try a few more things before 
> saying anything.
>
> [computelab-136:~]$ grep ^Delegate /etc/systemd/system/slurmd.service
> Delegate=yes
> [computelab-136:~]$ nvidia-smi --query-gpu=index,name --format=csv
> index, name
> 0, Tesla T4
> [computelab-136:~]$ sudo systemctl daemon-reload; sudo systemctl restart slurmd
> [computelab-136:~]$ nvidia-smi --query-gpu=index,name --format=csv
> index, name
> 0, Tesla T4
>
>
>
> On Thu, Apr 11, 2019 at 7:53 AM Marcus Wagner 
> <wagner at itc.rwth-aachen.de> wrote:
>
>     Hi Randall,
>
>     Could you please, as a test, add the following line to the [Service]
>     section of the slurmd.service file (or add an override file)?
>
>     Delegate=yes
>
>
>     Best
>     Marcus
>
>
>
>     On 4/11/19 3:11 PM, Randall Radmer wrote:
>>     It's now distressingly simple to reproduce this, based on
>>     Kilian's clue (off topic, "Kilian's Clue" sounds like a good
>>     title for a Hardy Boys mystery story).
>>
>>     After limited testing, it seems to me that running "systemctl
>>     daemon-reload" followed by "systemctl restart slurmd" breaks
>>     it.  See below:
>>
>>     [computelab-305:~]$ sudo systemctl restart slurmd
>>     [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
>>     index, name
>>     0, Tesla T4
>>     [computelab-305:~]$ sudo systemctl daemon-reload
>>     [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
>>     index, name
>>     0, Tesla T4
>>     [computelab-305:~]$ sudo systemctl restart slurmd
>>     [computelab-305:~]$ nvidia-smi --query-gpu=index,name --format=csv
>>     index, name
>>     0, Tesla T4
>>     1, Tesla T4
>>     2, Tesla T4
>>     3, Tesla T4
>>     4, Tesla T4
>>     5, Tesla T4
>>     6, Tesla T4
>>     7, Tesla T4
>>     [computelab-305:~]$ slurmd -V
>>     slurm 17.11.9-2
>>
>>
>>     On Wed, Apr 10, 2019 at 3:59 PM Kilian Cavalotti
>>         <kilian.cavalotti.work at gmail.com> wrote:
>>
>>         Hi Randy!
>>
>>         > We have a slurm cluster with a number of nodes, some of
>>         which have more than one GPU.  Users select how many or which
>>         GPUs they want with srun's "--gres" option.  Nothing fancy
>>         here, and in general this works as expected.  But starting a
>>         few days ago we've had problems on one machine.  A specific
>>         user started a single-gpu session with srun, and nvidia-smi
>>         reported one GPU, as expected.  But about two hours later, he
>>         suddenly could see all GPUs with nvidia-smi.  To be clear,
>>         this is all from the interactive session provided by Slurm.  He
>>         did not ssh to the machine.  He's not running Docker. 
>>         Nothing odd as far as we can tell.
>>         >
>>         > A big problem is I've been unable to reproduce the
>>         problem.  I have confidence that what this user is telling me
>>         is correct, but I can't do much until/unless I can reproduce it.
>>
>>         I think this kind of behavior has already been reported a few
>>         times:
>>         https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html
>>         https://bugs.schedmd.com/show_bug.cgi?id=5300
>>
>>         As far as I can tell, this looks like systemd messing with
>>         cgroups and deciding it's the king of cgroups on the host.
>>
>>         You'll find more context and details in
>>         https://bugs.schedmd.com/show_bug.cgi?id=5292
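>>
>>         A quick way to check from inside the srun session (a sketch,
>>         assuming the cgroup v1 devices controller and Slurm's default
>>         hierarchy; the uid/job/step numbers below are just examples):
>>
>>         # Which devices cgroup is this shell in?
>>         grep devices /proc/self/cgroup
>>         # e.g. 6:devices:/slurm/uid_1234/job_5678/step_0
>>
>>         # What is that cgroup allowed to access?
>>         cat /sys/fs/cgroup/devices/slurm/uid_1234/job_5678/step_0/devices.list
>>         # A line like "a *:* rwm" means all devices are allowed again,
>>         # i.e. the per-job confinement is gone.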
>>
>>         Cheers,
>>         -- 
>>         Kilian
>>
>
>     -- 
>     Marcus Wagner, Dipl.-Inf.
>
>     IT Center
>     Abteilung: Systeme und Betrieb
>     RWTH Aachen University
>     Seffenter Weg 23
>     52074 Aachen
>     Tel: +49 241 80-24383
>     Fax: +49 241 80-624383
>     wagner at itc.rwth-aachen.de
>     www.itc.rwth-aachen.de
>

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de
