<div dir="ltr"><div dir="ltr">Yes, I was just testing that. Adding "Delegate=yes" seems to fix the problem (see below), but I wanted to try a few more things before saying anything.</div><div dir="ltr"><div><br></div><div><div>[computelab-136:~]$ grep ^Delegate /etc/systemd/system/slurmd.service</div><div>Delegate=yes</div><div>[computelab-136:~]$ nvidia-smi --query-gpu=index,name --format=csv</div><div>index, name</div><div>0, Tesla T4</div><div>[computelab-136:~]$ sudo systemctl daemon-reload; sudo systemctl restart slurmd</div><div>[computelab-136:~]$ nvidia-smi --query-gpu=index,name --format=csv</div><div>index, name</div><div>0, Tesla T4</div></div><div><br></div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Apr 11, 2019 at 7:53 AM Marcus Wagner <<a href="mailto:wagner@itc.rwth-aachen.de">wagner@itc.rwth-aachen.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
Hi Randall,<br>
<br>
could you please, as a test, add the following line to the [Service]
section of the slurmd.service file (or add it via an override file)?<br>
<br>
Delegate=yes<br>
<br>
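For the override-file variant, a minimal sketch, assuming a drop-in at
/etc/systemd/system/slurmd.service.d/override.conf (the file name is an
assumption; it is the name "systemctl edit slurmd" would normally create):<br>
<pre># Hypothetical drop-in: /etc/systemd/system/slurmd.service.d/override.conf
[Service]
Delegate=yes
</pre>
After "systemctl daemon-reload" and "systemctl restart slurmd", running
"systemctl show -p Delegate slurmd" should report Delegate=yes if the
setting was picked up.<br>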
<br>
Best<br>
Marcus<br>
<br>
<br>
<br>
<div class="gmail-m_-5346374440321707734moz-cite-prefix">On 4/11/19 3:11 PM, Randall Radmer
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">It's now distressingly simple to reproduce
this, based on Kilian's clue (off topic, "Kilian's Clue"
sounds like a good title for a Hardy Boys Mystery Story).</div>
<div dir="ltr"><br>
</div>
<div dir="ltr">After limited testing, it seems that running
"systemctl daemon-reload" followed by "systemctl restart
slurmd" breaks the GPU confinement (all eight GPUs become visible). See below:</div>
<div dir="ltr">
<div><br>
</div>
<div>
<div>[computelab-305:~]$ sudo systemctl restart slurmd</div>
<div>[computelab-305:~]$ nvidia-smi --query-gpu=index,name
--format=csv</div>
<div>index, name</div>
<div>0, Tesla T4</div>
<div>[computelab-305:~]$ sudo systemctl daemon-reload</div>
<div>[computelab-305:~]$ nvidia-smi --query-gpu=index,name
--format=csv</div>
<div>index, name</div>
<div>0, Tesla T4</div>
<div>[computelab-305:~]$ sudo systemctl restart slurmd</div>
<div>[computelab-305:~]$ nvidia-smi --query-gpu=index,name
--format=csv</div>
<div>index, name</div>
<div>0, Tesla T4</div>
<div>1, Tesla T4</div>
<div>2, Tesla T4</div>
<div>3, Tesla T4</div>
<div>4, Tesla T4</div>
<div>5, Tesla T4</div>
<div>6, Tesla T4</div>
<div>7, Tesla T4</div>
<div>[computelab-305:~]$ slurmd -V</div>
<div>slurm 17.11.9-2</div>
</div>
<div><br>
</div>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Apr 10, 2019 at 3:59
PM Kilian Cavalotti <<a href="mailto:kilian.cavalotti.work@gmail.com" target="_blank">kilian.cavalotti.work@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi
Randy!<br>
<br>
> We have a slurm cluster with a number of nodes, some of
which have more than one GPU. Users select how many or which
GPUs they want with srun's "--gres" option. Nothing fancy
here, and in general this works as expected. But starting a
few days ago we've had problems on one machine. A specific
user started a single-gpu session with srun, and nvidia-smi
reported one GPU, as expected. But about two hours later, he
suddenly could see all GPUs with nvidia-smi. To be clear,
this is all from the interactive session provided by Slurm. He
did not ssh to the machine. He's not running Docker. Nothing
odd as far as we can tell.<br>
><br>
> A big problem is I've been unable to reproduce the
problem. I have confidence that what this user is telling me
is correct, but I can't do much until/unless I can reproduce
it.<br>
<br>
I think this kind of behavior has already been reported a few
times:<br>
<a href="https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html" rel="noreferrer" target="_blank">https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html</a><br>
<a href="https://bugs.schedmd.com/show_bug.cgi?id=5300" rel="noreferrer" target="_blank">https://bugs.schedmd.com/show_bug.cgi?id=5300</a><br>
<br>
As far as I can tell, this is probably systemd messing with
cgroups and deciding it's the king of cgroups on the host.<br>
<br>
You'll find more context and details in<br>
<a href="https://bugs.schedmd.com/show_bug.cgi?id=5292" rel="noreferrer" target="_blank">https://bugs.schedmd.com/show_bug.cgi?id=5292</a><br>
<br>
Cheers,<br>
-- <br>
Kilian<br>
<br>
</blockquote>
</div>
</blockquote>
<br>
<pre class="gmail-m_-5346374440321707734moz-signature" cols="72">--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a class="gmail-m_-5346374440321707734moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de" target="_blank">wagner@itc.rwth-aachen.de</a>
<a class="gmail-m_-5346374440321707734moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de" target="_blank">www.itc.rwth-aachen.de</a>
</pre>
</div>
</blockquote></div>