<div dir="ltr"><div dir="ltr"><div dir="ltr">I guess my next question is, are there any negative repercussions to setting "Delegate=yes" in slurmd.service?</div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Apr 11, 2019 at 8:21 AM Marcus Wagner <<a href="mailto:wagner@itc.rwth-aachen.de">wagner@itc.rwth-aachen.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

I assume that without Delegate=yes this would also happen to regular jobs, which means nightly updates could "destroy" the cgroups created by Slurm and therefore let the jobs out "into the wild".

Best
Marcus

P.S.:
We had a similar problem with LSF.
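
A rough way to spot a job that has been let "into the wild" is to check the shell's cgroup membership from inside the srun session; this is only a sketch and assumes a cgroup v1 hierarchy with ConstrainDevices=yes in cgroup.conf:

# Run inside the job's interactive shell. On a healthy node the devices
# controller line should point at a slurm/uid_<uid>/job_<jobid>/... cgroup;
# if it falls back to "/", the job is no longer confined.
grep 'devices:' /proc/$$/cgroup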
<div class="gmail-m_5221216371352740225moz-cite-prefix">On 4/11/19 3:58 PM, Randall Radmer
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">Yes, I was just testing that. Adding
"Delegate=yes" seems to fix the problem (see below), but
wanted to try a few more things before saying anything.</div>
<div dir="ltr">
<div><br>
</div>
<div>
<div>[computelab-136:~]$ grep ^Delegate
/etc/systemd/system/slurmd.service</div>
<div>Delegate=yes</div>
<div>[computelab-136:~]$ nvidia-smi --query-gpu=index,name
--format=csv</div>
<div>index, name</div>
<div>0, Tesla T4</div>
<div>[computelab-136:~]$ sudo systemctl daemon-reload; sudo
systemctl restart slurmd</div>
<div>[computelab-136:~]$ nvidia-smi --query-gpu=index,name
--format=csv</div>
<div>index, name</div>
<div>0, Tesla T4</div>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
</div>
<br>
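
The grep above only shows what is in the unit file on disk; to confirm that the loaded unit actually picked up the setting after the daemon-reload, systemd can be asked directly. A minimal sketch; the property output varies slightly between systemd versions:

systemctl show slurmd.service -p Delegate
# expect "Delegate=yes" once the reloaded unit carries the setting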
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Apr 11, 2019 at 7:53
AM Marcus Wagner <<a href="mailto:wagner@itc.rwth-aachen.de" target="_blank">wagner@itc.rwth-aachen.de</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> Hi Randall,<br>
<br>
could you please for a test add the following lines to the
service part of the slurmd.service file (or add an override
file).<br>
<br>
Delegate=yes<br>
<br>
<br>
Best<br>
Marcus<br>
<br>
<br>
<br>
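
For reference, the override-file route might look roughly like this; a sketch that assumes systemctl edit drops the snippet into /etc/systemd/system/slurmd.service.d/override.conf (path and editor behaviour depend on the distribution):

sudo systemctl edit slurmd.service
# in the editor that opens, add:
#   [Service]
#   Delegate=yes
sudo systemctl daemon-reload
sudo systemctl restart slurmd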
<div class="gmail-m_5221216371352740225gmail-m_-5346374440321707734moz-cite-prefix">On
4/11/19 3:11 PM, Randall Radmer wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">It's now distressingly simple to
reproduce this, based on Kilinan's clue (off topic,
"Kilinan's Clue" sounds like a good title for
a Hardy Boys Mystery Story).</div>
<div dir="ltr"><br>
</div>
<div dir="ltr">After limited testing, seems to me that
running "systemctl daemon-reload" followed by
"systemctl restart slurmd" breaks it. See below:</div>
<div dir="ltr">
<div><br>
</div>
<div>
<div>[computelab-305:~]$ sudo systemctl restart
slurmd</div>
<div>[computelab-305:~]$ nvidia-smi
--query-gpu=index,name --format=csv</div>
<div>index, name</div>
<div>0, Tesla T4</div>
<div>[computelab-305:~]$ sudo systemctl
daemon-reload</div>
<div>[computelab-305:~]$ nvidia-smi
--query-gpu=index,name --format=csv</div>
<div>index, name</div>
<div>0, Tesla T4</div>
<div>[computelab-305:~]$ sudo systemctl restart
slurmd</div>
<div>[computelab-305:~]$ nvidia-smi
--query-gpu=index,name --format=csv</div>
<div>index, name</div>
<div>0, Tesla T4</div>
<div>1, Tesla T4</div>
<div>2, Tesla T4</div>
<div>3, Tesla T4</div>
<div>4, Tesla T4</div>
<div>5, Tesla T4</div>
<div>6, Tesla T4</div>
<div>7, Tesla T4</div>
<div>[computelab-305:~]$ slurmd -V</div>
<div>slurm 17.11.9-2</div>
</div>
<div><br>
</div>
</div>
</div>
</div>
<br>
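
One way to watch what the daemon-reload actually does to the step's device cgroup; a sketch that assumes cgroup v1 with Slurm's task/cgroup device constraint and the usual slurm/uid_<uid>/job_<jobid>/step_<stepid> layout (the exact path may differ on other setups):

# From inside the srun session, before and after the daemon-reload:
cat /sys/fs/cgroup/devices/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/step_0/devices.list
# A constrained step lists only the granted /dev/nvidia* (plus base) devices;
# an unrestricted devices cgroup shows "a *:* rwm" instead.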
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Apr 10, 2019
at 3:59 PM Kilian Cavalotti <<a href="mailto:kilian.cavalotti.work@gmail.com" target="_blank">kilian.cavalotti.work@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Randy!<br>
<br>
> We have a slurm cluster with a number of nodes,
some of which have more than one GPU. Users select
how many or which GPUs they want with srun's "--gres"
option. Nothing fancy here, and in general this works
as expected. But starting a few days ago we've had
problems on one machine. A specific user started a
single-gpu session with srun, and nvidia-smi reported
one GPU, as expected. But about two hours later, he
suddenly could see all GPUs with nvidia-smi. To be
clear, this is all from the iterative session provided
by Slurm. He did not ssh to the machine. He's not
running Docker. Nothing odd as far as we can tell.<br>
><br>
> A big problem is I've been unable to reproduce
the problem. I have confidence that what this user is
telling me is correct, but I can't do much
until/unless I can reproduce it.<br>
<br>
I think this kind of behavior has already been
reported a few times:<br>
<a href="https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html" rel="noreferrer" target="_blank">https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html</a><br>
<a href="https://bugs.schedmd.com/show_bug.cgi?id=5300" rel="noreferrer" target="_blank">https://bugs.schedmd.com/show_bug.cgi?id=5300</a><br>
<br>
As far as I can tell, it looks like this is probably
systemd messing<br>
up with cgroups and deciding it's the king of cgroups
on the host.<br>
<br>
You'll find more context and details in<br>
<a href="https://bugs.schedmd.com/show_bug.cgi?id=5292" rel="noreferrer" target="_blank">https://bugs.schedmd.com/show_bug.cgi?id=5292</a><br>
<br>
Cheers,<br>
-- <br>
Kilian<br>
<br>
</blockquote>
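
A quick way to see which part of the cgroup tree systemd considers its own to manage (and, per the reports above, may rewrite on a daemon-reload) is to dump the hierarchy; where slurmd and the job-step processes show up depends on the setup:

systemd-cgls --no-pager
# slurmd.service and its child processes typically appear under system.slice;
# without Delegate=yes systemd treats that subtree as its own, which matches
# the "king of cgroups" behaviour described above.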
<pre class="gmail-m_5221216371352740225moz-signature" cols="72">--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a class="gmail-m_5221216371352740225moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de" target="_blank">wagner@itc.rwth-aachen.de</a>
<a class="gmail-m_5221216371352740225moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de" target="_blank">www.itc.rwth-aachen.de</a>
</pre>