<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Hi Randall,<br>
<br>
Could you please, as a test, add the following line to the [Service]
section of the slurmd.service file (or put it in an override file)?<br>
<br>
Delegate=yes<br>
<br>
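A minimal sketch of the override variant, assuming the unit is named
slurmd.service and using the drop-in path that "systemctl edit slurmd"
creates by default (adjust if your packaging differs):<br>
<pre>
# /etc/systemd/system/slurmd.service.d/override.conf  (assumed path)
[Service]
Delegate=yes
</pre>
Afterwards, repeat the sequence that triggered the problem and check
nvidia-smi inside a single-GPU session again:<br>
<pre>
sudo systemctl daemon-reload
sudo systemctl restart slurmd
</pre>
<br>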
<br>
Best<br>
Marcus<br>
<br>
<br>
<br>
<div class="moz-cite-prefix">On 4/11/19 3:11 PM, Randall Radmer
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAFOfDQaKH39STCQ03B94YFJz_5AY8z5eojLG=OFRxUmcA7f35A@mail.gmail.com">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">It's now distressingly simple to reproduce
this, based on Kilian's clue (off topic, "Kilian's Clue"
sounds like a good title for a Hardy Boys Mystery Story).</div>
<div dir="ltr"><br>
</div>
<div dir="ltr">After limited testing, it seems to me that running
"systemctl daemon-reload" followed by "systemctl restart
slurmd" breaks it. See below:</div>
<div dir="ltr">
<div><br>
</div>
<div>
<div>[computelab-305:~]$ sudo systemctl restart slurmd</div>
<div>[computelab-305:~]$ nvidia-smi --query-gpu=index,name
--format=csv</div>
<div>index, name</div>
<div>0, Tesla T4</div>
<div>[computelab-305:~]$ sudo systemctl daemon-reload</div>
<div>[computelab-305:~]$ nvidia-smi --query-gpu=index,name
--format=csv</div>
<div>index, name</div>
<div>0, Tesla T4</div>
<div>[computelab-305:~]$ sudo systemctl restart slurmd</div>
<div>[computelab-305:~]$ nvidia-smi --query-gpu=index,name
--format=csv</div>
<div>index, name</div>
<div>0, Tesla T4</div>
<div>1, Tesla T4</div>
<div>2, Tesla T4</div>
<div>3, Tesla T4</div>
<div>4, Tesla T4</div>
<div>5, Tesla T4</div>
<div>6, Tesla T4</div>
<div>7, Tesla T4</div>
<div>[computelab-305:~]$ slurmd -V</div>
<div>slurm 17.11.9-2</div>
</div>
<div><br>
</div>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Apr 10, 2019 at 3:59
PM Kilian Cavalotti &lt;<a
href="mailto:kilian.cavalotti.work@gmail.com"
moz-do-not-send="true">kilian.cavalotti.work@gmail.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi
Randy!<br>
<br>
> We have a slurm cluster with a number of nodes, some of
which have more than one GPU. Users select how many or which
GPUs they want with srun's "--gres" option. Nothing fancy
here, and in general this works as expected. But starting a
few days ago we've had problems on one machine. A specific
user started a single-gpu session with srun, and nvidia-smi
reported one GPU, as expected. But about two hours later, he
suddenly could see all GPUs with nvidia-smi. To be clear,
this is all from the interactive session provided by Slurm. He
did not ssh to the machine. He's not running Docker. Nothing
odd as far as we can tell.<br>
><br>
> A big problem is I've been unable to reproduce the
problem. I have confidence that what this user is telling me
is correct, but I can't do much until/unless I can reproduce
it.<br>
<br>
I think this kind of behavior has already been reported a few
times:<br>
<a
href="https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html</a><br>
<a href="https://bugs.schedmd.com/show_bug.cgi?id=5300"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://bugs.schedmd.com/show_bug.cgi?id=5300</a><br>
<br>
As far as I can tell, it looks like this is probably systemd
messing<br>
with cgroups and deciding it's the king of cgroups on the
host.<br>
<br>
You'll find more context and details in<br>
<a href="https://bugs.schedmd.com/show_bug.cgi?id=5292"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://bugs.schedmd.com/show_bug.cgi?id=5292</a><br>
<br>
Cheers,<br>
-- <br>
Kilian<br>
<br>
</blockquote>
</div>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a class="moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de">wagner@itc.rwth-aachen.de</a>
<a class="moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de">www.itc.rwth-aachen.de</a>
</pre>
</body>
</html>