<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    I assume that without Delegate=yes this would also happen to regular
    jobs, which means nightly updates could "destroy" the cgroups
    created by Slurm and therefore let the jobs out "into the wild".<br>
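    <br>
    For reference, Delegate=yes can also go into a drop-in override
    instead of the packaged unit file, so that package updates do not
    overwrite it again. A minimal sketch (the file name is just what
    "systemctl edit" creates by default):<br>
    <pre>
# "sudo systemctl edit slurmd" opens an editor and creates
# /etc/systemd/system/slurmd.service.d/override.conf with this content:
[Service]
Delegate=yes

# afterwards pick up the change and restart the daemon:
sudo systemctl daemon-reload
sudo systemctl restart slurmd
</pre>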
    <br>
    Best<br>
    Marcus<br>
    <br>
    P.S.:<br>
    We had a similar problem with LSF<br>
    <br>
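    P.P.S.:<br>
    A quick way to check whether a running job has already escaped its
    device cgroup (just a sketch, assuming cgroup v1 with
    ConstrainDevices=yes; &lt;PID&gt;, &lt;uid&gt;, &lt;jobid&gt; and
    &lt;stepid&gt; are placeholders):<br>
    <pre>
# on the compute node, for a process of the job step:
grep devices /proc/&lt;PID&gt;/cgroup
# while still confined, the devices line points into Slurm's hierarchy, e.g.
#   N:devices:/slurm/uid_&lt;uid&gt;/job_&lt;jobid&gt;/step_&lt;stepid&gt;/task_0
# if it only shows "/", the process is back in the root devices cgroup
# and nvidia-smi will show all GPUs again
</pre>
    <br>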
    <div class="moz-cite-prefix">On 4/11/19 3:58 PM, Randall Radmer
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAFOfDQbrBphB-+9rwwD55fUTg4Vcg5RhCMMJ8YsA0RPs6d1PQQ@mail.gmail.com">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <div dir="ltr">
        <div dir="ltr">Yes, I was just testing that.  Adding
          "Delegate=yes" seems to fix the problem (see below), but
          wanted to try a few more things before saying anything.</div>
        <div dir="ltr">
          <div><br>
          </div>
          <div>
            <div>[computelab-136:~]$ grep ^Delegate
              /etc/systemd/system/slurmd.service</div>
            <div>Delegate=yes</div>
            <div>[computelab-136:~]$ nvidia-smi --query-gpu=index,name
              --format=csv</div>
            <div>index, name</div>
            <div>0, Tesla T4</div>
            <div>[computelab-136:~]$ sudo systemctl daemon-reload; sudo
              systemctl restart slurmd</div>
            <div>[computelab-136:~]$ nvidia-smi --query-gpu=index,name
              --format=csv</div>
            <div>index, name</div>
            <div>0, Tesla T4</div>
          </div>
          <div><br>
          </div>
          <div><br>
          </div>
        </div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Thu, Apr 11, 2019 at 7:53
          AM Marcus Wagner <<a
            href="mailto:wagner@itc.rwth-aachen.de"
            moz-do-not-send="true">wagner@itc.rwth-aachen.de</a>>
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div bgcolor="#FFFFFF"> Hi Randall,<br>
            <br>
            could you please, as a test, add the following line to the
            [Service] section of the slurmd.service file (or add an
            override file)?<br>
            <br>
            Delegate=yes<br>
            <br>
            <br>
            Best<br>
            Marcus<br>
            <br>
            <br>
            <br>
            <div class="gmail-m_-5346374440321707734moz-cite-prefix">On
              4/11/19 3:11 PM, Randall Radmer wrote:<br>
            </div>
            <blockquote type="cite">
              <div dir="ltr">
                <div dir="ltr">
                  <div dir="ltr">It's now distressingly simple to
                    reproduce this, based on Kilinan's clue (off topic,
                    "Kilinan's Clue" sounds like a good title for
                    a Hardy Boys Mystery Story).</div>
                  <div dir="ltr"><br>
                  </div>
                  <div dir="ltr">After limited testing, seems to me that
                    running "systemctl daemon-reload"  followed by
                    "systemctl restart slurmd" breaks it.  See below:</div>
                  <div dir="ltr">
                    <div><br>
                    </div>
                    <div>
                      <div>[computelab-305:~]$ sudo systemctl restart
                        slurmd</div>
                      <div>[computelab-305:~]$ nvidia-smi
                        --query-gpu=index,name --format=csv</div>
                      <div>index, name</div>
                      <div>0, Tesla T4</div>
                      <div>[computelab-305:~]$ sudo systemctl
                        daemon-reload</div>
                      <div>[computelab-305:~]$ nvidia-smi
                        --query-gpu=index,name --format=csv</div>
                      <div>index, name</div>
                      <div>0, Tesla T4</div>
                      <div>[computelab-305:~]$ sudo systemctl restart
                        slurmd</div>
                      <div>[computelab-305:~]$ nvidia-smi
                        --query-gpu=index,name --format=csv</div>
                      <div>index, name</div>
                      <div>0, Tesla T4</div>
                      <div>1, Tesla T4</div>
                      <div>2, Tesla T4</div>
                      <div>3, Tesla T4</div>
                      <div>4, Tesla T4</div>
                      <div>5, Tesla T4</div>
                      <div>6, Tesla T4</div>
                      <div>7, Tesla T4</div>
                      <div>[computelab-305:~]$ slurmd -V</div>
                      <div>slurm 17.11.9-2</div>
                    </div>
                    <div><br>
                    </div>
                  </div>
                </div>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Wed, Apr 10, 2019
                  at 3:59 PM Kilian Cavalotti <<a
                    href="mailto:kilian.cavalotti.work@gmail.com"
                    target="_blank" moz-do-not-send="true">kilian.cavalotti.work@gmail.com</a>>
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px
                  0px 0.8ex;border-left:1px solid
                  rgb(204,204,204);padding-left:1ex">Hi Randy!<br>
                  <br>
                  > We have a slurm cluster with a number of nodes,
                  some of which have more than one GPU.  Users select
                  how many or which GPUs they want with srun's "--gres"
                  option.  Nothing fancy here, and in general this works
                  as expected.  But starting a few days ago we've had
                  problems on one machine.  A specific user started a
                  single-gpu session with srun, and nvidia-smi reported
                  one GPU, as expected.  But about two hours later, he
                  suddenly could see all GPUs with nvidia-smi.  To be
                  clear, this is all from the interactive session provided
                  by Slurm.  He did not ssh to the machine.  He's not
                  running Docker.  Nothing odd as far as we can tell.<br>
                  ><br>
                  > A big problem is I've been unable to reproduce
                  the problem.  I have confidence that what this user is
                  telling me is correct, but I can't do much
                  until/unless I can reproduce it.<br>
                  <br>
                  I think this kind of behavior has already been
                  reported a few times:<br>
                  <a
href="https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html"
                    rel="noreferrer" target="_blank"
                    moz-do-not-send="true">https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html</a><br>
                  <a
                    href="https://bugs.schedmd.com/show_bug.cgi?id=5300"
                    rel="noreferrer" target="_blank"
                    moz-do-not-send="true">https://bugs.schedmd.com/show_bug.cgi?id=5300</a><br>
                  <br>
                  As far as I can tell, it looks like this is probably
                  systemd messing<br>
                  with cgroups and deciding it's the king of cgroups on
                  the host.<br>
                  <br>
                  You'll find more context and details in<br>
                  <a
                    href="https://bugs.schedmd.com/show_bug.cgi?id=5292"
                    rel="noreferrer" target="_blank"
                    moz-do-not-send="true">https://bugs.schedmd.com/show_bug.cgi?id=5292</a><br>
                  <br>
                  Cheers,<br>
                  -- <br>
                  Kilian<br>
                  <br>
                </blockquote>
              </div>
            </blockquote>
            <br>
            <pre class="gmail-m_-5346374440321707734moz-signature" cols="72">-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a class="gmail-m_-5346374440321707734moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de" target="_blank" moz-do-not-send="true">wagner@itc.rwth-aachen.de</a>
<a class="gmail-m_-5346374440321707734moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de" target="_blank" moz-do-not-send="true">www.itc.rwth-aachen.de</a>
</pre>
          </div>
        </blockquote>
      </div>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a class="moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de">wagner@itc.rwth-aachen.de</a>
<a class="moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de">www.itc.rwth-aachen.de</a>
</pre>
  </body>
</html>