<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    I assume that without Delegate=yes this would also happen to regular
    jobs, which means nightly updates could "destroy" the cgroups
    created by Slurm and therefore let the jobs out "into the wild".<br>
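    <br>
    For reference, Delegate=yes can also go into a drop-in override
    instead of the packaged unit file, so that package updates do not
    overwrite it again. A minimal sketch (the file name is just what
    "systemctl edit" creates by default):<br>
    <pre>
# "sudo systemctl edit slurmd" opens an editor and creates
# /etc/systemd/system/slurmd.service.d/override.conf with this content:
[Service]
Delegate=yes

# afterwards pick up the change and restart the daemon:
sudo systemctl daemon-reload
sudo systemctl restart slurmd
</pre>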
    <br>
    Best<br>
    Marcus<br>
    <br>
    P.S.:<br>
    We had a similar problem with LSF<br>
    <br>
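    P.P.S.:<br>
    A quick way to check whether a running job has already escaped its
    device cgroup (just a sketch, assuming cgroup v1 with
    ConstrainDevices=yes; &lt;PID&gt;, &lt;uid&gt;, &lt;jobid&gt; and
    &lt;stepid&gt; are placeholders):<br>
    <pre>
# on the compute node, for a process of the job step:
grep devices /proc/&lt;PID&gt;/cgroup
# while still confined, the devices line points into Slurm's hierarchy, e.g.
#   N:devices:/slurm/uid_&lt;uid&gt;/job_&lt;jobid&gt;/step_&lt;stepid&gt;/task_0
# if it only shows "/", the process is back in the root devices cgroup
# and nvidia-smi will show all GPUs again
</pre>
    <br>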
    <div class="moz-cite-prefix">On 4/11/19 3:58 PM, Randall Radmer
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAFOfDQbrBphB-+9rwwD55fUTg4Vcg5RhCMMJ8YsA0RPs6d1PQQ@mail.gmail.com">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <div dir="ltr">
        <div dir="ltr">Yes, I was just testing that.  Adding
          "Delegate=yes" seems to fix the problem (see below), but
          wanted to try a few more things before saying anything.</div>
        <div dir="ltr">
          <div><br>
          </div>
          <div>
            <div>[computelab-136:~]$ grep ^Delegate
              /etc/systemd/system/slurmd.service</div>
            <div>Delegate=yes</div>
            <div>[computelab-136:~]$ nvidia-smi --query-gpu=index,name
              --format=csv</div>
            <div>index, name</div>
            <div>0, Tesla T4</div>
            <div>[computelab-136:~]$ sudo systemctl daemon-reload; sudo
              systemctl restart slurmd</div>
            <div>[computelab-136:~]$ nvidia-smi --query-gpu=index,name
              --format=csv</div>
            <div>index, name</div>
            <div>0, Tesla T4</div>
          </div>
          <div><br>
          </div>
          <div><br>
          </div>
        </div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Thu, Apr 11, 2019 at 7:53
          AM Marcus Wagner <<a
            href="mailto:wagner@itc.rwth-aachen.de"
            moz-do-not-send="true">wagner@itc.rwth-aachen.de</a>>
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div bgcolor="#FFFFFF"> Hi Randall,<br>
            <br>
            could you please, as a test, add the following line to the
            [Service] section of the slurmd.service file (or add an
            override file)?<br>
            <br>
            Delegate=yes<br>
            <br>
            <br>
            Best<br>
            Marcus<br>
            <br>
            <br>
            <br>
            <div class="gmail-m_-5346374440321707734moz-cite-prefix">On
              4/11/19 3:11 PM, Randall Radmer wrote:<br>
            </div>
            <blockquote type="cite">
              <div dir="ltr">
                <div dir="ltr">
                  <div dir="ltr">It's now distressingly simple to
                    reproduce this, based on Kilinan's clue (off topic,
                    "Kilinan's Clue" sounds like a good title for
                    a Hardy Boys Mystery Story).</div>
                  <div dir="ltr"><br>
                  </div>
                  <div dir="ltr">After limited testing, seems to me that
                    running "systemctl daemon-reload"  followed by
                    "systemctl restart slurmd" breaks it.  See below:</div>
                  <div dir="ltr">
                    <div><br>
                    </div>
                    <div>
                      <div>[computelab-305:~]$ sudo systemctl restart
                        slurmd</div>
                      <div>[computelab-305:~]$ nvidia-smi
                        --query-gpu=index,name --format=csv</div>
                      <div>index, name</div>
                      <div>0, Tesla T4</div>
                      <div>[computelab-305:~]$ sudo systemctl
                        daemon-reload</div>
                      <div>[computelab-305:~]$ nvidia-smi
                        --query-gpu=index,name --format=csv</div>
                      <div>index, name</div>
                      <div>0, Tesla T4</div>
                      <div>[computelab-305:~]$ sudo systemctl restart
                        slurmd</div>
                      <div>[computelab-305:~]$ nvidia-smi
                        --query-gpu=index,name --format=csv</div>
                      <div>index, name</div>
                      <div>0, Tesla T4</div>
                      <div>1, Tesla T4</div>
                      <div>2, Tesla T4</div>
                      <div>3, Tesla T4</div>
                      <div>4, Tesla T4</div>
                      <div>5, Tesla T4</div>
                      <div>6, Tesla T4</div>
                      <div>7, Tesla T4</div>
                      <div>[computelab-305:~]$ slurmd -V</div>
                      <div>slurm 17.11.9-2</div>
                    </div>
                    <div><br>
                    </div>
                  </div>
                </div>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Wed, Apr 10, 2019
                  at 3:59 PM Kilian Cavalotti <<a
                    href="mailto:kilian.cavalotti.work@gmail.com"
                    target="_blank" moz-do-not-send="true">kilian.cavalotti.work@gmail.com</a>>
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px
                  0px 0.8ex;border-left:1px solid
                  rgb(204,204,204);padding-left:1ex">Hi Randy!<br>
                  <br>
                  > We have a slurm cluster with a number of nodes,
                  some of which have more than one GPU.  Users select
                  how many or which GPUs they want with srun's "--gres"
                  option.  Nothing fancy here, and in general this works
                  as expected.  But starting a few days ago we've had
                  problems on one machine.  A specific user started a
                  single-gpu session with srun, and nvidia-smi reported
                  one GPU, as expected.  But about two hours later, he
                  suddenly could see all GPUs with nvidia-smi.  To be
                  clear, this is all from the interactive session provided
                  by Slurm.  He did not ssh to the machine.  He's not
                  running Docker.  Nothing odd as far as we can tell.<br>
                  ><br>
                  > A big problem is I've been unable to reproduce
                  the problem.  I have confidence that what this user is
                  telling me is correct, but I can't do much
                  until/unless I can reproduce it.<br>
                  <br>
                  I think this kind of behavior has already been
                  reported a few times:<br>
                  <a
href="https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html"
                    rel="noreferrer" target="_blank"
                    moz-do-not-send="true">https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html</a><br>
                  <a
                    href="https://bugs.schedmd.com/show_bug.cgi?id=5300"
                    rel="noreferrer" target="_blank"
                    moz-do-not-send="true">https://bugs.schedmd.com/show_bug.cgi?id=5300</a><br>
                  <br>
                  As far as I can tell, it looks like this is probably
                  systemd messing<br>
                  with cgroups and deciding it's the king of cgroups on
                  the host.<br>
                  <br>
                  You'll find more context and details in<br>
                  <a
                    href="https://bugs.schedmd.com/show_bug.cgi?id=5292"
                    rel="noreferrer" target="_blank"
                    moz-do-not-send="true">https://bugs.schedmd.com/show_bug.cgi?id=5292</a><br>
                  <br>
                  Cheers,<br>
                  -- <br>
                  Kilian<br>
                  <br>
                </blockquote>
              </div>
            </blockquote>
            <br>
            <pre class="gmail-m_-5346374440321707734moz-signature" cols="72">-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a class="gmail-m_-5346374440321707734moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de" target="_blank" moz-do-not-send="true">wagner@itc.rwth-aachen.de</a>
<a class="gmail-m_-5346374440321707734moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de" target="_blank" moz-do-not-send="true">www.itc.rwth-aachen.de</a>
</pre>
          </div>
        </blockquote>
      </div>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a class="moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de">wagner@itc.rwth-aachen.de</a>
<a class="moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de">www.itc.rwth-aachen.de</a>
</pre>
  </body>
</html>