<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    Hi Randall,<br>

    <br>

    Could you please, as a test, add the following line to the [Service]

    section of the slurmd.service file (or put it in an override file)?<br>

    <br>

    Delegate=yes<br>
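    <br>
    A minimal sketch of the override variant, assuming the unit is named
    slurmd.service and that drop-ins live in the usual systemd location
    (the file name override.conf is only an example, not something
    taken from your setup):<br>
    <pre># /etc/systemd/system/slurmd.service.d/override.conf  (assumed path)
[Service]
# Ask systemd to leave the cgroup subtree below slurmd to slurmd itself
Delegate=yes
</pre>
    Afterwards, a "systemctl daemon-reload" followed by "systemctl
    restart slurmd" should be needed for the setting to take effect.<br>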

    <br>

    <br>

    Best<br>

    Marcus<br>

    <br>

    <br>

    <br>

    <div class="moz-cite-prefix">On 4/11/19 3:11 PM, Randall Radmer

      wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAFOfDQaKH39STCQ03B94YFJz_5AY8z5eojLG=OFRxUmcA7f35A@mail.gmail.com">

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      <div dir="ltr">

        <div dir="ltr">

          <div dir="ltr">It's now distressingly simple to reproduce

            this, based on Kilian's clue (off topic: "Kilian's Clue"

            sounds like a good title for a Hardy Boys Mystery Story).</div>

          <div dir="ltr"><br>

          </div>

          <div dir="ltr">After limited testing, it seems to me that running

            "systemctl daemon-reload" followed by "systemctl restart

            slurmd" breaks it. See below:</div>

          <div dir="ltr">

            <div><br>

            </div>

            <div>

              <div>[computelab-305:~]$ sudo systemctl restart slurmd</div>

              <div>[computelab-305:~]$ nvidia-smi --query-gpu=index,name

                --format=csv</div>

              <div>index, name</div>

              <div>0, Tesla T4</div>

              <div>[computelab-305:~]$ sudo systemctl daemon-reload</div>

              <div>[computelab-305:~]$ nvidia-smi --query-gpu=index,name

                --format=csv</div>

              <div>index, name</div>

              <div>0, Tesla T4</div>

              <div>[computelab-305:~]$ sudo systemctl restart slurmd</div>

              <div>[computelab-305:~]$ nvidia-smi --query-gpu=index,name

                --format=csv</div>

              <div>index, name</div>

              <div>0, Tesla T4</div>

              <div>1, Tesla T4</div>

              <div>2, Tesla T4</div>

              <div>3, Tesla T4</div>

              <div>4, Tesla T4</div>

              <div>5, Tesla T4</div>

              <div>6, Tesla T4</div>

              <div>7, Tesla T4</div>

              <div>[computelab-305:~]$ slurmd -V</div>

              <div>slurm 17.11.9-2</div>

            </div>

            <div><br>

            </div>

          </div>

        </div>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Wed, Apr 10, 2019 at 3:59

          PM Kilian Cavalotti &lt;<a

            href="mailto:kilian.cavalotti.work@gmail.com"

            moz-do-not-send="true">kilian.cavalotti.work@gmail.com</a>&gt;

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px

          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi

          Randy!<br>

          <br>

          &gt; We have a Slurm cluster with a number of nodes, some of

          which have more than one GPU.  Users select how many or which

          GPUs they want with srun's "--gres" option.  Nothing fancy

          here, and in general this works as expected.  But starting a

          few days ago we've had problems on one machine.  A specific

          user started a single-gpu session with srun, and nvidia-smi

          reported one GPU, as expected.  But about two hours later, he

          suddenly could see all GPUs with nvidia-smi.  To be clear,

          this is all from the interactive session provided by Slurm.  He

          did not ssh to the machine.  He's not running Docker.  Nothing

          odd as far as we can tell.<br>

          &gt;<br>

          &gt; A big problem is I've been unable to reproduce the

          problem.  I have confidence that what this user is telling me

          is correct, but I can't do much until/unless I can reproduce

          it.<br>

          <br>

          I think this kind of behavior has already been reported a few

          times:<br>

          <a

href="https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html"

            rel="noreferrer" target="_blank" moz-do-not-send="true">https://lists.schedmd.com/pipermail/slurm-users/2018-April/000885.html</a><br>

          <a href="https://bugs.schedmd.com/show_bug.cgi?id=5300"

            rel="noreferrer" target="_blank" moz-do-not-send="true">https://bugs.schedmd.com/show_bug.cgi?id=5300</a><br>

          <br>

          As far as I can tell, it looks like this is probably systemd

          messing<br>

          with cgroups and deciding it's the king of cgroups on the

          host.<br>

          <br>

          You'll find more context and details in<br>

          <a href="https://bugs.schedmd.com/show_bug.cgi?id=5292"

            rel="noreferrer" target="_blank" moz-do-not-send="true">https://bugs.schedmd.com/show_bug.cgi?id=5292</a><br>

          <br>

          Cheers,<br>

          -- <br>

          Kilian<br>

          <br>

        </blockquote>

      </div>

    </blockquote>

    <br>

    <pre class="moz-signature" cols="72">-- 
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a class="moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de">wagner@itc.rwth-aachen.de</a>
<a class="moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de">www.itc.rwth-aachen.de</a>
</pre>

  </body>

</html>