<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Thanks Christoph and others for the help.</p>
<p>It turns out the fix was simply the cgroup settings I had mostly
configured months ago; I had even left myself a note to uncomment
ConstrainDevices=yes in cgroup.conf once the GPU systems came
online.</p>
<p>I kept racking my brain over why the gres settings weren't
constraining anything, even though Slurm was setting the number of
requested GPUs correctly.</p>
<p>Everything is working as expected now.<br>
</p>
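<p>For reference, the relevant piece ended up being the cgroup
configuration rather than gres.conf itself. A minimal sketch of the
settings involved (parameter names are standard Slurm options; the
exact set of Constrain* lines you want may differ):</p>
<pre>
# cgroup.conf -- sketch, not the full production file
ConstrainDevices=yes     # confine jobs to the GPU device files they were allocated
ConstrainCores=yes       # optional: also confine jobs to their allocated cores
ConstrainRAMSpace=yes    # optional: also enforce memory limits

# slurm.conf also needs the cgroup task plugin for this to take effect:
#   TaskPlugin=task/cgroup
</pre>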
<div class="moz-signature">
<table cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td width="150" valign="top" height="30" align="left">
<p style="font-size:14px;">Willy Markuske</p>
</td>
</tr>
<tr>
<td style="border-right: 1px solid #000;" align="left">
<p style="font-size:12px;">HPC Systems Engineer</p>
</td>
<td rowspan="3" width="180" valign="center" height="42"
align="center"><tt><img moz-do-not-send="false"
src="cid:part1.81E99EA1.445EFD52@sdsc.edu" alt=""
width="168" height="48"></tt> </td>
</tr>
<tr>
<td style="border-right: 1px solid #000;" align="left">
<p style="font-size:12px;">Research Data Services</p>
</td>
</tr>
<tr>
<td style="border-right: 1px solid #000;" align="left">
<p style="font-size:12px;">P: (858) 246-5593</p>
</td>
</tr>
</tbody>
</table>
<p> </p>
</div>
<div class="moz-cite-prefix">On 8/25/20 8:24 AM, Christoph Brüning
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:f026a81e-0bcf-e3b7-17fb-28ed62eb0bd1@uni-wuerzburg.de">Hello,
<br>
<br>
we're using cgroups to restrict access to the GPUs.
<br>
<br>
What I found particularly helpful are the slides by Marshall
Garey from last year's Slurm User Group Meeting:
<a class="moz-txt-link-freetext" href="https://slurm.schedmd.com/SLUG19/cgroups_and_pam_slurm_adopt.pdf">https://slurm.schedmd.com/SLUG19/cgroups_and_pam_slurm_adopt.pdf</a>
(NVML didn't work for us for some reason I cannot recall, but
listing the GPU device files explicitly was not a big deal)
<br>
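<p>Roughly, an explicit gres.conf entry looks like the following; the
node names, GPU type and device count here are placeholders rather
than our actual configuration:</p>
<pre>
# gres.conf -- explicit device files instead of AutoDetect=nvml
NodeName=gpunode[01-02] Name=gpu Type=v100 File=/dev/nvidia[0-3]
</pre>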
<br>
Best,
<br>
Christoph
<br>
<br>
<br>
On 25/08/2020 16.12, Willy Markuske wrote:
<br>
<blockquote type="cite">Hello,
<br>
<br>
I'm trying to restrict access to GPU resources on a cluster I
maintain for a research group. There are two nodes in a
partition with gres GPU resources defined. Users can access these
resources by submitting their jobs to the gpu partition and
specifying a gres=gpu request.
<br>
<br>
When a user includes the flag --gres=gpu:#, Slurm properly
allocates the requested number of GPUs. If a user requests only
one GPU, they only see CUDA_VISIBLE_DEVICES=1.
However, if a user does not include the --gres=gpu:# flag, they
can still submit a job to the partition and can then see
all the GPUs. This has led to some bad actors running jobs on
GPUs that other users have allocated, causing OOM errors on
those GPUs.
<br>
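<p>Concretely, the behavior looks like this (the partition name is
just illustrative):</p>
<pre>
# GPU requested explicitly: CUDA_VISIBLE_DEVICES is set for the job
$ srun -p gpu --gres=gpu:1 bash -c 'echo $CUDA_VISIBLE_DEVICES; nvidia-smi -L'

# No --gres flag: the job still runs in the partition and
# nvidia-smi lists every GPU in the node
$ srun -p gpu nvidia-smi -L
</pre>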
<br>
Is it possible to require users to specify --gres=gpu:# in order
to submit to a partition, and where would I find the
documentation for doing so? So far, reading the gres
documentation hasn't yielded anything on this issue
specifically.
<br>
<br>
Regards,
<br>
<br>
-- <br>
<br>
Willy Markuske
<br>
<br>
HPC Systems Engineer
<br>
<br>
<br>
<br>
Research Data Services
<br>
<br>
P: (858) 246-5593
<br>
<br>
</blockquote>
<br>
</blockquote>
</body>
</html>