<div dir="ltr"><div>No problem! Glad it is working for you now.</div><div><br></div><div>Best,</div><div><br></div><div>-Sean<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Oct 27, 2022 at 1:46 PM Dominik Baack &lt;<a href="mailto:dominik.baack@cs.uni-dortmund.de">dominik.baack@cs.uni-dortmund.de</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
    <p>Thank you very much!</p>
    <p>Those were the missing settings!<br>
    </p>
    <p>I am not sure how I overlooked them for nearly two days, but I am
      happy that it's working now.</p>
    <p>Cheers<br>
      Dominik Baack</p>
    <p><br>
    </p>
    <div>On 27.10.2022 at 19:23, Sean
      Maxwell wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">
        <div>It looks like you are missing some of the slurm.conf
          entries related to enforcing the cgroup restrictions. I would
          go through the list here and verify/adjust your configuration:</div>
        <div><br>
        </div>
        <div><a href="https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf" target="_blank">https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf</a></div>
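        <div><br>
        </div>
        <div>For orientation, here is a rough sketch of the slurm.conf entries that the linked section of the cgroup.conf page describes for activating cgroup enforcement. It is not the poster's actual configuration, the thread does not say which of these entries were missing, and plugin names and defaults may differ between Slurm versions, so check them against the page for your release:</div>
        <pre>
# slurm.conf (illustrative sketch, not taken from this thread's attachment)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup   # optional, for gathering metrics
PrologFlags=Contain                       # X11 flag is also suggested
</pre>
        <div>ConstrainDevices=yes in cgroup.conf only takes effect once the cgroup task plugin is enabled in slurm.conf, which would be consistent with CUDA_VISIBLE_DEVICES being set correctly while all GPUs stay visible to nvidia-smi.</div>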
        <div><br>
        </div>
        <div>Best,</div>
        <div><br>
        </div>
        <div>-Sean<br>
        </div>
        <div><br>
        </div>
        <div><br>
        </div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Thu, Oct 27, 2022 at 1:04
          PM Dominik Baack &lt;<a href="mailto:dominik.baack@cs.uni-dortmund.de" target="_blank">dominik.baack@cs.uni-dortmund.de</a>&gt;
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div>
            <p>Hi,</p>
            <p>Yes, ConstrainDevices is set:</p>
            <p>###<br>
              # Slurm cgroup support configuration file<br>
              ###<br>
              CgroupAutomount=yes<br>
              #<br>
              #CgroupMountpoint="/sys/fs/cgroup"<br>
              ConstrainCores=yes<br>
              ConstrainDevices=yes<br>
              ConstrainRAMSpace=yes<br>
              #<br>
              #</p>
            <p>I have attached the Slurm configuration file as well.<br>
            </p>
            <p>Cheers<br>
              Dominik<br>
            </p>
            <div>On 27.10.2022 at 17:57, Sean Maxwell wrote:<br>
            </div>
            <blockquote type="cite">
              <div dir="ltr">
                <div>Hi Dominik,</div>
                <div><br>
                </div>
                <div>Do you have ConstrainDevices=yes set in your
                  cgroup.conf?</div>
                <div><br>
                </div>
                <div>Best,</div>
                <div><br>
                </div>
                <div>-Sean<br>
                </div>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Thu, Oct 27, 2022
                  at 11:49 AM Dominik Baack &lt;<a href="mailto:dominik.baack@cs.uni-dortmund.de" target="_blank">dominik.baack@cs.uni-dortmund.de</a>&gt;
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
                  <br>
                  We are in the process of setting up SLURM on some DGX
                  A100 nodes. We
                  are experiencing the problem that all GPUs are
                  available to users, even
                  for jobs that should be assigned only one.<br>
                  <br>
                  It seems the request is forwarded correctly to the
                  node: at least
                  CUDA_VISIBLE_DEVICES is set to the correct ID, but it
                  is ignored by the rest
                  of the system.<br>
                  <br>
                  Cheers<br>
                  Dominik Baack<br>
                  <br>
                  Example:<br>
                  <br>
<pre>
baack@gwkilab:~$ srun --gpus=1 nvidia-smi
Thu Oct 27 17:39:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
</pre>
                  <br>
                  <br>
                </blockquote>
              </div>
            </blockquote>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </div>

</blockquote></div>