<div dir="ltr"><div>Hi Dominik,</div><div><br></div><div>Do you have ConstrainDevices=yes set in your cgroup.conf?</div><div><br></div><div>Best,</div><div><br></div><div>-Sean<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Oct 27, 2022 at 11:49 AM Dominik Baack <<a href="mailto:dominik.baack@cs.uni-dortmund.de">dominik.baack@cs.uni-dortmund.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
<br>
We are in the process of setting up SLURM on some DGX A100 nodes. The <br>
problem is that all GPUs are visible to users, even in jobs that should <br>
be assigned only one.<br>
<br>
The request seems to be forwarded correctly to the node: <br>
CUDA_VISIBLE_DEVICES is set to the correct ID, but the rest of the <br>
system ignores it.<br>
<br>
Cheers<br>
Dominik Baack<br>
<br>
Example:<br>
<br>
baack@gwkilab:~$ srun --gpus=1 nvidia-smi<br>
Thu Oct 27 17:39:04 2022<br>
+-----------------------------------------------------------------------------+<br>
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |<br>
|-------------------------------+----------------------+----------------------+<br>
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |<br>
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |<br>
|                               |                      |               MIG M. |<br>
|===============================+======================+======================|<br>
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |<br>
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |<br>
|                               |                      |             Disabled |<br>
+-------------------------------+----------------------+----------------------+<br>
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |<br>
| N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |<br>
|                               |                      |             Disabled |<br>
+-------------------------------+----------------------+----------------------+<br>
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |<br>
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |<br>
|                               |                      |             Disabled |<br>
+-------------------------------+----------------------+----------------------+<br>
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |<br>
| N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |<br>
|                               |                      |             Disabled |<br>
+-------------------------------+----------------------+----------------------+<br>
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |<br>
| N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |<br>
|                               |                      |             Disabled |<br>
+-------------------------------+----------------------+----------------------+<br>
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |<br>
| N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |<br>
|                               |                      |             Disabled |<br>
+-------------------------------+----------------------+----------------------+<br>
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |<br>
| N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |<br>
|                               |                      |             Disabled |<br>
+-------------------------------+----------------------+----------------------+<br>
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |<br>
| N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |<br>
|                               |                      |             Disabled |<br>
+-------------------------------+----------------------+----------------------+<br>
<br>
+-----------------------------------------------------------------------------+<br>
| Processes:                                                                  |<br>
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |<br>
|        ID   ID                                                   Usage      |<br>
|=============================================================================|<br>
|  No running processes found                                                 |<br>
+-----------------------------------------------------------------------------+<br>
<br>
<br>
</blockquote></div>
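<div dir="ltr"><div><br></div><div>For reference, a minimal sketch of the settings the question above points at. It assumes the cluster uses Slurm's cgroup task plugin and that the eight GPUs appear as /dev/nvidia0 through /dev/nvidia7; neither is confirmed in this thread, so treat the values as placeholders to check against the actual node.</div>
<pre>
# cgroup.conf
# Restrict each job to the device files it was allocated.
ConstrainDevices=yes

# slurm.conf (assumed: cgroup-based task and process tracking plugins)
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup
GresTypes=gpu

# gres.conf (assumed device paths; the device files must be listed
# so the cgroup plugin knows which ones to allow or deny)
Name=gpu File=/dev/nvidia[0-7]
</pre>
<div>Without device constraints, CUDA_VISIBLE_DEVICES is only an environment variable: CUDA applications honor it, but nvidia-smi does not, so it can still list every GPU on the node.</div></div>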