<div dir="ltr"><div>No problem! Glad it is working for you now.</div><div><br></div><div>Best,</div><div><br></div><div>-Sean<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Oct 27, 2022 at 1:46 PM Dominik Baack &lt;<a href="mailto:dominik.baack@cs.uni-dortmund.de">dominik.baack@cs.uni-dortmund.de</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Thank you very much!</p>
<p>Those were the missing settings!<br>
</p>
<p></p>
<p>I am not sure how I overlooked it for nearly two days, but I am
happy that it's working now.</p>
<p>Cheers<br>
Dominik Baack</p>
<p><br>
</p>
<div>On 27.10.2022 at 19:23, Sean
Maxwell wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>It looks like you are missing some of the slurm.conf
entries related to enforcing the cgroup restrictions. I would
go through the list here and verify/adjust your configuration:</div>
<div><br>
</div>
<div><a href="https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf" target="_blank">https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf</a></div>
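<div><br>
</div>
<div>As a minimal sketch, assuming a standard cgroup and GPU GRES setup (the thread does not state which entries were actually missing), the slurm.conf settings in question typically look like this:</div>
<pre>
# slurm.conf (illustrative values; adjust to your site)
ProctrackType=proctrack/cgroup        # track job processes via cgroups
TaskPlugin=task/cgroup,task/affinity  # enforce cgroup.conf constraints on tasks
GresTypes=gpu                         # declare GPUs as a generic resource
</pre>
<div>With ConstrainDevices=yes in cgroup.conf, the task/cgroup plugin is what restricts a job to its allocated GPU devices; CUDA_VISIBLE_DEVICES alone does not stop tools such as nvidia-smi from seeing the others.</div>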
<div><br>
</div>
<div>Best,</div>
<div><br>
</div>
<div>-Sean<br>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Oct 27, 2022 at 1:04
PM Dominik Baack &lt;<a href="mailto:dominik.baack@cs.uni-dortmund.de" target="_blank">dominik.baack@cs.uni-dortmund.de</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi,</p>
<p>Yes, ConstrainDevices is set:</p>
<p>###<br>
# Slurm cgroup support configuration file<br>
###<br>
CgroupAutomount=yes<br>
#<br>
#CgroupMountpoint="/sys/fs/cgroup"<br>
ConstrainCores=yes<br>
ConstrainDevices=yes<br>
ConstrainRAMSpace=yes<br>
#<br>
#</p>
<p>I have attached the Slurm configuration file as well.<br>
</p>
<p>Cheers<br>
Dominik<br>
</p>
<div>On 27.10.2022 at 17:57, Sean Maxwell wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Hi Dominik,</div>
<div><br>
</div>
<div>Do you have ConstrainDevices=yes set in your
cgroup.conf?</div>
<div><br>
</div>
<div>Best,</div>
<div><br>
</div>
<div>-Sean<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Oct 27, 2022
at 11:49 AM Dominik Baack &lt;<a href="mailto:dominik.baack@cs.uni-dortmund.de" target="_blank">dominik.baack@cs.uni-dortmund.de</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
<br>
We are in the process of setting up SLURM on some DGX A100 nodes. We
are experiencing the problem that all GPUs are available to users, even
for jobs to which only one should be assigned.<br>
<br>
The request seems to be forwarded correctly to the node: at least
CUDA_VISIBLE_DEVICES is set to the correct ID, but it is ignored by the
rest of the system.<br>
<br>
Cheers<br>
Dominik Baack<br>
<br>
Example:<br>
<br>
<pre>
baack@gwkilab:~$ srun --gpus=1 nvidia-smi
Thu Oct 27 17:39:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
</pre>
<br>
<br>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote></div>