[slurm-users] GPU Allocation does not limit number of available GPUs in job

Dominik Baack dominik.baack at cs.uni-dortmund.de
Thu Oct 27 17:46:21 UTC 2022


Thank you very much!

Those were the missing settings!

I am not sure how I overlooked it for nearly two days, but I am happy 
that it's working now.
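
For reference, the entries I was missing were the slurm.conf settings 
that actually enforce the cgroup limits. A minimal sketch (option names 
from the page Sean linked; exact values depend on your setup):

    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup
    GresTypes=gpu

With these in place alongside ConstrainDevices=yes in cgroup.conf, 
srun --gpus=1 nvidia-smi reports only the single allocated GPU instead 
of all eight.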

Cheers
Dominik Baack


On 27.10.2022 at 19:23, Sean Maxwell wrote:
> It looks like you are missing some of the slurm.conf entries related 
> to enforcing the cgroup restrictions. I would go through the list here 
> and verify/adjust your configuration:
>
> https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf
>
> Best,
>
> -Sean
>
>
>
> On Thu, Oct 27, 2022 at 1:04 PM Dominik Baack 
> <dominik.baack at cs.uni-dortmund.de> wrote:
>
>     Hi,
>
>     yes, ConstrainDevices is set:
>
>     ###
>     # Slurm cgroup support configuration file
>     ###
>     CgroupAutomount=yes
>     #
>     #CgroupMountpoint="/sys/fs/cgroup"
>     ConstrainCores=yes
>     ConstrainDevices=yes
>     ConstrainRAMSpace=yes
>     #
>     #
>
>     I have attached the slurm configuration file as well.
>
>     Cheers
>     Dominik
>
>     On 27.10.2022 at 17:57, Sean Maxwell wrote:
>>     Hi Dominik,
>>
>>     Do you have ConstrainDevices=yes set in your cgroup.conf?
>>
>>     Best,
>>
>>     -Sean
>>
>>     On Thu, Oct 27, 2022 at 11:49 AM Dominik Baack
>>     <dominik.baack at cs.uni-dortmund.de> wrote:
>>
>>         Hi,
>>
>>         We are in the process of setting up SLURM on some DGX A100
>>         nodes. We are experiencing the problem that all GPUs are
>>         visible to users, even in jobs that should be assigned only
>>         one.
>>
>>         It seems the request is forwarded correctly to the node: at
>>         least CUDA_VISIBLE_DEVICES is set to the correct ID, but it
>>         is apparently ignored by the rest of the system.
>>
>>         Cheers
>>         Dominik Baack
>>
>>         Example:
>>
>>         baack at gwkilab:~$ srun --gpus=1 nvidia-smi
>>         Thu Oct 27 17:39:04 2022
>>         +-----------------------------------------------------------------------------+
>>         | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
>>         |-------------------------------+----------------------+----------------------+
>>         | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>         | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>         |                               |                      |               MIG M. |
>>         |===============================+======================+======================|
>>         |   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
>>         | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>>         |                               |                      |             Disabled |
>>         +-------------------------------+----------------------+----------------------+
>>         |   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
>>         | N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
>>         |                               |                      |             Disabled |
>>         +-------------------------------+----------------------+----------------------+
>>         |   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
>>         | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>>         |                               |                      |             Disabled |
>>         +-------------------------------+----------------------+----------------------+
>>         |   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
>>         | N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
>>         |                               |                      |             Disabled |
>>         +-------------------------------+----------------------+----------------------+
>>         |   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
>>         | N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
>>         |                               |                      |             Disabled |
>>         +-------------------------------+----------------------+----------------------+
>>         |   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
>>         | N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
>>         |                               |                      |             Disabled |
>>         +-------------------------------+----------------------+----------------------+
>>         |   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
>>         | N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
>>         |                               |                      |             Disabled |
>>         +-------------------------------+----------------------+----------------------+
>>         |   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
>>         | N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>>         |                               |                      |             Disabled |
>>         +-------------------------------+----------------------+----------------------+
>>
>>         +-----------------------------------------------------------------------------+
>>         | Processes:                                                                  |
>>         |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
>>         |        ID   ID                                                   Usage      |
>>         |=============================================================================|
>>         |  No running processes found                                                 |
>>         +-----------------------------------------------------------------------------+
>>
>>