[slurm-users] GPU Allocation does not limit number of available GPUs in job

Sean Maxwell stm at case.edu
Thu Oct 27 17:56:40 UTC 2022


No problem! Glad it is working for you now.

Best,

-Sean

On Thu, Oct 27, 2022 at 1:46 PM Dominik Baack <
dominik.baack at cs.uni-dortmund.de> wrote:

> Thank you very much!
>
> Those were the missing settings!
>
> I am not sure how I overlooked it for nearly two days, but I am happy that
> it's working now.
>
> Cheers
> Dominik Baack
>
>
> On 27.10.2022 at 19:23, Sean Maxwell wrote:
>
> It looks like you are missing some of the slurm.conf entries related to
> enforcing the cgroup restrictions. I would go through the list here and
> verify/adjust your configuration:
>
> https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf
>
> Best,
>
> -Sean
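>
> For reference, a minimal sketch of the slurm.conf entries that page
> describes for enabling cgroup enforcement. This is an assumption about a
> typical setup, not the poster's actual file; the thread does not say which
> entries were missing. GresTypes=gpu is likewise assumed, on the basis that
> the GPUs are defined as a GRES.
>
> ```
> # slurm.conf (sketch, adjust to your site)
> ProctrackType=proctrack/cgroup           # track job processes with cgroups
> TaskPlugin=task/cgroup,task/affinity     # task/cgroup applies the Constrain* settings from cgroup.conf
> JobAcctGatherType=jobacct_gather/cgroup  # optional, for gathering metrics
> PrologFlags=Contain                      # create the job container at allocation time
> GresTypes=gpu                            # assumed: GPUs are configured as a generic resource
> ```
>
> ConstrainDevices=yes in cgroup.conf is enforced by the task/cgroup plugin,
> so it has no effect unless task/cgroup is listed in TaskPlugin.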
>
>
>
> On Thu, Oct 27, 2022 at 1:04 PM Dominik Baack <
> dominik.baack at cs.uni-dortmund.de> wrote:
>
>> Hi,
>>
>> Yes, ConstrainDevices is set:
>>
>> ###
>> # Slurm cgroup support configuration file
>> ###
>> CgroupAutomount=yes
>> #
>> #CgroupMountpoint="/sys/fs/cgroup"
>> ConstrainCores=yes
>> ConstrainDevices=yes
>> ConstrainRAMSpace=yes
>> #
>> #
>>
>> I have attached the Slurm configuration file as well.
>>
>> Cheers
>> Dominik
>> On 27.10.2022 at 17:57, Sean Maxwell wrote:
>>
>> Hi Dominik,
>>
>> Do you have ConstrainDevices=yes set in your cgroup.conf?
>>
>> Best,
>>
>> -Sean
>>
>> On Thu, Oct 27, 2022 at 11:49 AM Dominik Baack <
>> dominik.baack at cs.uni-dortmund.de> wrote:
>>
>>> Hi,
>>>
>>> We are in the process of setting up SLURM on some DGX A100 nodes. We
>>> are experiencing the problem that all GPUs are available to users, even
>>> for jobs where only one should be assigned.
>>>
>>> It seems the request is forwarded correctly to the node: at least
>>> CUDA_VISIBLE_DEVICES is set to the correct ID, but it is ignored by the
>>> rest of the system.
>>>
>>> Cheers
>>> Dominik Baack
>>>
>>> Example:
>>>
>>> baack@gwkilab:~$ srun --gpus=1 nvidia-smi
>>> Thu Oct 27 17:39:04 2022
>>> +-----------------------------------------------------------------------------+
>>> | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
>>> |-------------------------------+----------------------+----------------------|
>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>> |                               |                      |               MIG M. |
>>> |===============================+======================+======================|
>>> |   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
>>> | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
>>> | N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
>>> | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
>>> | N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
>>> | N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
>>> | N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
>>> | N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>> |   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
>>> | N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>>> |                               |                      |             Disabled |
>>> +-------------------------------+----------------------+----------------------+
>>>
>>> +-----------------------------------------------------------------------------+
>>> | Processes:                                                                  |
>>> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
>>> |        ID   ID                                                   Usage      |
>>> |=============================================================================|
>>> |  No running processes found                                                 |
>>> +-----------------------------------------------------------------------------+
>>>
>>>
>>>
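
One way to confirm that device constraints are being enforced is to count the GPUs a job step can actually see. CUDA_VISIBLE_DEVICES is only an environment variable, and nvidia-smi does not honor it, which is why the output quoted above still lists all eight GPUs. The following is a minimal shell sketch, assuming `nvidia-smi -L` prints one `GPU <index>: ...` line per visible device; the helper name `count_visible_gpus` is hypothetical and not part of Slurm or the NVIDIA tools.

```shell
# Count the GPUs listed on stdin in `nvidia-smi -L` format
# (one "GPU <index>: ..." line per visible device).
count_visible_gpus() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps
    # that case from being treated as a failure.
    grep -c '^GPU [0-9]' || true
}

# Usage inside an allocation (requires Slurm and the NVIDIA driver):
#   srun --gpus=1 nvidia-smi -L | count_visible_gpus
```

With the cgroup device restrictions active, a one-GPU request should report 1; a result equal to the total number of GPUs in the node suggests the restrictions are not being applied.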