[slurm-users] GPU Allocation does not limit number of available GPUs in job
Dominik Baack
dominik.baack at cs.uni-dortmund.de
Thu Oct 27 17:46:21 UTC 2022
Thank you very much!
Those were the missing settings!
I am not sure how I overlooked them for nearly two days, but I am happy
that it's working now.
Cheers
Dominik Baack
On 27.10.2022 at 19:23, Sean Maxwell wrote:
> It looks like you are missing some of the slurm.conf entries related
> to enforcing the cgroup restrictions. I would go through the list here
> and verify/adjust your configuration:
>
> https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf
>
> Best,
>
> -Sean
>
>
>
> On Thu, Oct 27, 2022 at 1:04 PM Dominik Baack
> <dominik.baack at cs.uni-dortmund.de> wrote:
>
> Hi,
>
> Yes, ConstrainDevices is set:
>
> ###
> # Slurm cgroup support configuration file
> ###
> CgroupAutomount=yes
> #
> #CgroupMountpoint="/sys/fs/cgroup"
> ConstrainCores=yes
> ConstrainDevices=yes
> ConstrainRAMSpace=yes
> #
> #
>
> I have attached the Slurm configuration file as well.
>
> Cheers
> Dominik
>
> On 27.10.2022 at 17:57, Sean Maxwell wrote:
>> Hi Dominik,
>>
>> Do you have ConstrainDevices=yes set in your cgroup.conf?
>>
>> Best,
>>
>> -Sean
>>
>> On Thu, Oct 27, 2022 at 11:49 AM Dominik Baack
>> <dominik.baack at cs.uni-dortmund.de> wrote:
>>
>> Hi,
>>
>> We are in the process of setting up SLURM on some DGX A100 nodes. We
>> are experiencing the problem that all GPUs are available to users,
>> even for jobs where only one should be assigned.
>>
>> It seems the requirement is forwarded correctly to the node: at least
>> CUDA_VISIBLE_DEVICES is set to the correct ID, but it is ignored by
>> the rest of the system.
>>
>> Cheers
>> Dominik Baack
>>
>> Example:
>>
>> baack@gwkilab:~$ srun --gpus=1 nvidia-smi
>> Thu Oct 27 17:39:04 2022
>> +-----------------------------------------------------------------------------+
>> | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
>> |-------------------------------+----------------------+----------------------+
>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>> |                               |                      |               MIG M. |
>> |===============================+======================+======================|
>> |   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
>> | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
>> | N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
>> | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
>> | N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
>> | N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
>> | N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
>> | N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>> |   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
>> | N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> +-------------------------------+----------------------+----------------------+
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                                  |
>> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
>> |        ID   ID                                                   Usage      |
>> |=============================================================================|
>> |  No running processes found                                                 |
>> +-----------------------------------------------------------------------------+
>>
>>
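
For readers landing on this thread: the page Sean linked lists slurm.conf
entries that need to be present alongside cgroup.conf before
ConstrainDevices=yes has any effect. The thread does not say which of them
were missing here, so the fragment below is only a sketch of what that
section of the documentation suggests, to be checked against the man page
for the installed Slurm version. The GresTypes line and all comments are
assumptions about a typical GPU node, not Dominik's actual configuration.

```conf
###
# slurm.conf excerpt (illustrative sketch, not the poster's file)
###
# Track and contain job processes with cgroups
ProctrackType=proctrack/cgroup
# task/cgroup is the plugin that applies the Constrain* settings from cgroup.conf
TaskPlugin=task/cgroup,task/affinity
# Optional: gather accounting metrics through cgroups
JobAcctGatherType=jobacct_gather/cgroup
# Assumption: GPUs are scheduled as a generic resource, with the device
# files listed (or auto-detected) in gres.conf
GresTypes=gpu
```

After restarting slurmctld and the slurmd daemons, the
`srun --gpus=1 nvidia-smi` test from the original post should list one GPU
instead of eight if device constraint is being enforced.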