[slurm-users] GPU Allocation does not limit number of available GPUs in job

Sean Maxwell stm at case.edu
Thu Oct 27 15:57:07 UTC 2022


Hi Dominik,

Do you have ConstrainDevices=yes set in your cgroup.conf?
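
For context: nvidia-smi does not honor CUDA_VISIBLE_DEVICES, so it lists
every GPU the process is able to open. Hiding the unallocated GPUs from a
job takes device cgroups. Below is a minimal sketch of the settings
involved, assuming a cgroup-based setup; the device paths and the 8-GPU
range are assumptions based on your nvidia-smi output, not taken from your
actual configuration.

```conf
# cgroup.conf
ConstrainDevices=yes

# slurm.conf (relevant lines only; the node definition also needs a
# matching Gres= entry)
TaskPlugin=task/cgroup
GresTypes=gpu

# gres.conf on the GPU node (assumed device files)
Name=gpu File=/dev/nvidia[0-7]
```

slurmd on the node will likely need a restart to pick up the change. If the
constraint is in effect, rerunning "srun --gpus=1 nvidia-smi" should list
only the allocated GPU.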

Best,

-Sean

On Thu, Oct 27, 2022 at 11:49 AM Dominik Baack <
dominik.baack at cs.uni-dortmund.de> wrote:

> Hi,
>
> We are in the process of setting up SLURM on some DGX A100 nodes. The
> problem we are seeing is that all GPUs are available to users, even in
> jobs where only one should be assigned.
>
> It seems the request is forwarded correctly to the node: at least
> CUDA_VISIBLE_DEVICES is set to the correct ID, but the rest of the
> system appears to ignore it.
>
> Cheers
> Dominik Baack
>
> Example:
>
> baack at gwkilab:~$ srun --gpus=1 nvidia-smi
> Thu Oct 27 17:39:04 2022
>
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |                               |                      |               MIG M. |
> |===============================+======================+======================|
> |   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
> | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
> | N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
> | N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
> | N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
> | N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
> | N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
> | N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
> | N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                                  |
> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
> |        ID   ID                                                   Usage      |
> |=============================================================================|
> |  No running processes found                                                 |
> +-----------------------------------------------------------------------------+
>
>
>

