[slurm-users] GPU Allocation does not limit number of available GPUs in job

Dominik Baack dominik.baack at cs.uni-dortmund.de
Thu Oct 27 15:47:25 UTC 2022


Hi,

We are in the process of setting up SLURM on some DGX A100 nodes. We
are experiencing the problem that all GPUs are visible to users, even
in jobs that should be assigned only one.

It seems the request is forwarded correctly to the node: at least
CUDA_VISIBLE_DEVICES is set to the correct ID. It just appears to be
ignored by the rest of the system.
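For reference, my understanding is that setting CUDA_VISIBLE_DEVICES alone is only advisory, and that hard device isolation is normally enforced through cgroups. This is roughly the configuration I would expect to be needed (a sketch of the usual settings, not our actual config):

```
# cgroup.conf (on the compute nodes) -- restrict jobs to their allocated devices
ConstrainDevices=yes

# slurm.conf -- the cgroup task plugin must be active for the constraint to apply
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup
```

Is something along these lines required, or should the GRES allocation be enforced another way?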

Cheers
Dominik Baack

Example:

baack@gwkilab:~$ srun --gpus=1 nvidia-smi
Thu Oct 27 17:39:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
