[slurm-users] GPU Allocation does not limit number of available GPUs in job
Dominik Baack
dominik.baack at cs.uni-dortmund.de
Thu Oct 27 15:47:25 UTC 2022
Hi,
We are in the process of setting up SLURM on some DGX A100 nodes . We
are experiencing the problem that all GPUs are available for users, even
for jobs where only one should be assigned.
It seems the requirement is forwarded correctly to the node, at least
CUDA_VISIBLE_DEVICES is set to the correct id only discarded by the rest
of the system.
Cheers
Dominik Baack
Example:
baack at gwkilab:~$ srun --gpus=1 nvidia-smi
Thu Oct 27 17:39:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version:
11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile
Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util
Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:07:00.0 Off
| 0 |
| N/A 28C P0 52W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:0F:00.0 Off
| 0 |
| N/A 28C P0 51W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:47:00.0 Off
| 0 |
| N/A 28C P0 52W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:4E:00.0 Off
| 0 |
| N/A 29C P0 54W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:87:00.0 Off
| 0 |
| N/A 34C P0 57W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:90:00.0 Off
| 0 |
| N/A 31C P0 55W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:B7:00.0 Off
| 0 |
| N/A 31C P0 51W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:BD:00.0 Off
| 0 |
| N/A 32C P0 52W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes
found |
+-----------------------------------------------------------------------------+
More information about the slurm-users
mailing list