[slurm-users] nvml autodetect is ignoring gpus

Fernando Guillén Camba fernando.guillen at usc.es
Wed Dec 1 08:34:02 UTC 2021


I also compiled Slurm 20.11.8 to get GPU support on AlmaLinux 8.4, but I 
don't have any problem with NVML detecting our A100s.

Maybe the NVML library version used when compiling Slurm has to match 
the library version on the compute node where the GPU is?

Also, I see that you're using Geforce_GTX. Could it be that NVML only 
supports Tesla GPUs?
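To check both guesses on your node, something like this might help (just a diagnostic sketch, assuming nvidia-smi and ldconfig are in standard locations):

```shell
# GPU model and driver/NVML version the node is actually running
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader

# Which libnvidia-ml the dynamic linker will offer to slurmd
ldconfig -p | grep libnvidia-ml
```

If the library version reported there differs from the one on the host where you compiled Slurm, that would support the first guess.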

This is my relevant Slurm configuration:

slurm.conf:

GresTypes=gpu,mps

NodeName=hpc-gpu[3-4].... Gres=gpu:A100:1

gres.conf:

NodeName=hpc-gpu[1-4] AutoDetect=nvml
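If autodetection keeps ignoring your cards in the meantime, you could fall back to listing the devices explicitly instead of using AutoDetect. This fragment is only illustrative (the node name is hypothetical; the Type string and device files are taken from your log, assuming four cards):

```
# gres.conf — explicit fallback instead of AutoDetect=nvml (hypothetical node name)
NodeName=yournode Name=gpu Type=geforce_gtx_1080 File=/dev/nvidia[0-3]
```

The matching slurm.conf NodeName line would then need Gres=gpu:geforce_gtx_1080:4.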


and the NVIDIA part:

NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5


and this is what I see in the log:

[2021-12-01T09:29:45.675] debug:  CPUs:64 Boards:1 Sockets:2 CoresPerSocket:32 ThreadsPerCore:1
[2021-12-01T09:29:45.675] debug:  gres/gpu: init: loaded
[2021-12-01T09:29:45.675] debug:  gres/mps: init: loaded
[2021-12-01T09:29:45.676] debug:  gpu/nvml: init: init: GPU NVML plugin loaded
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _nvml_init: Successfully initialized NVML
[2021-12-01T09:29:46.298] debug:  gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 495.29.05
[2021-12-01T09:29:46.298] debug:  gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.495.29.05
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _get_system_gpu_list_nvml: Total CPU count: 64
[2021-12-01T09:29:46.298] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device count: 1
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 0:
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Name: nvidia_a100-pcie-40gb
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     UUID: GPU-4cbb41e9-296b-ba72-d345-aa41fd7a8842
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Domain/Bus/Device: 0:33:0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     PCI Bus ID: 00000000:21:00.0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     NVLinks: -1
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Device File (minor number): /dev/nvidia0
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     CPU Affinity Range - Machine: 16-23
[2021-12-01T09:29:46.365] debug2: gpu/nvml: _get_system_gpu_list_nvml:     Core Affinity Range - Abstract: 16-23
[2021-12-01T09:29:46.365] debug2: Possible GPU Memory Frequencies (1):
[2021-12-01T09:29:46.365] debug2: -------------------------------
[2021-12-01T09:29:46.365] debug2:     *1215 MHz [0]
[2021-12-01T09:29:46.365] debug2:         Possible GPU Graphics Frequencies (81):
[2021-12-01T09:29:46.365] debug2:         ---------------------------------
[2021-12-01T09:29:46.365] debug2:           *1410 MHz [0]
[2021-12-01T09:29:46.365] debug2:           *1395 MHz [1]
[2021-12-01T09:29:46.365] debug2:           ...
[2021-12-01T09:29:46.365] debug2:           *810 MHz [40]
[2021-12-01T09:29:46.365] debug2:           ...
[2021-12-01T09:29:46.365] debug2:           *225 MHz [79]
[2021-12-01T09:29:46.365] debug2:           *210 MHz [80]
[2021-12-01T09:29:46.555] debug2: gpu/nvml: _nvml_shutdown: Successfully shut down NVML
[2021-12-01T09:29:46.555] gpu/nvml: _get_system_gpu_list_nvml: 1 GPU system device(s) detected
[2021-12-01T09:29:46.555] debug:  Gres GPU plugin: Normalizing gres.conf with system GPUs
[2021-12-01T09:29:46.555] debug2: gres/gpu: _normalize_gres_conf: gres_list_conf:
[2021-12-01T09:29:46.555] debug2:     GRES[gpu] Type:A100 Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE File:(null)
[2021-12-01T09:29:46.556] debug:  gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug2: gres/gpu: _normalize_gres_conf: gres_list_gpu
[2021-12-01T09:29:46.556] debug2:     GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug:  Gres GPU plugin: Final normalized gres.conf list:
[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug:  Gres MPS plugin: Initalized gres.conf list:
[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] debug:  Gres MPS plugin: Final gres.conf list:
[2021-12-01T09:29:46.556] debug:      GRES[gpu] Type:A100 Count:1 Cores(64):16-23  Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-12-01T09:29:46.556] Gres Name=gpu Type=A100 Count=1

Hope it helps.


On 30/11/21 at 16:12, Benjamin Nacar wrote:
> Hi,
>
> We're trying to use Slurm's built-in Nvidia GPU detection mechanism to avoid having to specify GPUs explicitly in slurm.conf and gres.conf. We're running Debian 11, and the version of Slurm available for Debian 11 is 20.11. However, the version of Slurm in the standard debian repositories was apparently not compiled on a system with the necessary Nvidia library installed, so we recompiled Slurm 20.11 from the Debian source package with no modifications.
>
> With AutoDetect=nvml in gres.conf and GresTypes=gpu in slurm.conf, this is what we see on a 4-GPU host after restarting slurmd:
>
> [2021-11-29T15:49:58.226] Node reconfigured socket/core boundaries SocketsPerBoard=12:2(hw) CoresPerSocket=1:6(hw)
> [2021-11-29T15:50:02.397] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
> [2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
> [2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
> [2021-11-29T15:50:02.399] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
> [2021-11-29T15:50:02.551] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
> [2021-11-29T15:50:02.551] gres/gpu: _normalize_gres_conf: WARNING: The following autodetected GPUs are being ignored:
> [2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
> [2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
> [2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
> [2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
> [2021-11-29T15:50:02.614] slurmd version 20.11.4 started
> [2021-11-29T15:50:02.630] slurmd started on Mon, 29 Nov 2021 15:50:02 -0500
> [2021-11-29T15:50:02.630] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=257840 TmpDisk=3951 Uptime=975072 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
>
> Doing an "scontrol show node" for this host displays "Gres=(null)", and any attempt to submit a job with --gpus=1 results in "srun: error: Unable to allocate resources: Requested node configuration is not available".
>
> Any idea what might be wrong?
>
> Thanks,
> ~~ bnacar
>
-- 
Fernando Guillén Camba
CiTIUS <http://citius.usc.es/>
Unidade de Xestión de Infraestruturas TIC
E-mail: fernando.guillen at usc.es · Phone: +34 881816409
Website: citius.usc.es <http://citius.usc.es> · Twitter: citiususc <http://twitter.com/citiususc>
