[slurm-users] nvml autodetect is ignoring gpus

Benjamin Nacar benjamin_nacar at brown.edu
Tue Nov 30 15:12:42 UTC 2021


Hi,

We're trying to use Slurm's built-in Nvidia GPU detection mechanism to avoid having to specify GPUs explicitly in slurm.conf and gres.conf. We're running Debian 11, and the version of Slurm available for Debian 11 is 20.11. However, the version of Slurm in the standard debian repositories was apparently not compiled on a system with the necessary Nvidia library installed, so we recompiled Slurm 20.11 from the Debian source package with no modifications.

With AutoDetect=nvml in gres.conf and GresTypes=gpu in slurm.conf, this is what we see on a 4-GPU host after restarting slurmd:

[2021-11-29T15:49:58.226] Node reconfigured socket/core boundaries SocketsPerBoard=12:2(hw) CoresPerSocket=1:6(hw)
[2021-11-29T15:50:02.397] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.399] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2021-11-29T15:50:02.551] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
[2021-11-29T15:50:02.551] gres/gpu: _normalize_gres_conf: WARNING: The following autodetected GPUs are being ignored:
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
[2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2021-11-29T15:50:02.614] slurmd version 20.11.4 started
[2021-11-29T15:50:02.630] slurmd started on Mon, 29 Nov 2021 15:50:02 -0500
[2021-11-29T15:50:02.630] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=257840 TmpDisk=3951 Uptime=975072 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

Doing an "scontrol show node" for this host displays "Gres=(null)", and any attempts to submit a job with --gpus=1 results in "srun: error: Unable to allocate resources: Requested node configuration is not available".

Any idea what might be wrong?

Thanks,
~~ bnacar

-- 
Benjamin Nacar
Systems Programmer
Computer Science Department
Brown University
401.863.7621



More information about the slurm-users mailing list