[slurm-users] nvml autodetect is ignoring gpus
Quirin Lohr
quirin.lohr at in.tum.de
Wed Dec 1 13:05:09 UTC 2021
Hi,
You still need to specify the GPUs in the node definition in slurm.conf.
At least the count, and possibly also the type reported by NVML, must
match the node definition (e.g. Gres=gpu:geforce_gtx_1080:4).
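
A minimal sketch of a matching pair of configs, assuming a hypothetical
node name gpu01 and the hardware reported in your log below:

    # slurm.conf -- the node definition must still declare the GPUs
    GresTypes=gpu
    NodeName=gpu01 Gres=gpu:geforce_gtx_1080:4 CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=257840

    # gres.conf -- NVML then fills in File, Cores and Links per device
    AutoDetect=nvml
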
I think the _nvml_get_mem_freqs errors can be ignored; the GTX 1080
simply does not support querying its supported memory frequencies.
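
Once the count (and, if given, the type) in the node definition matches
what NVML detects, the GPUs should stop being ignored. A quick check,
again with the hypothetical node name:

    # restart slurmctld/slurmd after editing slurm.conf
    scontrol show node gpu01 | grep -i gres
    # should now report Gres=gpu:geforce_gtx_1080:4 instead of Gres=(null)
    srun --gpus=1 nvidia-smi -L
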
On 30.11.2021 at 16:12, Benjamin Nacar wrote:
> Hi,
>
> We're trying to use Slurm's built-in Nvidia GPU detection mechanism to avoid having to specify GPUs explicitly in slurm.conf and gres.conf. We're running Debian 11, and the version of Slurm available for Debian 11 is 20.11. However, the version of Slurm in the standard Debian repositories was apparently not compiled on a system with the necessary Nvidia library installed, so we recompiled Slurm 20.11 from the Debian source package with no modifications.
>
> With AutoDetect=nvml in gres.conf and GresTypes=gpu in slurm.conf, this is what we see on a 4-GPU host after restarting slurmd:
>
> [2021-11-29T15:49:58.226] Node reconfigured socket/core boundaries SocketsPerBoard=12:2(hw) CoresPerSocket=1:6(hw)
> [2021-11-29T15:50:02.397] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
> [2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
> [2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
> [2021-11-29T15:50:02.399] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
> [2021-11-29T15:50:02.551] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
> [2021-11-29T15:50:02.551] gres/gpu: _normalize_gres_conf: WARNING: The following autodetected GPUs are being ignored:
> [2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11 Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
> [2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11 Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
> [2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11 Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
> [2021-11-29T15:50:02.551] GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11 Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
> [2021-11-29T15:50:02.614] slurmd version 20.11.4 started
> [2021-11-29T15:50:02.630] slurmd started on Mon, 29 Nov 2021 15:50:02 -0500
> [2021-11-29T15:50:02.630] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=257840 TmpDisk=3951 Uptime=975072 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
>
> Doing an "scontrol show node" for this host displays "Gres=(null)", and any attempt to submit a job with --gpus=1 results in "srun: error: Unable to allocate resources: Requested node configuration is not available".
>
> Any idea what might be wrong?
>
> Thanks,
> ~~ bnacar
>
--
Quirin Lohr
System Administration
Technische Universität München
Fakultät für Informatik
Lehrstuhl für Bildverarbeitung und Künstliche Intelligenz
Boltzmannstrasse 3
85748 Garching
Tel. +49 89 289 17769
Fax +49 89 289 17757
quirin.lohr at in.tum.de
www.vision.in.tum.de