[slurm-users] nvml autodetect is ignoring gpus

Benjamin Nacar benjamin_nacar at brown.edu
Wed Dec 1 13:42:06 UTC 2021


Confirmed that adding just the "Gres=" bit in slurm.conf works. That's what I get for reading the documentation too fast... thanks all!
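
For the archive, a minimal sketch of the kind of configuration this refers to. The node name gpunode01 is hypothetical, the hardware values are copied from the slurmd log quoted below, and the typed Gres string is the one Quirin suggested; whether the type has to be spelled out, or the count alone is enough, is not confirmed here.

```
# gres.conf on the GPU node: let slurmd discover the devices through NVML
AutoDetect=nvml

# slurm.conf: the node definition still has to declare the GPUs
GresTypes=gpu
# NodeName is a placeholder; the socket/core/memory values come from the slurmd log
NodeName=gpunode01 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=257840 Gres=gpu:geforce_gtx_1080:4
```

Without a Gres= entry on the node line, the autodetected devices have nothing to match against, which fits the "autodetected GPUs are being ignored" warning and the Gres=(null) output described below.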

~~ bnacar

On Wed, 1 Dec 2021 14:05:09 +0100
Quirin Lohr <quirin.lohr at in.tum.de> wrote:

> Hi,
> 
> You still need to specify the GPUs in the node definition in slurm.conf.
> At least the count, and perhaps even the type reported by NVML, must match
> the node definition (e.g. Gres=gpu:geforce_gtx_1080:4).
> 
> I think the error message can be ignored; the GTX 1080 just does not
> support querying the supported memory frequencies.
> 
> 
> On 30.11.2021 at 16:12, Benjamin Nacar wrote:
> > Hi,
> > 
> > We're trying to use Slurm's built-in Nvidia GPU detection mechanism to avoid having to specify GPUs explicitly in slurm.conf and gres.conf. We're running Debian 11, and the version of Slurm available for Debian 11 is 20.11. However, the version of Slurm in the standard Debian repositories was apparently not compiled on a system with the necessary Nvidia library installed, so we recompiled Slurm 20.11 from the Debian source package with no modifications.
> > 
> > With AutoDetect=nvml in gres.conf and GresTypes=gpu in slurm.conf, this is what we see on a 4-GPU host after restarting slurmd:
> > 
> > [2021-11-29T15:49:58.226] Node reconfigured socket/core boundaries SocketsPerBoard=12:2(hw) CoresPerSocket=1:6(hw)
> > [2021-11-29T15:50:02.397] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
> > [2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
> > [2021-11-29T15:50:02.398] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
> > [2021-11-29T15:50:02.399] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
> > [2021-11-29T15:50:02.551] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
> > [2021-11-29T15:50:02.551] gres/gpu: _normalize_gres_conf: WARNING: The following autodetected GPUs are being ignored:
> > [2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
> > [2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
> > [2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
> > [2021-11-29T15:50:02.551]     GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(12):0-11  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
> > [2021-11-29T15:50:02.614] slurmd version 20.11.4 started
> > [2021-11-29T15:50:02.630] slurmd started on Mon, 29 Nov 2021 15:50:02 -0500
> > [2021-11-29T15:50:02.630] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=257840 TmpDisk=3951 Uptime=975072 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
> > 
> > Running "scontrol show node" for this host displays "Gres=(null)", and any attempt to submit a job with --gpus=1 results in "srun: error: Unable to allocate resources: Requested node configuration is not available".
> > 
> > Any idea what might be wrong?
> > 
> > Thanks,
> > ~~ bnacar
> > 
> 
> -- 
> Quirin Lohr
> Systemadministration
> Technische Universität München
> Fakultät für Informatik
> Lehrstuhl für Bildverarbeitung und Künstliche Intelligenz
> 
> Boltzmannstrasse 3
> 85748 Garching
> 
> Tel. +49 89 289 17769
> Fax +49 89 289 17757
> 
> quirin.lohr at in.tum.de
> www.vision.in.tum.de

-- 
Benjamin Nacar
Systems Programmer
Computer Science Department
Brown University
401.863.7621


