[slurm-users] NVML autodetect "Failed to get supported memory frequencies" error

Joshua Baker-LePain jlb at salilab.org
Fri Mar 5 04:36:15 UTC 2021


We're in the midst of transitioning our SGE cluster to Slurm 20.02.6, running 
on up-to-date CentOS 7.  We built RPMs from the standard tarball against CUDA 
10.1.  These RPMs worked just fine on our first GPU test node (with Tesla K80s) 
using "AutoDetect=nvml" in /etc/gres.conf.  However, we just tried to add a 
second host with GTX 1080s in it.  Running "slurmd -G" results in the following 
output:

slurmd: error:  _nvml_get_mem_freqs: Failed to get supported memory frequencies 
slurmd: error:  for the GPU : Not Supported
slurmd: error:  _nvml_get_mem_freqs: Failed to get supported memory frequencies 
slurmd: error:  for the GPU : Not Supported
slurmd: error:  _nvml_get_mem_freqs: Failed to get supported memory frequencies 
slurmd: error:  for the GPU : Not Supported
slurmd: error:  _nvml_get_mem_freqs: Failed to get supported memory frequencies 
slurmd: error:  for the GPU : Not Supported
slurmd:  4 GPU system device(s) detected
slurmd:  WARNING: The following autodetected GPUs are being ignored:
slurmd:      GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55 
slurmd:      Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
slurmd:      GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55 
slurmd:      Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
slurmd:      GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):0-27 
slurmd:      Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
slurmd:      GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):0-27 
slurmd:      Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
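
For reference, the devices listed in the warning above can be written out as 
explicit gres.conf entries.  The following is only a minimal sketch: the 
function names parse_ignored_gpus and to_gres_conf are hypothetical, the input 
is the "slurmd -G" output pasted above, and it assumes that the Cores ranges 
printed by autodetection can be reused as-is in a static "Cores=" setting.  
Whether a static configuration avoids the NVML error has not been verified 
here.

```python
import re

# Output of "slurmd -G" as pasted above (wrapped lines kept as they appear).
SLURMD_OUTPUT = """\
slurmd:  4 GPU system device(s) detected
slurmd:  WARNING: The following autodetected GPUs are being ignored:
slurmd:      GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55
slurmd:      Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
slurmd:      GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55
slurmd:      Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
slurmd:      GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):0-27
slurmd:      Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
slurmd:      GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):0-27
slurmd:      Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
"""

# One GRES record spans two wrapped lines; the pattern is applied after joining.
GRES_RE = re.compile(
    r"GRES\[(?P<name>\w+)\]\s+Type:(?P<type>\S+)\s+Count:(?P<count>\d+)\s+"
    r"Cores\(\d+\):(?P<cores>\S+)\s+Links:(?P<links>\S+)\s+"
    r"Flags:(?P<flags>\S+)\s+File:(?P<file>\S+)"
)


def parse_ignored_gpus(text):
    """Return one dict per GRES record found in the slurmd output."""
    # Strip the "slurmd:" prefix and rejoin wrapped lines into a single string.
    body = " ".join(
        line.split(":", 1)[1].strip()
        for line in text.splitlines()
        if line.startswith("slurmd:")
    )
    return [m.groupdict() for m in GRES_RE.finditer(body)]


def to_gres_conf(gpus):
    """Format parsed records as static gres.conf lines, ordered by device number."""
    def device_index(gpu):
        return int(re.search(r"(\d+)$", gpu["file"]).group(1))

    return [
        f"Name={g['name']} Type={g['type']} File={g['file']} Cores={g['cores']}"
        for g in sorted(gpus, key=device_index)
    ]


if __name__ == "__main__":
    for line in to_gres_conf(parse_ignored_gpus(SLURMD_OUTPUT)):
        print(line)
```

The resulting lines would replace "AutoDetect=nvml" in gres.conf on that node 
only; the Links information from autodetection is dropped in this sketch.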

My googling has utterly failed me on this.  Any help?  Thanks!

-- 
Joshua Baker-LePain
Wynton Cluster Sysadmin
UCSF



