[slurm-users] NVML autodetect "Failed to get supported memory frequencies" error
Joshua Baker-LePain
jlb at salilab.org
Fri Mar 5 04:36:15 UTC 2021
We're in the midst of transitioning our SGE cluster to Slurm 20.02.6, running
on up-to-date CentOS-7. We built RPMs from the standard tarball against CUDA
10.1. These RPMs worked just fine on our first GPU test node (with Tesla K80s)
using "AutoDetect=nvml" in /etc/gres.conf. However, we just tried to add a
second host with GTX 1080s. Running "slurmd -G" on that host results in the
following output:
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: 4 GPU system device(s) detected
slurmd: WARNING: The following autodetected GPUs are being ignored:
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55
slurmd: Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55
slurmd: Links:0,0,-1,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia2
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):0-27
slurmd: Links:0,-1,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia1
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):0-27
slurmd: Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
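
For anyone comparing this output across nodes, here is a minimal Python sketch
that counts the NVML memory-frequency errors and lists the autodetected GPUs.
It assumes the two-line record layout shown above (a "GRES[...]" line followed
by a "Links:... File:..." line). The helper name summarize_slurmd_gres and the
SAMPLE text are hypothetical and not part of Slurm.

```python
import re

# Hypothetical helper, not part of Slurm: summarizes text captured from
# "slurmd -G", assuming the line layout shown in the output above.
GRES_RE = re.compile(r"GRES\[(\w+)\] Type:(\S+) Count:(\d+) Cores\(\d+\):(\S+)")
FILE_RE = re.compile(r"File:(\S+)")


def summarize_slurmd_gres(text):
    """Return (nvml_error_count, devices) parsed from slurmd -G output."""
    errors = 0
    devices = []
    for line in text.splitlines():
        # Each failed memory-frequency query logs the function name once.
        if "_nvml_get_mem_freqs" in line:
            errors += 1
            continue
        gres = GRES_RE.search(line)
        if gres:
            devices.append({
                "name": gres.group(1),
                "type": gres.group(2),
                "count": int(gres.group(3)),
                "cores": gres.group(4),
                "file": None,
            })
            continue
        # The device file appears on the line after its GRES line.
        dev_file = FILE_RE.search(line)
        if dev_file and devices and devices[-1]["file"] is None:
            devices[-1]["file"] = dev_file.group(1)
    return errors, devices


# Shortened sample copied from the output above (one GPU only).
SAMPLE = """\
slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory frequencies
slurmd: error: for the GPU : Not Supported
slurmd: 4 GPU system device(s) detected
slurmd: GRES[gpu] Type:geforce_gtx_1080 Count:1 Cores(1024):28-55
slurmd: Links:0,0,0,-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia3
"""

if __name__ == "__main__":
    errors, devices = summarize_slurmd_gres(SAMPLE)
    print("NVML memory-frequency errors:", errors)
    for dev in devices:
        print(dev["type"], dev["cores"], dev["file"])
```

The sketch only reads text that has already been captured. It does not call
slurmd or NVML itself, so it can be run on any machine against saved output.
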
My googling has utterly failed me on this. Any help? Thanks!
--
Joshua Baker-LePain
Wynton Cluster Sysadmin
UCSF