Dear list,
After an update from 22.05 to 23.11, a host where I'm using MIG started to discard the MIG devices during the NVML setup.
I can see these lines in the log (the full slurmd debug log is appended below):
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:a100_3g.39gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
The MIG devices are discarded during the start of the slurmd daemon.
This is my gres.conf file:
cat /etc/slurm/gres.conf
##################################################################
# Slurm's Generic Resource (GRES) configuration file
##################################################################
AutoDetect=nvml
That same configuration still works for hosts on 22.05.
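For reference, based on the device files slurmd detects further down in the log, my understanding is that an explicit gres.conf for this node would look roughly like the sketch below, with MultipleFiles= for the MIG slices and File= for the whole cards. The Type names are only my guess from what NVML autodetects (judging from the gres_list_conf entries in the log, slurm.conf seems to declare the type as a100_3g.39gb), and I have not actually tested this; I would prefer to keep AutoDetect=nvml.
# untested sketch, device paths taken from the slurmd log below
Name=gpu Type=nvidia_a100_3g.40gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap12,/dev/nvidia-caps/nvidia-cap13
Name=gpu Type=nvidia_a100_3g.40gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22
Name=gpu Type=nvidia_a100_3g.40gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap147,/dev/nvidia-caps/nvidia-cap148
Name=gpu Type=nvidia_a100_3g.40gb MultipleFiles=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157
Name=gpu Type=a100-sxm4-80gb File=/dev/nvidia2
Name=gpu Type=a100-sxm4-80gb File=/dev/nvidia3
The cap device numbers here are simply the ones slurmd reports and would of course have to match whatever the node actually exposes.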
Any idea what is going on?
I tried disabling MIG on the upgraded server, and then all the A100s are recognized without issues.
Thanks
Cristiano
slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:32 ThreadsPerCore:1
slurmd: debug: cgroup/v1: init: Cgroup v1 plugin loaded
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/lib/slurm/slurmd/hwloc_topo_whole.xml) found
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:32 ThreadsPerCore:1
slurmd: debug: gres/gpu: init: loaded
slurmd: debug: gpu/nvml: init: init: GPU NVML plugin loaded
slurmd: debug2: gpu/nvml: _nvml_init: Successfully initialized NVML
slurmd: debug: gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 550.54.15
slurmd: debug: gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 12.550.54.15
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVML API Version: 11
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Total CPU count: 64
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Device count: 4
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 0:
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-sxm4-80gb
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-808805ee-2ae5-ee6d-14fd-6e63028d4a2a
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:1:0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:01:00.0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: -1,0,0,0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 24-31
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 24-31
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: enabled
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG count: 2
slurmd: debug2: gpu/nvml: _handle_mig: GPU minor 0, MIG index 0:
slurmd: debug2: gpu/nvml: _handle_mig: MIG Profile: nvidia_a100_3g.40gb
slurmd: debug2: gpu/nvml: _handle_mig: MIG UUID: MIG-ab1b2cb1-d690-5564-885e-056f23c1c618
slurmd: debug2: gpu/nvml: _handle_mig: UniqueID: MIG-ab1b2cb1-d690-5564-885e-056f23c1c618
slurmd: debug2: gpu/nvml: _handle_mig: GPU Instance (GI) ID: 1
slurmd: debug2: gpu/nvml: _handle_mig: Compute Instance (CI) ID: 0
slurmd: debug2: gpu/nvml: _handle_mig: GI Minor Number: 12
slurmd: debug2: gpu/nvml: _handle_mig: CI Minor Number: 13
slurmd: debug2: gpu/nvml: _handle_mig: Device Files: /dev/nvidia0,/dev/nvidia-caps/nvidia-cap12,/dev/nvidia-caps/nvidia-cap13
slurmd: debug2: gpu/nvml: _handle_mig: GPU minor 0, MIG index 1:
slurmd: debug2: gpu/nvml: _handle_mig: MIG Profile: nvidia_a100_3g.40gb
slurmd: debug2: gpu/nvml: _handle_mig: MIG UUID: MIG-c43ecbcb-9f3f-5972-8eb9-6a2a09848bd5
slurmd: debug2: gpu/nvml: _handle_mig: UniqueID: MIG-c43ecbcb-9f3f-5972-8eb9-6a2a09848bd5
slurmd: debug2: gpu/nvml: _handle_mig: GPU Instance (GI) ID: 2
slurmd: debug2: gpu/nvml: _handle_mig: Compute Instance (CI) ID: 0
slurmd: debug2: gpu/nvml: _handle_mig: GI Minor Number: 21
slurmd: debug2: gpu/nvml: _handle_mig: CI Minor Number: 22
slurmd: debug2: gpu/nvml: _handle_mig: Device Files: /dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22
slurmd: debug2: Possible GPU Memory Frequencies (1):
slurmd: debug2: -------------------------------
slurmd: debug2: *1593 MHz [0]
slurmd: debug2: Possible GPU Graphics Frequencies (81):
slurmd: debug2: ---------------------------------
slurmd: debug2: *1410 MHz [0]
slurmd: debug2: *1395 MHz [1]
slurmd: debug2: ...
slurmd: debug2: *810 MHz [40]
slurmd: debug2: ...
slurmd: debug2: *225 MHz [79]
slurmd: debug2: *210 MHz [80]
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 1:
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-sxm4-80gb
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-05911075-e3e1-973b-708a-33a77ddd381c
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:65:0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:41:00.0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,-1,0,0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia1
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 8-15
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 8-15
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: enabled
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG count: 2
slurmd: debug2: gpu/nvml: _handle_mig: GPU minor 1, MIG index 0:
slurmd: debug2: gpu/nvml: _handle_mig: MIG Profile: nvidia_a100_3g.40gb
slurmd: debug2: gpu/nvml: _handle_mig: MIG UUID: MIG-36b02f58-efa6-5071-9956-510c8a8e705a
slurmd: debug2: gpu/nvml: _handle_mig: UniqueID: MIG-36b02f58-efa6-5071-9956-510c8a8e705a
slurmd: debug2: gpu/nvml: _handle_mig: GPU Instance (GI) ID: 1
slurmd: debug2: gpu/nvml: _handle_mig: Compute Instance (CI) ID: 0
slurmd: debug2: gpu/nvml: _handle_mig: GI Minor Number: 147
slurmd: debug2: gpu/nvml: _handle_mig: CI Minor Number: 148
slurmd: debug2: gpu/nvml: _handle_mig: Device Files: /dev/nvidia1,/dev/nvidia-caps/nvidia-cap147,/dev/nvidia-caps/nvidia-cap148
slurmd: debug2: gpu/nvml: _handle_mig: GPU minor 1, MIG index 1:
slurmd: debug2: gpu/nvml: _handle_mig: MIG Profile: nvidia_a100_3g.40gb
slurmd: debug2: gpu/nvml: _handle_mig: MIG UUID: MIG-c5ca2779-f34d-544b-af52-55ef14aa0af6
slurmd: debug2: gpu/nvml: _handle_mig: UniqueID: MIG-c5ca2779-f34d-544b-af52-55ef14aa0af6
slurmd: debug2: gpu/nvml: _handle_mig: GPU Instance (GI) ID: 2
slurmd: debug2: gpu/nvml: _handle_mig: Compute Instance (CI) ID: 0
slurmd: debug2: gpu/nvml: _handle_mig: GI Minor Number: 156
slurmd: debug2: gpu/nvml: _handle_mig: CI Minor Number: 157
slurmd: debug2: gpu/nvml: _handle_mig: Device Files: /dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157
slurmd: debug2: Possible GPU Memory Frequencies (1):
slurmd: debug2: -------------------------------
slurmd: debug2: *1593 MHz [0]
slurmd: debug2: Possible GPU Graphics Frequencies (81):
slurmd: debug2: ---------------------------------
slurmd: debug2: *1410 MHz [0]
slurmd: debug2: *1395 MHz [1]
slurmd: debug2: ...
slurmd: debug2: *810 MHz [40]
slurmd: debug2: ...
slurmd: debug2: *225 MHz [79]
slurmd: debug2: *210 MHz [80]
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 2:
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-sxm4-80gb
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-7b67964f-9217-908f-042f-f37f06cbb1a2
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:129:0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:81:00.0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,0,-1,4
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia2
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 56-63
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 56-63
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: disabled
slurmd: debug2: Possible GPU Memory Frequencies (1):
slurmd: debug2: -------------------------------
slurmd: debug2: *1593 MHz [0]
slurmd: debug2: Possible GPU Graphics Frequencies (81):
slurmd: debug2: ---------------------------------
slurmd: debug2: *1410 MHz [0]
slurmd: debug2: *1395 MHz [1]
slurmd: debug2: ...
slurmd: debug2: *810 MHz [40]
slurmd: debug2: ...
slurmd: debug2: *225 MHz [79]
slurmd: debug2: *210 MHz [80]
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 3:
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-sxm4-80gb
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-87d7526f-6a29-214e-1c24-722a20211263
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:193:0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:C1:00.0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,0,4,-1
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia3
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 40-47
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 40-47
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: disabled
slurmd: debug2: Possible GPU Memory Frequencies (1):
slurmd: debug2: -------------------------------
slurmd: debug2: *1593 MHz [0]
slurmd: debug2: Possible GPU Graphics Frequencies (81):
slurmd: debug2: ---------------------------------
slurmd: debug2: *1410 MHz [0]
slurmd: debug2: *1395 MHz [1]
slurmd: debug2: ...
slurmd: debug2: *810 MHz [40]
slurmd: debug2: ...
slurmd: debug2: *225 MHz [79]
slurmd: debug2: *210 MHz [80]
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: debug: Gres GPU plugin: Merging configured GRES with system GPUs
slurmd: debug2: gres/gpu: _merge_system_gres_conf: gres_list_conf:
slurmd: debug2: GRES[gpu] Type:a100_3g.39gb Count:4 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: debug2: GRES[gpu] Type:a100-sxm4-80gb Count:2 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: debug: gres/gpu: _merge_system_gres_conf: Including the following GPU matched between system and configuration:
slurmd: debug: GRES[gpu] Type:a100-sxm4-80gb Count:1 Cores(64):56-63 Links:0,0,-1,4 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2 UniqueId:(null)
slurmd: debug: gres/gpu: _merge_system_gres_conf: Including the following GPU matched between system and configuration:
slurmd: debug: GRES[gpu] Type:a100-sxm4-80gb Count:1 Cores(64):40-47 Links:0,0,4,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3 UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:a100_3g.39gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:a100_3g.39gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:a100_3g.39gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:a100_3g.39gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: warning: The following autodetected GPUs are being ignored:
slurmd: GRES[gpu] Type:nvidia_a100_3g.40gb Count:1 Cores(64):24-31 Links:(null) Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia0,/dev/nvidia-caps/nvidia-cap12,/dev/nvidia-caps/nvidia-cap13 UniqueId:MIG-ab1b2cb1-d690-5564-885e-056f23c1c618
slurmd: GRES[gpu] Type:nvidia_a100_3g.40gb Count:1 Cores(64):24-31 Links:(null) Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22 UniqueId:MIG-c43ecbcb-9f3f-5972-8eb9-6a2a09848bd5
slurmd: GRES[gpu] Type:nvidia_a100_3g.40gb Count:1 Cores(64):8-15 Links:(null) Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia1,/dev/nvidia-caps/nvidia-cap147,/dev/nvidia-caps/nvidia-cap148 UniqueId:MIG-36b02f58-efa6-5071-9956-510c8a8e705a
slurmd: GRES[gpu] Type:nvidia_a100_3g.40gb Count:1 Cores(64):8-15 Links:(null) Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157 UniqueId:MIG-c5ca2779-f34d-544b-af52-55ef14aa0af6
slurmd: debug2: gres/gpu: _merge_system_gres_conf: gres_list_gpu
slurmd: debug2: GRES[gpu] Type:a100-sxm4-80gb Count:1 Cores(64):56-63 Links:0,0,-1,4 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2 UniqueId:(null)
slurmd: debug2: GRES[gpu] Type:a100-sxm4-80gb Count:1 Cores(64):40-47 Links:0,0,4,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3 UniqueId:(null)
slurmd: debug: Gres GPU plugin: Final merged GRES list:
slurmd: debug: GRES[gpu] Type:a100-sxm4-80gb Count:1 Cores(64):56-63 Links:0,0,-1,4 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2 UniqueId:(null)
slurmd: debug: GRES[gpu] Type:a100-sxm4-80gb Count:1 Cores(64):40-47 Links:0,0,4,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3 UniqueId:(null)
slurmd: Gres Name=gpu Type=a100-sxm4-80gb Count=1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100-sxm4-80gb Count=1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: topology/default: init: topology Default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug: task/cgroup: init: Tasks containment cgroup plugin loaded
slurmd: task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
slurmd: debug: spank: opening plugin stack /var/lib/slurm/slurmd/conf-cache/plugstack.conf
slurmd: debug: /var/lib/slurm/slurmd/conf-cache/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
slurmd: cred/munge: init: Munge credential signature plugin loaded
slurmd: warning: Core limit is only 0 KB
slurmd: slurmd version 23.11.4 started
slurmd: debug2: No acct_gather.conf file (/var/lib/slurm/slurmd/conf-cache/acct_gather.conf)
slurmd: debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
slurmd: debug: MPI: Loading all types
slurmd: debug: mpi/pmix_v4: init: PMIx plugin loaded
slurmd: debug: mpi/pmix_v4: init: PMIx plugin loaded
slurmd: debug2: No mpi.conf file (/var/lib/slurm/slurmd/conf-cache/mpi.conf)
slurmd: slurmd started on Fri, 10 May 2024 14:30:30 +0000
slurmd: CPUs=64 Boards=1 Sockets=2 Cores=32 Threads=1 Memory=1031664 TmpDisk=0 Uptime=2512944 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
^Cslurmd: got shutdown request
slurmd: all threads complete
slurmd: debug: mpi/pmix_v4: fini: (null) [0]: mpi_pmix.c:203: (null): call fini()
slurmd: debug: mpi/pmix_v4: fini: (null) [0]: mpi_pmix.c:203: (null): call fini()
slurmd: debug: jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded
slurmd: debug2: acct_gather_profile_startpoll: poll already ended!
slurmd: debug: task/cgroup: fini: Tasks containment cgroup plugin unloaded
slurmd: debug: task/affinity: fini: task affinity plugin unloaded
slurmd: debug: hash/k12: fini: fini: unloading KangarooTwelve hash plugin
slurmd: debug: gres/gpu: fini: unloading
^Cslurmd: debug2: gpu/nvml: _nvml_shutdown: Successfully shut down NVML
slurmd: debug: gpu/nvml: fini: fini: unloading GPU NVML plugin
slurmd: debug2: acct_gather_profile_startpoll: poll already ended!
slurmd: debug: cgroup/v1: fini: unloading Cgroup v1 plugin
slurmd: cred/munge: fini: Munge credential signature plugin unloaded
slurmd: Slurmd shutdown completing