Dear list,
after an update from 22.05 to 23.11, a host where I'm using MIG started to discard the MIG devices during the NVML setup.
I can see lines like these in the log:
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:a100_3g.39gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
so the MIG devices are discarded when the slurmd daemon starts.
This is my gres.conf file:
cat /etc/slurm/gres.conf
##################################################################
# Slurm's Generic Resource (GRES) configuration file
##################################################################
AutoDetect=nvml
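For comparison, if I understand the gres.conf man page correctly, explicit entries for the MIG instances (just a sketch built from the device files slurmd detects in the log below, not something I have deployed) would look roughly like this:

# hypothetical explicit MIG entries for the two instances on GPU 0,
# using the cap device files reported by slurmd's autodetection
Name=gpu Type=nvidia_a100_3g.40gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap12,/dev/nvidia-caps/nvidia-cap13
Name=gpu Type=nvidia_a100_3g.40gb MultipleFiles=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22

I would rather keep AutoDetect=nvml if possible, since the MIG layout on these nodes may change.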
The same file still works on hosts running 22.05.
Any idea what is going on?
I tried disabling MIG on the upgraded server, and then all the A100s are recognized without issues.
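(In case it matters, by "disable MIG" I mean roughly nvidia-smi -i <gpu> -mig 0 on each card followed by a GPU reset; I can post the exact commands if useful.)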
Thanks
Cristiano
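The full slurmd debug output from the affected node is below: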
slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:32 ThreadsPerCore:1
slurmd: debug: cgroup/v1: init: Cgroup v1 plugin loaded
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/lib/slurm/slurmd/hwloc_topo_whole.xml) found
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:32 ThreadsPerCore:1
slurmd: debug: gres/gpu: init: loaded
slurmd: debug: gpu/nvml: init: init: GPU NVML plugin loaded
slurmd: debug2: gpu/nvml: _nvml_init: Successfully initialized NVML
slurmd: debug: gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 550.54.15
slurmd: debug: gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 12.550.54.15
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVML API Version: 11
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Total CPU count: 64
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Device count: 4
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 0:
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-sxm4-80gb
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-808805ee-2ae5-ee6d-14fd-6e63028d4a2a
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:1:0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:01:00.0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: -1,0,0,0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 24-31
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 24-31
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: enabled
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG count: 2
slurmd: debug2: gpu/nvml: _handle_mig: GPU minor 0, MIG index 0:
slurmd: debug2: gpu/nvml: _handle_mig: MIG Profile: nvidia_a100_3g.40gb
slurmd: debug2: gpu/nvml: _handle_mig: MIG UUID: MIG-ab1b2cb1-d690-5564-885e-056f23c1c618
slurmd: debug2: gpu/nvml: _handle_mig: UniqueID: MIG-ab1b2cb1-d690-5564-885e-056f23c1c618
slurmd: debug2: gpu/nvml: _handle_mig: GPU Instance (GI) ID: 1
slurmd: debug2: gpu/nvml: _handle_mig: Compute Instance (CI) ID: 0
slurmd: debug2: gpu/nvml: _handle_mig: GI Minor Number: 12
slurmd: debug2: gpu/nvml: _handle_mig: CI Minor Number: 13
slurmd: debug2: gpu/nvml: _handle_mig: Device Files: /dev/nvidia0,/dev/nvidia-caps/nvidia-cap12,/dev/nvidia-caps/nvidia-cap13
slurmd: debug2: gpu/nvml: _handle_mig: GPU minor 0, MIG index 1:
slurmd: debug2: gpu/nvml: _handle_mig: MIG Profile: nvidia_a100_3g.40gb
slurmd: debug2: gpu/nvml: _handle_mig: MIG UUID: MIG-c43ecbcb-9f3f-5972-8eb9-6a2a09848bd5
slurmd: debug2: gpu/nvml: _handle_mig: UniqueID: MIG-c43ecbcb-9f3f-5972-8eb9-6a2a09848bd5
slurmd: debug2: gpu/nvml: _handle_mig: GPU Instance (GI) ID: 2
slurmd: debug2: gpu/nvml: _handle_mig: Compute Instance (CI) ID: 0
slurmd: debug2: gpu/nvml: _handle_mig: GI Minor Number: 21
slurmd: debug2: gpu/nvml: _handle_mig: CI Minor Number: 22
slurmd: debug2: gpu/nvml: _handle_mig: Device Files: /dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22
slurmd: debug2: Possible GPU Memory Frequencies (1):
slurmd: debug2: -------------------------------
slurmd: debug2: *1593 MHz [0]
slurmd: debug2: Possible GPU Graphics Frequencies (81):
slurmd: debug2: ---------------------------------
slurmd: debug2: *1410 MHz [0]
slurmd: debug2: *1395 MHz [1]
slurmd: debug2: ...
slurmd: debug2: *810 MHz [40]
slurmd: debug2: ...
slurmd: debug2: *225 MHz [79]
slurmd: debug2: *210 MHz [80]
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 1:
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-sxm4-80gb
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-05911075-e3e1-973b-708a-33a77ddd381c
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:65:0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:41:00.0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,-1,0,0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia1
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 8-15
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 8-15
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: enabled
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG count: 2
slurmd: debug2: gpu/nvml: _handle_mig: GPU minor 1, MIG index 0:
slurmd: debug2: gpu/nvml: _handle_mig: MIG Profile: nvidia_a100_3g.40gb
slurmd: debug2: gpu/nvml: _handle_mig: MIG UUID: MIG-36b02f58-efa6-5071-9956-510c8a8e705a
slurmd: debug2: gpu/nvml: _handle_mig: UniqueID: MIG-36b02f58-efa6-5071-9956-510c8a8e705a
slurmd: debug2: gpu/nvml: _handle_mig: GPU Instance (GI) ID: 1
slurmd: debug2: gpu/nvml: _handle_mig: Compute Instance (CI) ID: 0
slurmd: debug2: gpu/nvml: _handle_mig: GI Minor Number: 147
slurmd: debug2: gpu/nvml: _handle_mig: CI Minor Number: 148
slurmd: debug2: gpu/nvml: _handle_mig: Device Files: /dev/nvidia1,/dev/nvidia-caps/nvidia-cap147,/dev/nvidia-caps/nvidia-cap148
slurmd: debug2: gpu/nvml: _handle_mig: GPU minor 1, MIG index 1:
slurmd: debug2: gpu/nvml: _handle_mig: MIG Profile: nvidia_a100_3g.40gb
slurmd: debug2: gpu/nvml: _handle_mig: MIG UUID: MIG-c5ca2779-f34d-544b-af52-55ef14aa0af6
slurmd: debug2: gpu/nvml: _handle_mig: UniqueID: MIG-c5ca2779-f34d-544b-af52-55ef14aa0af6
slurmd: debug2: gpu/nvml: _handle_mig: GPU Instance (GI) ID: 2
slurmd: debug2: gpu/nvml: _handle_mig: Compute Instance (CI) ID: 0
slurmd: debug2: gpu/nvml: _handle_mig: GI Minor Number: 156
slurmd: debug2: gpu/nvml: _handle_mig: CI Minor Number: 157
slurmd: debug2: gpu/nvml: _handle_mig: Device Files: /dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157
slurmd: debug2: Possible GPU Memory Frequencies (1):
slurmd: debug2: -------------------------------
slurmd: debug2: *1593 MHz [0]
slurmd: debug2: Possible GPU Graphics Frequencies (81):
slurmd: debug2: ---------------------------------
slurmd: debug2: *1410 MHz [0]
slurmd: debug2: *1395 MHz [1]
slurmd: debug2: ...
slurmd: debug2: *810 MHz [40]
slurmd: debug2: ...
slurmd: debug2: *225 MHz [79]
slurmd: debug2: *210 MHz [80]
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 2:
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-sxm4-80gb
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-7b67964f-9217-908f-042f-f37f06cbb1a2
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:129:0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:81:00.0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,0,-1,4
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia2
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 56-63
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 56-63
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: disabled
slurmd: debug2: Possible GPU Memory Frequencies (1):
slurmd: debug2: -------------------------------
slurmd: debug2: *1593 MHz [0]
slurmd: debug2: Possible GPU Graphics Frequencies (81):
slurmd: debug2: ---------------------------------
slurmd: debug2: *1410 MHz [0]
slurmd: debug2: *1395 MHz [1]
slurmd: debug2: ...
slurmd: debug2: *810 MHz [40]
slurmd: debug2: ...
slurmd: debug2: *225 MHz [79]
slurmd: debug2: *210 MHz [80]
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 3:
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a100-sxm4-80gb
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-87d7526f-6a29-214e-1c24-722a20211263
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:193:0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:C1:00.0
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,0,4,-1
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia3
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 40-47
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 40-47
slurmd: debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: disabled
slurmd: debug2: Possible GPU Memory Frequencies (1):
slurmd: debug2: -------------------------------
slurmd: debug2: *1593 MHz [0]
slurmd: debug2: Possible GPU Graphics Frequencies (81):
slurmd: debug2: ---------------------------------
slurmd: debug2: *1410 MHz [0]
slurmd: debug2: *1395 MHz [1]
slurmd: debug2: ...
slurmd: debug2: *810 MHz [40]
slurmd: debug2: ...
slurmd: debug2: *225 MHz [79]
slurmd: debug2: *210 MHz [80]
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: debug: Gres GPU plugin: Merging configured GRES with system GPUs
slurmd: debug2: gres/gpu: _merge_system_gres_conf: gres_list_conf:
slurmd: debug2: GRES[gpu] Type:a100_3g.39gb Count:4 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: debug2: GRES[gpu] Type:a100-sxm4-80gb Count:2 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: debug: gres/gpu: _merge_system_gres_conf: Including the following GPU matched between system and configuration:
slurmd: debug: GRES[gpu] Type:a100-sxm4-80gb Count:1 Cores(64):56-63 Links:0,0,-1,4 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2 UniqueId:(null)
slurmd: debug: gres/gpu: _merge_system_gres_conf: Including the following GPU matched between system and configuration:
slurmd: debug: GRES[gpu] Type:a100-sxm4-80gb Count:1 Cores(64):40-47 Links:0,0,4,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3 UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:a100_3g.39gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:a100_3g.39gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:a100_3g.39gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:a100_3g.39gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: warning: The following autodetected GPUs are being ignored:
slurmd: GRES[gpu] Type:nvidia_a100_3g.40gb Count:1 Cores(64):24-31 Links:(null) Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia0,/dev/nvidia-caps/nvidia-cap12,/dev/nvidia-caps/nvidia-cap13 UniqueId:MIG-ab1b2cb1-d690-5564-885e-056f23c1c618
slurmd: GRES[gpu] Type:nvidia_a100_3g.40gb Count:1 Cores(64):24-31 Links:(null) Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22 UniqueId:MIG-c43ecbcb-9f3f-5972-8eb9-6a2a09848bd5
slurmd: GRES[gpu] Type:nvidia_a100_3g.40gb Count:1 Cores(64):8-15 Links:(null) Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia1,/dev/nvidia-caps/nvidia-cap147,/dev/nvidia-caps/nvidia-cap148 UniqueId:MIG-36b02f58-efa6-5071-9956-510c8a8e705a
slurmd: GRES[gpu] Type:nvidia_a100_3g.40gb Count:1 Cores(64):8-15 Links:(null) Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157 UniqueId:MIG-c5ca2779-f34d-544b-af52-55ef14aa0af6
slurmd: debug2: gres/gpu: _merge_system_gres_conf: gres_list_gpu
slurmd: debug2: GRES[gpu] Type:a100-sxm4-80gb Count:1 Cores(64):56-63 Links:0,0,-1,4 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2 UniqueId:(null)
slurmd: debug2: GRES[gpu] Type:a100-sxm4-80gb Count:1 Cores(64):40-47 Links:0,0,4,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3 UniqueId:(null)
slurmd: debug: Gres GPU plugin: Final merged GRES list:
slurmd: debug: GRES[gpu] Type:a100-sxm4-80gb Count:1 Cores(64):56-63 Links:0,0,-1,4 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2 UniqueId:(null)
slurmd: debug: GRES[gpu] Type:a100-sxm4-80gb Count:1 Cores(64):40-47 Links:0,0,4,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3 UniqueId:(null)
slurmd: Gres Name=gpu Type=a100-sxm4-80gb Count=1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=a100-sxm4-80gb Count=1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: topology/default: init: topology Default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug: task/cgroup: init: Tasks containment cgroup plugin loaded
slurmd: task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
slurmd: debug: spank: opening plugin stack /var/lib/slurm/slurmd/conf-cache/plugstack.conf
slurmd: debug: /var/lib/slurm/slurmd/conf-cache/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
slurmd: cred/munge: init: Munge credential signature plugin loaded
slurmd: warning: Core limit is only 0 KB
slurmd: slurmd version 23.11.4 started
slurmd: debug2: No acct_gather.conf file (/var/lib/slurm/slurmd/conf-cache/acct_gather.conf)
slurmd: debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
slurmd: debug: MPI: Loading all types
slurmd: debug: mpi/pmix_v4: init: PMIx plugin loaded
slurmd: debug: mpi/pmix_v4: init: PMIx plugin loaded
slurmd: debug2: No mpi.conf file (/var/lib/slurm/slurmd/conf-cache/mpi.conf)
slurmd: slurmd started on Fri, 10 May 2024 14:30:30 +0000
slurmd: CPUs=64 Boards=1 Sockets=2 Cores=32 Threads=1 Memory=1031664 TmpDisk=0 Uptime=2512944 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
^Cslurmd: got shutdown request
slurmd: all threads complete
slurmd: debug: mpi/pmix_v4: fini: (null) [0]: mpi_pmix.c:203: (null): call fini()
slurmd: debug: mpi/pmix_v4: fini: (null) [0]: mpi_pmix.c:203: (null): call fini()
slurmd: debug: jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded
slurmd: debug2: acct_gather_profile_startpoll: poll already ended!
slurmd: debug: task/cgroup: fini: Tasks containment cgroup plugin unloaded
slurmd: debug: task/affinity: fini: task affinity plugin unloaded
slurmd: debug: hash/k12: fini: fini: unloading KangarooTwelve hash plugin
slurmd: debug: gres/gpu: fini: unloading
^Cslurmd: debug2: gpu/nvml: _nvml_shutdown: Successfully shut down NVML
slurmd: debug: gpu/nvml: fini: fini: unloading GPU NVML plugin
slurmd: debug2: acct_gather_profile_startpoll: poll already ended!
slurmd: debug: cgroup/v1: fini: unloading Cgroup v1 plugin
slurmd: cred/munge: fini: Munge credential signature plugin unloaded
slurmd: Slurmd shutdown completing