Good afternoon,
I have a cluster managed by Base Command Manager (v10) with several NVIDIA DGX nodes. dgx09 is a problem child: the entire node was RMA'd, and it no longer behaves the same as my other DGXs. I think the symptoms below are caused by a single underlying issue.
Symptoms:

1. When I look at our 8 non-MIG DGXs via `scontrol show node=dgxXY | grep Gres`, 7/8 DGXs report "Gres=gpu:H100:8(S:0-1)" while dgx09 reports "Gres=gpu:h100:8(S:0-1)".

2. When I submit a job to this node, I get:

$ srun --reservation=g09_test --gres=gpu:2 --pty bash
srun: error: Unable to create step for job 105035: Invalid generic resource (gres) specification

### No job is running on the node, yet AllocTRES shows consumed resources...
$ scontrol show node=dgx09 | grep -i AllocTRES
   AllocTRES=gres/gpu=2

### dgx09 : /var/log/slurmd contains no information
### slurmctld shows :
root@h01:# grep 105035 /var/log/slurmctld
[2025-11-13T07:44:56.380] sched: _slurm_rpc_allocate_resources JobId=105035 NodeList=dgx09 usec=3420
[2025-11-13T07:44:56.482] _job_complete: JobId=105035 WTERMSIG 1
[2025-11-13T07:44:56.483] _job_complete: JobId=105035 done
Configuration:

1. gres.conf:

# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
AutoDetect=NVML
NodeName=dgx[01,02] Name=gpu Type=1g.20gb Count=32 AutoDetect=NVML
NodeName=dgx[03-10] Name=gpu Type=h100 Count=8 AutoDetect=NVML
# END AUTOGENERATED SECTION -- DO NOT REMOVE
2. grep NodeName slurm.conf:

root@h01:# grep NodeName slurm.conf
NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
3. What slurmd detects on dgx09
root@dgx09:~# slurmd -C
NodeName=dgx09 CPUs=224 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 RealMemory=2063937 UpTime=8-00:39:10
root@dgx09:~# slurmd -G
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
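For reference, a quick way to line up the type string each node's slurmd detects against what slurm.conf declares (only a sketch: assumes root ssh to the DGXs, and the slurm.conf path is a guess for this install) would be something like:

for n in dgx08 dgx09; do
    echo "== $n =="
    # type string exactly as NVML autodetection reports it to slurmd on that node
    ssh root@"$n" 'slurmd -G 2>&1 | grep -o "Type=[^ ]*" | sort -u'
done
# type string as declared in slurm.conf
grep -o 'Gres=gpu:[^ ]*' /etc/slurm/slurm.conf | sort -u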
Questions:

1. As far as I can tell, dgx09 is identical to all my non-MIG DGX nodes in terms of configuration and hardware. Why does scontrol report it as having 'h100' with a lowercase 'h', unlike the other DGXs, which report an uppercase 'H'?

2. Why is dgx09 not accepting GPU jobs, and why does it afterwards think GPUs are allocated even though no jobs are on the node?

3. Are there additional tests or configuration checks I can run to probe the differences between dgx09 and all my other nodes?
Best regards, Lee
I work for AMD... The diagnostics I would run are lspci and nvidia-smi.
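For instance (only a sketch; run on dgx09 and on a known-good node for comparison):

# confirm the GPUs enumerate identically on the PCI bus
lspci -nn | grep -i nvidia

# confirm the driver sees all eight boards and note the exact product name strings
nvidia-smi -L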
On Thu, 13 Nov 2025 at 19:18, Lee via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hello,
Thank you for the suggestion.
I ran lspci on dgx09 and a working DGX and the output was identical.
nvidia-smi shows all 8 GPUs and looks the same as the output from a working DGX:

root@dgx09:~# nvidia-smi
Fri Nov 14 07:11:05 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1B:00.0 Off |                    0 |
| N/A   29C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:43:00.0 Off |                    0 |
| N/A   30C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                    0 |
| N/A   33C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:61:00.0 Off |                    0 |
| N/A   31C    P0              73W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:9D:00.0 Off |                    0 |
| N/A   29C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:C3:00.0 Off |                    0 |
| N/A   28C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:D1:00.0 Off |                    0 |
| N/A   30C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                    0 |
| N/A   32C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
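If it helps, a more scriptable way to diff the GPU identification strings between dgx09 and a known-good node (hostnames here are examples; assumes root ssh from the head node) might be:

for n in dgx08 dgx09; do
    echo "== $n =="
    ssh root@"$n" 'nvidia-smi --query-gpu=name,pci.bus_id,vbios_version --format=csv,noheader'
done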
Best regards, Lee
On Fri, Nov 14, 2025 at 3:53 AM John Hearns hearnsj@gmail.com wrote:
Hi Lee,
I manage a BCM cluster as well. Does DGX09 have the same disk image and libraries in place? Could the NVIDIA NVML library used to auto-detect the GPUs be a different version and be causing the case difference?
If you compare the output of scontrol show node dgx09 and another DGX node, do they look the same? Does scontrol show config look different on DGX09 and other nodes?
Have you restarted the Slurm controllers (slurmctld) and restarted slurmd on the compute nodes?
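For example (run from a login node; dgx08 just stands in for any known-good node):

# node records as the controller sees them
diff <(scontrol show node dgx08) <(scontrol show node dgx09)

# effective config as each node sees it (assumes root ssh)
diff <(ssh root@dgx08 scontrol show config) <(ssh root@dgx09 scontrol show config)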
Kind regards
--
Mick Timony
Senior DevOps Engineer
LASER, Longwood, & O2 Cluster Admin
Harvard Medical School

________________________________
From: Lee via slurm-users <slurm-users@lists.schedmd.com>
Sent: Friday, November 14, 2025 7:17 AM
To: John Hearns <hearnsj@gmail.com>
Cc: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: Invalid generic resource (gres) specification after RMA
Hello,
Sorry for the delayed response, SC25 interfered with my schedule.
Answers:

1. Yes, dgx09 and all the others boot the same software images.
2. dgx09 and the other nodes mount a shared file system where Slurm is installed, so /cm/shared/apps/slurm/23.02.6/lib64/slurm/gpu_nvml.so is the same for every node. I assume the library that is used for autodetection lives there. I also found a shared library /usr/lib/x86_64-linux-gnu/libnvml_injection.so.1.0 (within the software image). I checked the md5sum and it is the same on both dgx09 and a non-broken node.
3. `scontrol show config` is the same on dgx09 and a non-broken DGX.
4. The only meaningful difference between `scontrol show node` for dgx09 and dgx08 (a working node) is:

< Gres=gpu:h100:8(S:0-1)
---
> Gres=gpu:H100:8(S:0-1)
5. Yes, we've restarted slurmd and slurmctld several times; the behavior persists. Of note, when I run `scontrol reconfigure`, the phantom allocated GPUs (see AllocTRES in the original post) are cleared.
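For anyone following along, the before/after check around the reconfigure is nothing clever, roughly:

scontrol show node dgx09 | grep -i AllocTRES   # reports gres/gpu=2 even with no jobs running
scontrol reconfigure
scontrol show node dgx09 | grep -i AllocTRES   # phantom allocation is gone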
Important update: We recently had a GPU tray replaced in another DGX, and now that node is exhibiting the same behavior as dgx09. I am now more convinced that there is something subtle about how the replaced hardware is being detected by Slurm.
Best regards, Lee
On Mon, Nov 17, 2025 at 4:06 PM Timony, Mick michael_timony@hms.harvard.edu wrote:
> NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
Just in case, that line shows you are missing a bracket in the node name. Are you *actually* missing the bracket?
On Tue, Nov 25, 2025 at 9:11 AM Lee via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hello,
@Russell - good catch. No, I'm not actually missing the square bracket; it got lost during the copy/paste. I'll restate it below for clarity:

2. grep NodeName slurm.conf

root@h01:# grep NodeName slurm.conf
NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
NodeName=dgx[03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
@Keshav: It still doesn't work:

user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:h100:2 --pty bash
srun: error: Unable to create step for job 107044: Invalid generic resource (gres) specification
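In case it helps narrow things down, the next tests I plan to try are requesting the typed GRES spelled exactly as slurm.conf has it, plus an untyped request on a known-good node for contrast (dgx08 is just a stand-in); these are diagnostics, not a fix:

user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:H100:2 --pty bash   # typed exactly as in slurm.conf
user@l01:~$ srun --nodelist=dgx08 --gres=gpu:2 --pty bash                               # sanity check on a working node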
Best, Lee
On Tue, Nov 25, 2025 at 12:49 PM Russell Jones via slurm-users <slurm-users@lists.schedmd.com> wrote:
Can you give the output of "scontrol show node dgx09"?

Are there any errors in your slurmctld.log?

Are there any errors in slurmd.log on the dgx09 node?
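If nothing useful shows up at the current log verbosity, one option (just a suggestion; paths and service handling depend on your install) is to raise the debug level and watch slurmd in the foreground on dgx09 while reproducing the failing srun:

# on the controller
scontrol setdebug debug2
grep -i gres /var/log/slurmctld

# on dgx09: stop the slurmd service first, then run it in the foreground with extra verbosity
slurmd -D -vvv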
On Tue, Nov 25, 2025 at 12:13 PM Lee leewithemily@gmail.com wrote:
Hello,
@Russel - good catch. No, I'm not actually missing the square bracket. It got lost during the copy/paste. I'll restate it below for clarity : 2. grep NodeName slurm.conf root@h01:# grep NodeName slurm.conf NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local NodeName=dgx*[*03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
@Keshav : It still doesn't work user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:h100:2 --pty bash srun: error: Unable to create step for job 107044: Invalid generic resource (gres) specification
Best, Lee
On Tue, Nov 25, 2025 at 12:49 PM Russell Jones via slurm-users < slurm-users@lists.schedmd.com> wrote:
NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2
CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
Just in case, that line shows you are missing a bracket in the node name. Are you *actually* missing the bracket?
On Tue, Nov 25, 2025 at 9:11 AM Lee via slurm-users < slurm-users@lists.schedmd.com> wrote:
Hello,
Sorry for the delayed response, SC25 interfered with my schedule.
*Answers* :
Yes, dgx09 and all the others boot the same software images.
dgx09 and the other nodes mount a shared file system where Slurm is
installed, so /cm/shared/apps/slurm/23.02.6/lib64/slurm/gpu_nvml.so is the same for every node. I assume the library that is used for autodetection lives there. I also found a shared library /usr/lib/x86_64-linux-gnu/libnvml_injection.so.1.0 (within the software image). I checked the md5sum and it is the same on both dgx09 and a non-broken node.
`scontrol show config` is the same on dgx09 and a non-broken DGX.
The only meaningful difference between `scontrol show node` for dgx09
and dgx08 (a working node) is :
< Gres=gpu:*h100*:8(S:0-1)
Gres=gpu:*H100*:8(S:0-1)
- Yes, we've restarted slurmd and slurmctld several times, the behavior
persists. Of note, when I run `scontrol reconfigure`, the phantom allocated GPUs (see AllocTRES in original post) are cleared.
*Important Update :*
- We recently had another GPU tray replaced and now that DGX is
experiencing the same behavior as dgx09. I am more convinced that there is something subtle with how the hardware is being detected by Slurm.
Best regards, Lee
On Mon, Nov 17, 2025 at 4:06 PM Timony, Mick < michael_timony@hms.harvard.edu> wrote:
Hi Lee,
I manage a BCM cluster as well. Does DGX09 have the same disk image and libraries in place? Could the NVidia NVML library, used to auto-detect the GPU's, be a diff version and causing the case differences?
If you compare the output of scontrol show node dgx09 and another DGX node, do they look the same? Does scontrol show config look different on DGX09 and other nodes?
Have you restarted the Slurm controllers (slurmctld) and restarted slurmd on the compute nodes?
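Regarding the NVML-version question above, a quick way to compare driver and library versions across nodes is something like the following (a rough sketch; the gpu_nvml.so path is the one mentioned elsewhere in this thread, and passwordless root ssh from the head node is assumed) :
# report driver version / GPU name, and checksum the Slurm NVML plugin, on a working node and on dgx09
for n in dgx08 dgx09; do
  echo "== $n =="
  ssh "$n" 'nvidia-smi --query-gpu=driver_version,name --format=csv,noheader | sort -u'
  ssh "$n" 'md5sum /cm/shared/apps/slurm/23.02.6/lib64/slurm/gpu_nvml.so'
done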
Kind regards
-- Mick Timony Senior DevOps Engineer LASER, Longwood, & O2 Cluster Admin Harvard Medical School
--
*From:* Lee via slurm-users slurm-users@lists.schedmd.com
*Sent:* Friday, November 14, 2025 7:17 AM
*To:* John Hearns hearnsj@gmail.com
*Cc:* slurm-users@lists.schedmd.com slurm-users@lists.schedmd.com
*Subject:* [slurm-users] Re: Invalid generic resource (gres) specification after RMA
Hello,
Thank you for the suggestion.
I ran lspci on dgx09 and a working DGX and the output was identical.
nvidia-smi shows all 8 GPUs and looks the same as the output from a working DGX :
root@dgx09:~# nvidia-smi Fri Nov 14 07:11:05 2025
+---------------------------------------------------------------------------------------+ | NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. |
|=========================================+======================+======================| | 0 NVIDIA H100 80GB HBM3 On | 00000000:1B:00.0 Off | 0 | | N/A 29C P0 69W / 700W | 4MiB / 81559MiB | 0% Default | | | | Disabled |
+-----------------------------------------+----------------------+----------------------+ | 1 NVIDIA H100 80GB HBM3 On | 00000000:43:00.0 Off | 0 | | N/A 30C P0 71W / 700W | 4MiB / 81559MiB | 0% Default | | | | Disabled |
+-----------------------------------------+----------------------+----------------------+ | 2 NVIDIA H100 80GB HBM3 On | 00000000:52:00.0 Off | 0 | | N/A 33C P0 71W / 700W | 4MiB / 81559MiB | 0% Default | | | | Disabled |
+-----------------------------------------+----------------------+----------------------+ | 3 NVIDIA H100 80GB HBM3 On | 00000000:61:00.0 Off | 0 | | N/A 31C P0 73W / 700W | 4MiB / 81559MiB | 0% Default | | | | Disabled |
+-----------------------------------------+----------------------+----------------------+ | 4 NVIDIA H100 80GB HBM3 On | 00000000:9D:00.0 Off | 0 | | N/A 29C P0 68W / 700W | 4MiB / 81559MiB | 0% Default | | | | Disabled |
+-----------------------------------------+----------------------+----------------------+ | 5 NVIDIA H100 80GB HBM3 On | 00000000:C3:00.0 Off | 0 | | N/A 28C P0 69W / 700W | 4MiB / 81559MiB | 0% Default | | | | Disabled |
+-----------------------------------------+----------------------+----------------------+ | 6 NVIDIA H100 80GB HBM3 On | 00000000:D1:00.0 Off | 0 | | N/A 30C P0 70W / 700W | 4MiB / 81559MiB | 0% Default | | | | Disabled |
+-----------------------------------------+----------------------+----------------------+ | 7 NVIDIA H100 80GB HBM3 On | 00000000:DF:00.0 Off | 0 | | N/A 32C P0 69W / 700W | 4MiB / 81559MiB | 0% Default | | | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage |
|=======================================================================================| | No running processes found |
+---------------------------------------------------------------------------------------+
Best regards, Lee
On Fri, Nov 14, 2025 at 3:53 AM John Hearns hearnsj@gmail.com wrote:
I work for AMD... the diagnostics I would run are lspci and nvidia-smi.
On Thu, 13 Nov 2025 at 19:18, Lee via slurm-users < slurm-users@lists.schedmd.com> wrote:
Good afternoon,
I have a cluster that is managed by Base Command Manager (v10) and it has several Nvidia DGXs. dgx09 is a problem child. The entire node was RMA'd and now it no longer behaves the same as my other DGXs. I think the below symptoms are caused by a single underlying issue.
*Symptoms : *
- When I look at our 8 non-MIG DGXs, via `scontrol show node=dgxXY |
grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09 reports "Gres=gpu:*h100*:8(S:0-1)"
- When I submit a job to this node, I get :
$ srun --reservation=g09_test --gres=gpu:2 --pty bash srun: error: Unable to create step for job 105035: Invalid generic resource (gres) specification
### No job is running on the node, yet AllocTRES shows consumed resources... $ scontrol show node=dgx09 | grep -i AllocTRES *AllocTRES=gres/gpu=2*
### dgx09 : /var/log/slurmd contains no information ### slurmctld shows : root@h01:# grep 105035 /var/log/slurmctld [2025-11-13T07:44:56.380] sched: _slurm_rpc_allocate_resources JobId=105035 NodeList=dgx09 usec=3420 [2025-11-13T07:44:56.482] _job_complete: JobId=105035 WTERMSIG 1 [2025-11-13T07:44:56.483] _job_complete: JobId=105035 done
*Configuration : *
- gres.conf :
# This section of this file was automatically generated by cmd. Do not edit manually! # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE AutoDetect=NVML NodeName=dgx[01,02] Name=gpu Type=1g.20gb Count=32 AutoDetect=NVML NodeName=dgx[03-10] Name=gpu Type=h100 Count=8 AutoDetect=NVML # END AUTOGENERATED SECTION -- DO NOT REMOVE
- grep NodeName slurm.conf
root@h01:# grep NodeName slurm.conf NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
- What slurmd detects on dgx09
root@dgx09:~# slurmd -C NodeName=dgx09 CPUs=224 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 RealMemory=2063937 UpTime=8-00:39:10
root@dgx09:~# slurmd -G slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
*Questions : *
- As far as I can tell, dgx09 is identical to all my non-MIG DGX nodes
in terms of configuration and hardware. Why does scontrol report it having 'h100' with a lower case 'h' unlike the other dgxs which report with an upper case 'H'?
- Why is dgx09 not accepting GPU jobs and afterwards it artificially
thinks that there are GPUs allocated even though no jobs are on the node?
- Are there additional tests / configurations that I can do to probe
the differences between dgx09 and all my other nodes?
Best regards, Lee
Hello,
1. Output from `scontrol show node=dgx09` :
user@l01:~$ scontrol show node=dgx09
NodeName=dgx09 Arch=x86_64 CoresPerSocket=56
   CPUAlloc=0 CPUEfctv=224 CPUTot=224 CPULoad=0.98
   AvailableFeatures=location=local
   ActiveFeatures=location=local
   Gres=gpu:h100:8(S:0-1)
   NodeAddr=dgx09 NodeHostName=dgx09 Version=23.02.6
   OS=Linux 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023
   RealMemory=2063937 AllocMem=0 FreeMem=2033902 Sockets=2 Boards=1
   MemSpecLimit=30017
   State=IDLE+RESERVED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2025-11-04T13:57:26 SlurmdStartTime=2025-11-05T15:40:46
   LastBusyTime=2025-11-25T13:07:36 ResumeAfterTime=None
   CfgTRES=cpu=224,mem=2063937M,billing=448,gres/gpu=8
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   ReservationName=g09_test
2. I don't see any errors in slurmctld related to dgx09. When I submit a job :
user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:2 --pty bash
srun: error: Unable to create step for job 108596: Invalid generic resource (gres) specification
slurmctld shows :
[2025-11-26T10:57:42.592] sched: _slurm_rpc_allocate_resources JobId=108596 NodeList=dgx09 usec=1495
[2025-11-26T10:57:42.695] _job_complete: JobId=108596 WTERMSIG 1
[2025-11-26T10:57:42.695] _job_complete: JobId=108596 done
3. Grep'ing for the jobid and for errors in dgx09:/var/log/slurmd returns nothing, i.e.
root@dgx09:~# grep -i error /var/log/slurmd    # no output
root@dgx09:~# grep -i 108596 /var/log/slurmd   # no output
Looking at journalctl :
root@dgx09:~# journalctl -fu slurmd.service
Nov 26 10:57:33 dgx09 slurmd[1751949]: slurmd: Resource spec: system cgroup memory limit set to 30017 MB
Nov 26 10:57:34 dgx09 slurmd[1751949]: slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
Best, Lee
Unfortunately I don't know what your issue is, but I'm inclined to think it might be something odd with your reservation. Adding the output of `scontrol show reservation g09_test` might be helpful to others.
Also, if you haven't already, you might want to try increasing the debug logs to the max (something might be getting lost in the logs) and add -vvv to the srun.
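For anyone trying this, the usual knobs look roughly like the following (a sketch: setdebug and setdebugflags are standard scontrol subcommands, but pick debug levels that suit your log volume, and revert afterwards) :
# raise slurmctld logging at runtime and enable GRES-specific debug output
scontrol setdebug debug3
scontrol setdebugflags +Gres
# re-run the failing request with client-side verbosity
srun -vvv --reservation=g09_test --nodelist=dgx09 --gres=gpu:2 --pty bash
# on dgx09 itself, slurmd can also be run in the foreground with extra verbosity
systemctl stop slurmd && slurmd -D -vvvv
# revert when done
scontrol setdebugflags -Gres
scontrol setdebug info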
On 11/13/25 2:16 pm, Lee via slurm-users wrote:
- When I look at our 8 non-MIG DGXs, via `scontrol show node=dgxXY |
grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09 reports "Gres=gpu:*h100*:8(S:0-1)"
Two thoughts:
1) Looking at the 24.11 code, when it's using NVML to get the names everything gets lowercased - so I wonder if these new ones are getting correctly discovered by NVML but the older ones are not, and so are using the uppercase values in your config?
gpu_common_underscorify_tolower(device_name);
I would suggest making sure the GPU names are lower-cased everywhere for consistency.
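As a concrete sketch of that suggestion (a hypothetical edit: only the Type case changes, the rest of the line is as posted earlier in the thread), the slurm.conf entry for the non-MIG nodes would become :
NodeName=dgx[03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:h100:8 Feature=location=local
with gres.conf already using Type=h100, followed by restarting slurmctld and the slurmds (or `scontrol reconfigure`) so the new type name is picked up; GPU requests would then be spelled --gres=gpu:h100:N.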
2) From memory (away from work at the moment) slurmd caches hwloc library information in an XML file - you might want to go and find that on an older and a newer node and compare them to see if you see the same difference there. It could also be interesting to stop slurmd on an older node, move that XML file out of the way, start slurmd again, and see whether that changes how it reports the node.
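A rough sketch of that comparison, assuming the spool path that appears later in this thread (check SlurmdSpoolDir in `scontrol show config` first) :
# compare the cached hwloc topology between a working node and dgx09
diff <(ssh dgx08 cat /cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml) \
     <(ssh dgx09 cat /cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml)
# force an older node to regenerate its cached topology
systemctl stop slurmd
mv /cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml /root/hwloc_topo_whole.xml.bak
systemctl start slurmd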
Also I saw you posted "slurmd -G" on the new one, could you post that from an older one too please?
Best of luck, Chris
Yes, I agree about the reservation; that was the next thing I was about to focus on.
Please do show your res config.
Hello,
*@Reed* - Great suggestion. I do see a variety of different "Board Part Number" values, but I don't see a correlation between the Board Part Number and whether a DGX works or not.
*@Russel, @Michael* - The behavior still exists even when the reservation is removed. I added the reservation to keep user production work off the node while still being able to debug dgx09. For completeness, here is the reservation :
$ scontrol show reservation
ReservationName=g09_test StartTime=2025-11-04T13:23:47 EndTime=2026-11-04T13:23:47 Duration=365-00:00:00
   Nodes=dgx09 NodeCnt=1 CoreCnt=112 Features=(null) PartitionName=(null) Flags=SPEC_NODES TRES=cpu=224
   Users=user Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)
*@Christopher* - I am running Slurm version 23.02.6. Regarding making sure that the GPU names are the same, I ran `slurmd -G` on dgx[03-09] and wrote the output of each to a file. I then diffed the output from each of dgx[03-08] against dgx09's: they are identical. Reposting the output from `slurmd -G` on dgx[03-09] :
$ slurmd -G
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
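For reference, the per-node comparison described above can be scripted roughly like this (a sketch assuming root ssh from the head node; adjust hostnames and the output path as needed) :
for n in dgx03 dgx04 dgx05 dgx06 dgx07 dgx08 dgx09; do
  ssh "$n" 'slurmd -G' > /tmp/slurmd-G.$n 2>&1
done
for n in dgx03 dgx04 dgx05 dgx06 dgx07 dgx08; do
  diff /tmp/slurmd-G.$n /tmp/slurmd-G.dgx09 && echo "$n matches dgx09"
done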
*@Christopher* - I tried copying /cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml from a working node to dgx09 and restarted slurmd on dgx09. When I submitted a job requesting a GPU, I got the same error :
srun: error: Unable to create step for job 113424: Invalid generic resource (gres) specification
*@Michael* - Running srun with -vvv :
$ srun -vvv --gres=gpu:1 --reservation=g09_test --pty bash
srun: defined options
srun: -------------------- --------------------
srun: gres : gres:gpu:1
srun: pty :
srun: reservation : g09_test
srun: verbose : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=18446744073709551615
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=2061374
srun: debug: propagating RLIMIT_NOFILE=131072
srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 44393
srun: debug: Entering _msg_thr_internal
srun: Waiting for resource configuration
srun: Nodes dgx09 are ready for job
srun: jobid 113415: nodes(1):`dgx09', cpu counts: 2(x1)
srun: debug2: creating job with 1 tasks
srun: debug2: cpu:2 is not a gres:
srun: debug: requesting job 113415, user 99, nodes 1 including ((null))
srun: debug: cpus 2, tasks 1, name bash, relative 65534
srun: error: Unable to create step for job 113415: Invalid generic resource (gres) specification
srun: debug2: eio_message_socket_accept: got message connection from 148.117.15.76:51912 6
Best, Lee