Hello,
Thank you for the suggestion.
I ran lspci on dgx09 and a working DGX and the output was identical.
nvidia-smi shows all 8 GPUs and looks the same as the output from a working DGX:
root@dgx09:~# nvidia-smi
Fri Nov 14 07:11:05 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1B:00.0 Off |                    0 |
| N/A   29C    P0             69W / 700W  |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:43:00.0 Off |                    0 |
| N/A   30C    P0             71W / 700W  |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                    0 |
| N/A   33C    P0             71W / 700W  |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:61:00.0 Off |                    0 |
| N/A   31C    P0             73W / 700W  |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:9D:00.0 Off |                    0 |
| N/A   29C    P0             68W / 700W  |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:C3:00.0 Off |                    0 |
| N/A   28C    P0             69W / 700W  |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:D1:00.0 Off |                    0 |
| N/A   30C    P0             70W / 700W  |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                    0 |
| N/A   32C    P0             69W / 700W  |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+---------------------------------------------------------------------------------------+
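For completeness, the comparison itself was nothing elaborate; roughly the following, where dgx08 is just a placeholder for whichever known-good DGX is used as the reference:

# hostnames are examples only; compare dgx09 against any healthy node
diff <(ssh dgx08 lspci) <(ssh dgx09 lspci)
diff <(ssh dgx08 nvidia-smi --query-gpu=name,pci.bus_id,driver_version --format=csv) \
     <(ssh dgx09 nvidia-smi --query-gpu=name,pci.bus_id,driver_version --format=csv)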
Best regards, Lee
On Fri, Nov 14, 2025 at 3:53 AM John Hearns hearnsj@gmail.com wrote:
I work for AMD... the diagnostics I would run are lspci and nvidia-smi.
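For example, something as simple as this on the node itself (the grep pattern is only an illustration):

lspci | grep -i nvidia   # do all of the GPUs show up on the PCI bus?
nvidia-smi               # does the driver enumerate all 8 GPUs?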
On Thu, 13 Nov 2025 at 19:18, Lee via slurm-users < slurm-users@lists.schedmd.com> wrote:
Good afternoon,
I have a cluster managed by Base Command Manager (v10) with several NVIDIA DGXs. dgx09 is a problem child: the entire node was RMA'd, and it no longer behaves the same as my other DGXs. I think the symptoms below are caused by a single underlying issue.
*Symptoms:*
- When I look at our 8 non-MIG DGXs via `scontrol show node=dgxXY | grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09 reports "Gres=gpu:*h100*:8(S:0-1)" (a quick side-by-side check is sketched just after this list).
- When I submit a job to this node, I get:

$ srun --reservation=g09_test --gres=gpu:2 --pty bash
srun: error: Unable to create step for job 105035: Invalid generic resource (gres) specification

### No job is running on the node, yet AllocTRES shows consumed resources...
$ scontrol show node=dgx09 | grep -i AllocTRES
*AllocTRES=gres/gpu=2*

### dgx09: /var/log/slurmd contains no information
### slurmctld shows:
root@h01:# grep 105035 /var/log/slurmctld
[2025-11-13T07:44:56.380] sched: _slurm_rpc_allocate_resources JobId=105035 NodeList=dgx09 usec=3420
[2025-11-13T07:44:56.482] _job_complete: JobId=105035 WTERMSIG 1
[2025-11-13T07:44:56.483] _job_complete: JobId=105035 done
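The side-by-side check mentioned in the first symptom is just a small loop like this (dgx03 is only an example of one of the working nodes):

for n in dgx03 dgx09; do
    echo "== $n =="
    scontrol show node=$n | grep -iE 'gres|alloctres'
done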
*Configuration:*
- gres.conf:

# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
AutoDetect=NVML
NodeName=dgx[01,02] Name=gpu Type=1g.20gb Count=32 AutoDetect=NVML
NodeName=dgx[03-10] Name=gpu Type=h100 Count=8 AutoDetect=NVML
# END AUTOGENERATED SECTION -- DO NOT REMOVE
- grep NodeName slurm.conf:

root@h01:# grep NodeName slurm.conf
NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
- What slurmd detects on dgx09:

root@dgx09:~# slurmd -C
NodeName=dgx09 CPUs=224 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 RealMemory=2063937
UpTime=8-00:39:10

root@dgx09:~# slurmd -G
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
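For reference, the same kind of side-by-side comparison can be done on what slurmd autodetects, running as root the way the output above was gathered (dgx03 again is just an example of a healthy node):

# sketch only; hostnames are placeholders
ssh root@dgx03 'slurmd -G' 2>&1 | grep 'Gres Name'
ssh root@dgx09 'slurmd -G' 2>&1 | grep 'Gres Name'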
*Questions:*
- As far as I can tell, dgx09 is identical to all my non-MIG DGX nodes in terms of configuration and hardware. Why does scontrol report it as having 'h100' with a lower-case 'h', unlike the other DGXs, which report an upper-case 'H'?
- Why is dgx09 not accepting GPU jobs, and why does it afterwards think that GPUs are allocated even though no jobs are running on the node?
- Are there additional tests or configuration checks I can run to probe the differences between dgx09 and my other nodes?
Best regards, Lee
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com