Good afternoon,
I have a cluster managed by Base Command Manager (v10) that contains several NVIDIA DGX nodes. dgx09 is a problem child: the entire node was RMA'd, and it no longer behaves the same as my other DGXs. I suspect the symptoms below are all caused by a single underlying issue.
*Symptoms:*

1. When I look at our 8 non-MIG DGXs via `scontrol show node=dgxXY | grep Gres`, 7 of the 8 report "Gres=gpu:*H100*:8(S:0-1)", while dgx09 reports "Gres=gpu:*h100*:8(S:0-1)" (a quick cross-node check is sketched below the log excerpt).
2. When I submit a job to this node, I get:

$ srun --reservation=g09_test --gres=gpu:2 --pty bash
srun: error: Unable to create step for job 105035: Invalid generic resource (gres) specification
### No job is running on the node, yet AllocTRES shows consumed resources...
$ scontrol show node=dgx09 | grep -i AllocTRES
   *AllocTRES=gres/gpu=2*
### dgx09: /var/log/slurmd contains no information
### slurmctld shows:
root@h01:# grep 105035 /var/log/slurmctld
[2025-11-13T07:44:56.380] sched: _slurm_rpc_allocate_resources JobId=105035 NodeList=dgx09 usec=3420
[2025-11-13T07:44:56.482] _job_complete: JobId=105035 WTERMSIG 1
[2025-11-13T07:44:56.483] _job_complete: JobId=105035 done
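For reference, this is the quick-and-dirty check I use to compare the GRES type Slurm reports across the non-MIG nodes, and to see whether the controller still thinks GPUs are in use on dgx09. The loop and node range reflect my own cluster layout, not anything generated by BCM:

# Compare the reported GRES string on dgx03-dgx10:
$ for n in $(seq -w 3 10); do echo -n "dgx$n: "; scontrol show node=dgx$n | grep Gres=; done

# Detailed per-node GRES view; GresUsed should be 0 when nothing is running:
$ scontrol -d show node=dgx09 | grep -iE 'Gres|AllocTRES'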
*Configuration:*

1. gres.conf:

# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
AutoDetect=NVML
NodeName=dgx[01,02] Name=gpu Type=1g.20gb Count=32 AutoDetect=NVML
NodeName=dgx[03-10] Name=gpu Type=h100 Count=8 AutoDetect=NVML
# END AUTOGENERATED SECTION -- DO NOT REMOVE
2. The NodeName lines in slurm.conf:

root@h01:# grep NodeName slurm.conf
NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
NodeName=dgx[03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
3. What slurmd detects on dgx09:

root@dgx09:~# slurmd -C
NodeName=dgx09 CPUs=224 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 RealMemory=2063937
UpTime=8-00:39:10
root@dgx09:~# slurmd -G
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
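Because the Type string comes from NVML autodetection, I also plan to compare what the driver itself reports on dgx09 against a healthy node, in case the RMA'd board identifies itself differently. Picking dgx05 as the reference node is arbitrary on my part:

# On dgx09 and on a known-good node (e.g. dgx05), compare the product name,
# driver version and VBIOS reported by the driver:
$ nvidia-smi --query-gpu=index,name,driver_version,vbios_version --format=csv
$ nvidia-smi -L

# And compare slurmd's autodetection on the healthy node with the dgx09 output above:
$ slurmd -G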
*Questions:*

1. As far as I can tell, dgx09 is identical to my other non-MIG DGX nodes in both configuration and hardware. Why does scontrol report it as having 'h100' with a lowercase 'h', while the other DGXs report an uppercase 'H100'?
2. Why is dgx09 not accepting GPU jobs, and why does it afterwards report GPUs as allocated even though no jobs are running on the node?
3. Are there additional tests or configuration checks I can run to probe the differences between dgx09 and my other nodes? (The checks I am already planning are sketched below.)
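For completeness, these are the additional probes I intend to run on dgx09 and on one healthy node for comparison. The config paths below are placeholders for wherever BCM actually deploys the files on the nodes:

# Confirm both nodes run the same Slurm version:
$ slurmd -V

# Confirm both nodes see identical slurm.conf / gres.conf files (paths are placeholders):
$ md5sum /etc/slurm/slurm.conf /etc/slurm/gres.conf

# Check what accounting recorded for the failed job:
$ sacct -j 105035 --format=JobID,State,ExitCode,AllocTRES%40,NodeList

# Run slurmd in the foreground at debug verbosity on dgx09 while resubmitting the
# test job (I would drain the node first):
root@dgx09:~# systemctl stop slurmd && slurmd -D -vvv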
Best regards,
Lee