Hello,

@Reed - Great suggestion. I do see several different "Board Part Number" values, but I don't see any correlation between the Board Part Number and whether a DGX works or not.

@Russell, @Michael - The behavior persists even when the reservation is removed. I added the reservation to keep user production work off the node while still being able to debug dgx09. For completeness, here is the reservation:

$ scontrol show reservation
ReservationName=g09_test StartTime=2025-11-04T13:23:47 EndTime=2026-11-04T13:23:47 Duration=365-00:00:00
Nodes=dgx09 NodeCnt=1 CoreCnt=112 Features=(null) PartitionName=(null) Flags=SPEC_NODES
TRES=cpu=224
Users=user Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)

@Christopher - I am running Slurm version 23.02.6. To check that the GPU names are the same, I ran `slurmd -G` on dgx[03-09] and saved each node's output to a file. I then diffed the output from each of dgx[03-08] against the output from dgx09. They are all identical. Here again is the output of `slurmd -G`, which is the same on dgx[03-09]:

$ slurmd -G
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
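
The comparison described above can be sketched roughly as follows. This is only an illustration, not necessarily the exact commands used: it assumes the `slurmd -G` output of each node was saved as one file per node in a directory (for example gres_out/dgx03.txt ... gres_out/dgx09.txt), and both that layout and the function name compare_gres_outputs are hypothetical.

```shell
#!/usr/bin/env bash
# Compare saved `slurmd -G` outputs against a reference node.
# Assumed (hypothetical) layout: <dir>/<node>.txt, one file per node.
# Usage: compare_gres_outputs <dir> <reference-node>
compare_gres_outputs() {
    local dir="$1" ref="$2" status=0 f
    # Bail out early if the reference file is missing.
    [ -f "$dir/$ref.txt" ] || { echo "missing reference $dir/$ref.txt" >&2; return 2; }
    for f in "$dir"/*.txt; do
        # Skip the reference file itself.
        [ "$f" = "$dir/$ref.txt" ] && continue
        if diff -q "$dir/$ref.txt" "$f" >/dev/null; then
            echo "$(basename "$f" .txt): identical to $ref"
        else
            echo "$(basename "$f" .txt): DIFFERS from $ref"
            status=1
        fi
    done
    # 0 if every file matched the reference, 1 if at least one differed.
    return "$status"
}
```

Calling `compare_gres_outputs gres_out dgx09` would then print one line per node and return non-zero if any node's output differs from dgx09's.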

@Christopher - I copied /cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml from a working node to dgx09 and restarted slurmd on dgx09. When I then submitted a job requesting a GPU, I got the same error:

srun: error: Unable to create step for job 113424: Invalid generic resource (gres) specification

@Michael - Here is the output of srun with -vvv:

$ srun -vvv --gres=gpu:1 --reservation=g09_test --pty bash
srun: defined options
srun: -------------------- --------------------
srun: gres : gres:gpu:1
srun: pty :
srun: reservation : g09_test
srun: verbose : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=18446744073709551615
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=2061374
srun: debug: propagating RLIMIT_NOFILE=131072
srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 44393
srun: debug: Entering _msg_thr_internal
srun: Waiting for resource configuration
srun: Nodes dgx09 are ready for job
srun: jobid 113415: nodes(1):`dgx09', cpu counts: 2(x1)
srun: debug2: creating job with 1 tasks
srun: debug2: cpu:2 is not a gres:
srun: debug: requesting job 113415, user 99, nodes 1 including ((null))
srun: debug: cpus 2, tasks 1, name bash, relative 65534
srun: error: Unable to create step for job 113415: Invalid generic resource (gres) specification
srun: debug2: eio_message_socket_accept: got message connection from 148.117.15.76:51912 6

Best,

Lee

On Wed, Nov 26, 2025 at 7:33 PM Russell Jones via slurm-users <slurm-users@lists.schedmd.com> wrote:

Yes, I agree about the reservation; that was the next thing I was about to focus on.

Please do show your res config.

On Wed, Nov 26, 2025, 3:26 PM Christopher Samuel via slurm-users <slurm-users@lists.schedmd.com> wrote:
On 11/13/25 2:16 pm, Lee via slurm-users wrote:

> 1. When I look at our 8 non-MIG DGXs, via `scontrol show node=dgxXY |
> grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09
> reports "Gres=gpu:*h100*:8(S:0-1)"

Two thoughts:

1) Looking at the 24.11 code, when it uses NVML to get the names
everything gets lowercased - so I wonder if the new ones are being
correctly discovered by NVML while the older ones are not, and so are
using the uppercase values in your config?

gpu_common_underscorify_tolower(device_name);

I would suggest making sure the GPU names are lower-cased everywhere for
consistency.

2) From memory (away from work at the moment) slurmd caches hwloc
library information in an XML file - you might want to go and find that
on an older and newer node and compare those to see if you see the same
difference there. It could be interesting to see whether stopping slurmd on
an older node, moving that XML file out of the way, and starting slurmd
again changes how it reports the node.
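
The first suggestion above (lower-case GPU names everywhere) could be checked with something like the sketch below. It is an illustration under assumptions: the function name find_uppercase_gpu_types is hypothetical, and it only looks for `gpu:<type>` tokens of the kind shown in the `Gres=` strings, so `Type=` lines in gres.conf would need a separate check.

```shell
#!/usr/bin/env bash
# Read text on stdin and print each distinct gpu:<type> token whose type
# contains at least one uppercase letter (e.g. gpu:H100 but not gpu:h100).
# A bare count such as gpu:8 has no letters and is never reported.
find_uppercase_gpu_types() {
    LC_ALL=C grep -oE 'gpu:[A-Za-z0-9_.-]*[A-Z][A-Za-z0-9_.-]*' | sort -u
}
```

For example, `scontrol show node dgx09 | find_uppercase_gpu_types` prints nothing when the node's GPU type is already lower-case.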

Also I saw you posted "slurmd -G" on the new one, could you post that
from an older one too please?

Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
