Hello,
@Reed - Great suggestion. I do see several different "Board Part Number" values across the nodes, but I don't see any correlation between the Board Part Number and whether a DGX works or not.
@Russel, @Michael - The behavior persists even when the reservation is removed. I added the reservation to keep user production work from landing on the node while still letting me debug dgx09. For completeness, here is the reservation:
$ scontrol show reservation
ReservationName=g09_test StartTime=2025-11-04T13:23:47 EndTime=2026-11-04T13:23:47 Duration=365-00:00:00
Nodes=dgx09 NodeCnt=1 CoreCnt=112 Features=(null) PartitionName=(null) Flags=SPEC_NODES
TRES=cpu=224
Users=user Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
@Christopher - I am running Slurm version 23.02.6. To check that the GPU names are the same, I ran `slurmd -G` on dgx[03-09] and saved each node's output to a file. I then diffed the output from each of dgx[03-08] against the output from dgx09. They are all identical. Reposting the output of `slurmd -G`, which is the same on every node in dgx[03-09]:
$ slurmd -G
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
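For anyone who wants to repeat the comparison described above, here is a minimal sketch rather than the exact commands I ran. It assumes each node's `slurmd -G` output has already been saved to its own file; the `gres/` directory, the file names, and the `compare_gres` function are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compare saved `slurmd -G` outputs against one reference file.
# Assumed (hypothetical) layout: one file per node, e.g. gres/dgx03.txt,
# collected with something like:
#   ssh dgx03 'slurmd -G' > gres/dgx03.txt 2>&1
# (2>&1 is used in case the listing is written to stderr.)

# compare_gres REF FILE...
# Prints "same: FILE" or "DIFF: FILE" for each file compared with REF.
# Returns 0 if every file matches REF, 1 otherwise.
compare_gres() {
    ref="$1"
    shift
    status=0
    for f in "$@"; do
        if diff -q "$ref" "$f" >/dev/null; then
            echo "same: $f"
        else
            echo "DIFF: $f"
            status=1
        fi
    done
    return "$status"
}
```

With that layout, `compare_gres gres/dgx09.txt gres/dgx0[3-8].txt` would flag any node whose GRES listing differs from dgx09.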
@Christopher - I tried copying /cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml from a working node to dgx09 and then restarted slurmd on dgx09. When I submitted a job requesting a GPU, I got the same error:
srun: error: Unable to create step for job 113424: Invalid generic resource (gres) specification
@Michael - Here is the output of srun with -vvv:
$ srun -vvv --gres=gpu:1 --reservation=g09_test --pty bash
srun: defined options
srun: -------------------- --------------------
srun: gres : gres:gpu:1
srun: pty :
srun: reservation : g09_test
srun: verbose : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=18446744073709551615
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=2061374
srun: debug: propagating RLIMIT_NOFILE=131072
srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 44393
srun: debug: Entering _msg_thr_internal
srun: Waiting for resource configuration
srun: Nodes dgx09 are ready for job
srun: jobid 113415: nodes(1):`dgx09', cpu counts: 2(x1)
srun: debug2: creating job with 1 tasks
srun: debug2: cpu:2 is not a gres:
srun: debug: requesting job 113415, user 99, nodes 1 including ((null))
srun: debug: cpus 2, tasks 1, name bash, relative 65534
srun: error: Unable to create step for job 113415: Invalid generic resource (gres) specification
srun: debug2: eio_message_socket_accept: got message connection from 148.117.15.76:51912 6
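Since most of the verbose output above is rlimit propagation, a small filter can help when comparing runs. This is only a sketch: it assumes the srun output was saved to a file first, and the `gres_lines` function and the `srun_vvv.log` file name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: keep only the error lines and the lines that mention gres
# from a saved `srun -vvv` log. Reads the named files, or stdin if
# no file is given.
gres_lines() {
    grep -E 'error:|gres' "$@"
}
```

Running `gres_lines srun_vvv.log` on the log above would keep the `gres : gres:gpu:1` option line, the `cpu:2 is not a gres:` line, and the final `Invalid generic resource (gres) specification` error.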
Best,
Lee