Hello,
*@Reed* - Great suggestion. I do see a variety of different "Board Part Number" values, but I don't see a correlation between the Board Part Number and whether a DGX works or not.
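For reference, a rough sketch of how the part numbers can be collected across the nodes (assuming passwordless SSH; the hostnames and the shape of the loop are just illustrative):

for h in dgx0{3..9}; do
    echo "== $h =="
    # "Board Part Number" is one of the per-GPU fields in the full nvidia-smi query output
    ssh "$h" 'nvidia-smi -q | grep "Board Part Number" | sort -u'
done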
*@Russel, @Michael* - The behavior still exists even when the reservation is removed. I added the reservation to keep user production work from landing on the node while still being able to debug dgx09. For completeness, here is the reservation:
$ scontrol show reservation
ReservationName=g09_test StartTime=2025-11-04T13:23:47 EndTime=2026-11-04T13:23:47 Duration=365-00:00:00
   Nodes=dgx09 NodeCnt=1 CoreCnt=112 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=224
   Users=user Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null)
   Watts=n/a MaxStartDelay=(null)
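In case it helps anyone reading along, a reservation like this can be created roughly as follows (the exact command I ran may have differed slightly):

$ scontrol create reservation ReservationName=g09_test StartTime=now \
      Duration=365-00:00:00 Nodes=dgx09 Users=user Flags=SPEC_NODES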
*@Christopher* - I am running Slurm version 23.02.6. Regarding making sure that the GPU names are the same, I ran `slurmd -G` on dgx[03-09] and wrote the output of each to a file. I then diffed the output from each of dgx[03-08] against dgx09's output; they are identical. Reposting the output of `slurmd -G` on dgx[03-09]:
$ slurmd -G
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
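For anyone wanting to repeat the comparison, a rough sketch (assumes passwordless SSH and that `slurmd -G` can be run as root on each node; the file names are just illustrative):

for h in dgx0{3..9}; do
    # capture both stdout and stderr to be safe about where slurmd writes its log lines
    ssh "$h" 'slurmd -G' > /tmp/gres_"$h".txt 2>&1
done
for h in dgx0{3..8}; do
    diff /tmp/gres_"$h".txt /tmp/gres_dgx09.txt && echo "$h matches dgx09"
done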
*@Christopher* - I tried copying /cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml from a working node to dgx09 and restarted slurmd on dgx09. When I submitted a job requesting a GPU, I got the same error:
srun: error: Unable to create step for job 113424: Invalid generic resource (gres) specification
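The steps were roughly the following (dgx08 below stands in for whichever working node the file is copied from, and I'm assuming slurmd runs as the usual systemd service named slurmd; adjust for your setup):

# on dgx09, as root
systemctl stop slurmd
mv /cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml{,.bak}
scp dgx08:/cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml \
    /cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml
systemctl start slurmd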
*@Michael* - Running srun with -vvv:
$ srun -vvv --gres=gpu:1 --reservation=g09_test --pty bash
srun: defined options
srun: -------------------- --------------------
srun: gres                : gres:gpu:1
srun: pty                 :
srun: reservation         : g09_test
srun: verbose             : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=18446744073709551615
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=2061374
srun: debug: propagating RLIMIT_NOFILE=131072
srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 44393
srun: debug: Entering _msg_thr_internal
srun: Waiting for resource configuration
srun: Nodes dgx09 are ready for job
srun: jobid 113415: nodes(1):`dgx09', cpu counts: 2(x1)
srun: debug2: creating job with 1 tasks
srun: debug2: cpu:2 is not a gres:
srun: debug: requesting job 113415, user 99, nodes 1 including ((null))
srun: debug: cpus 2, tasks 1, name bash, relative 65534
srun: error: Unable to create step for job 113415: Invalid generic resource (gres) specification
srun: debug2: eio_message_socket_accept: got message connection from 148.117.15.76:51912 6
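(In case it's useful for narrowing this down, the typed GRES can also be requested explicitly in both spellings, to see whether the H100/h100 case difference is what the controller rejects; I haven't run these yet:)

$ srun --gres=gpu:h100:1 --reservation=g09_test --pty bash   # lower-case type, as dgx09 reports it
$ srun --gres=gpu:H100:1 --reservation=g09_test --pty bash   # upper-case type, as dgx[03-08] report it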
Best,
Lee
On Wed, Nov 26, 2025 at 7:33 PM Russell Jones via slurm-users <slurm-users@lists.schedmd.com> wrote:
Yes I agree about the reservation, that was the next thing I was about to focus on.....
Please do show your res config.
On Wed, Nov 26, 2025, 3:26 PM Christopher Samuel via slurm-users <slurm-users@lists.schedmd.com> wrote:
On 11/13/25 2:16 pm, Lee via slurm-users wrote:
- When I look at our 8 non-MIG DGXs, via `scontrol show node=dgxXY | grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09 reports "Gres=gpu:*h100*:8(S:0-1)"
Two thoughts:
- Looking at the 24.11 code, when it's using NVML to get the names everything gets lowercased:
  gpu_common_underscorify_tolower(device_name);
so I wonder if these new ones are getting correctly discovered by NVML but the older ones are not, and so are using the uppercase values in your config? I would suggest making sure the GPU names are lower-cased everywhere for consistency (see the config sketch after these two thoughts).
- From memory (I'm away from work at the moment) slurmd caches hwloc library information in an XML file - you might want to find that file on an older node and a newer node and compare them to see if you see the same difference there. It could also be interesting to stop slurmd on an older node, move that XML file out of the way, and start slurmd again to see whether that changes how the node is reported.
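A minimal sketch of what consistently lower-cased GPU type names could look like in slurm.conf and gres.conf (the actual configuration of this cluster isn't shown in the thread, so treat the node and device names below as placeholders):

slurm.conf:
    NodeName=dgx[03-09] Gres=gpu:h100:8

gres.conf on each DGX (either autodetect via NVML, or explicit declarations):
    AutoDetect=nvml
    # or:
    Name=gpu Type=h100 File=/dev/nvidia[0-7]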
Also, I saw you posted "slurmd -G" output from the new one - could you post that from an older one too, please?
Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com