Hello,
*@Reed* - Great suggestion. I do see a variety of different "Board Part Number" values, but I don't see a correlation between the Board Part Number and whether a DGX works or not.
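For reference, a rough sketch of how the part numbers can be collected across the nodes (assuming passwordless SSH; the hostnames and the shape of the loop are just illustrative):

for h in dgx0{3..9}; do
    echo "== $h =="
    # "Board Part Number" is one of the per-GPU fields in the full nvidia-smi query output
    ssh "$h" 'nvidia-smi -q | grep "Board Part Number" | sort -u'
done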
*@Russel, @Michael* - The behavior still exists even when the reservation is removed. I added the reservation to keep user production work from landing on the node while still being able to debug dgx09. For completeness, here is the reservation:
$ scontrol show reservation
ReservationName=g09_test StartTime=2025-11-04T13:23:47 EndTime=2026-11-04T13:23:47 Duration=365-00:00:00
   Nodes=dgx09 NodeCnt=1 CoreCnt=112 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=224
   Users=user Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null)
   Watts=n/a MaxStartDelay=(null)
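In case it helps anyone reading along, a reservation like this can be created roughly as follows (the exact command I ran may have differed slightly):

$ scontrol create reservation ReservationName=g09_test StartTime=now \
      Duration=365-00:00:00 Nodes=dgx09 Users=user Flags=SPEC_NODES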
*@Christopher* - I am running Slurm version 23.02.6. Regarding making sure that the GPU names are the same, I ran `slurmd -G` on dgx[03-09] and wrote the output of each to a file. I then diffed the output from each of dgx[03-08] against dgx09's output; they are identical. Reposting the output of `slurmd -G` on dgx[03-09]:
$ slurmd -G
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
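For anyone wanting to repeat the comparison, a rough sketch (assumes passwordless SSH and that `slurmd -G` can be run as root on each node; the file names are just illustrative):

for h in dgx0{3..9}; do
    # capture both stdout and stderr to be safe about where slurmd writes its log lines
    ssh "$h" 'slurmd -G' > /tmp/gres_"$h".txt 2>&1
done
for h in dgx0{3..8}; do
    diff /tmp/gres_"$h".txt /tmp/gres_dgx09.txt && echo "$h matches dgx09"
done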
*@Christopher* - I tried copying /cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml from a working node to dgx09 and restarted slurmd on dgx09. When I submitted a job requesting a GPU, I got the same error:
srun: error: Unable to create step for job 113424: Invalid generic resource (gres) specification
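The steps were roughly the following (dgx08 below stands in for whichever working node the file is copied from, and I'm assuming slurmd runs as the usual systemd service named slurmd; adjust for your setup):

# on dgx09, as root
systemctl stop slurmd
mv /cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml{,.bak}
scp dgx08:/cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml \
    /cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml
systemctl start slurmd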
*@Michael* - Running srun with -vvv:
$ srun -vvv --gres=gpu:1 --reservation=g09_test --pty bash
srun: defined options
srun: -------------------- --------------------
srun: gres                : gres:gpu:1
srun: pty                 :
srun: reservation         : g09_test
srun: verbose             : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=18446744073709551615
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=2061374
srun: debug: propagating RLIMIT_NOFILE=131072
srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 44393
srun: debug: Entering _msg_thr_internal
srun: Waiting for resource configuration
srun: Nodes dgx09 are ready for job
srun: jobid 113415: nodes(1):`dgx09', cpu counts: 2(x1)
srun: debug2: creating job with 1 tasks
srun: debug2: cpu:2 is not a gres:
srun: debug: requesting job 113415, user 99, nodes 1 including ((null))
srun: debug: cpus 2, tasks 1, name bash, relative 65534
srun: error: Unable to create step for job 113415: Invalid generic resource (gres) specification
srun: debug2: eio_message_socket_accept: got message connection from 148.117.15.76:51912 6
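(In case it's useful for narrowing this down, the typed GRES can also be requested explicitly in both spellings, to see whether the H100/h100 case difference is what the controller rejects; I haven't run these yet:)

$ srun --gres=gpu:h100:1 --reservation=g09_test --pty bash   # lower-case type, as dgx09 reports it
$ srun --gres=gpu:H100:1 --reservation=g09_test --pty bash   # upper-case type, as dgx[03-08] report it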
Best,
Lee
On Wed, Nov 26, 2025 at 7:33 PM Russell Jones via slurm-users <slurm-users@lists.schedmd.com> wrote:
Yes I agree about the reservation, that was the next thing I was about to focus on.....
Please do show your res config.
On Wed, Nov 26, 2025, 3:26 PM Christopher Samuel via slurm-users <slurm-users@lists.schedmd.com> wrote:
On 11/13/25 2:16 pm, Lee via slurm-users wrote:
- When I look at our 8 non-MIG DGXs, via `scontrol show node=dgxXY | grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09 reports "Gres=gpu:*h100*:8(S:0-1)"
Two thoughts:
- Looking at the 24.11 code, when it's using NVML to get the names everything gets lowercased:
  gpu_common_underscorify_tolower(device_name);
so I wonder if these new ones are getting correctly discovered by NVML but the older ones are not, and so are using the uppercase values in your config? I would suggest making sure the GPU names are lower-cased everywhere for consistency (see the config sketch after these two thoughts).
- From memory (I'm away from work at the moment) slurmd caches hwloc library information in an XML file - you might want to find that file on an older node and a newer node and compare them to see if you see the same difference there. It could also be interesting to stop slurmd on an older node, move that XML file out of the way, and start slurmd again to see whether that changes how the node is reported.
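A minimal sketch of what consistently lower-cased GPU type names could look like in slurm.conf and gres.conf (the actual configuration of this cluster isn't shown in the thread, so treat the node and device names below as placeholders):

slurm.conf:
    NodeName=dgx[03-09] Gres=gpu:h100:8

gres.conf on each DGX (either autodetect via NVML, or explicit declarations):
    AutoDetect=nvml
    # or:
    Name=gpu Type=h100 File=/dev/nvidia[0-7]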
Also, I saw you posted "slurmd -G" output from the new one - could you post that from an older one too, please?
Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com