Hello,
1. Output from `scontrol show node=dgx09`

user@l01:~$ scontrol show node=dgx09
NodeName=dgx09 Arch=x86_64 CoresPerSocket=56
   CPUAlloc=0 CPUEfctv=224 CPUTot=224 CPULoad=0.98
   AvailableFeatures=location=local
   ActiveFeatures=location=local
   Gres=gpu:h100:8(S:0-1)
   NodeAddr=dgx09 NodeHostName=dgx09 Version=23.02.6
   OS=Linux 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023
   RealMemory=2063937 AllocMem=0 FreeMem=2033902 Sockets=2 Boards=1
   MemSpecLimit=30017
   State=IDLE+RESERVED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2025-11-04T13:57:26 SlurmdStartTime=2025-11-05T15:40:46
   LastBusyTime=2025-11-25T13:07:36 ResumeAfterTime=None
   CfgTRES=cpu=224,mem=2063937M,billing=448,gres/gpu=8
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   ReservationName=g09_test
2. I don't see any errors in slurmctld related to dgx09. When I submit a job:
user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:2 --pty bash
srun: error: Unable to create step for job 108596: Invalid generic resource (gres) specification
slurmctld shows:
[2025-11-26T10:57:42.592] sched: _slurm_rpc_allocate_resources JobId=108596 NodeList=dgx09 usec=1495
[2025-11-26T10:57:42.695] _job_complete: JobId=108596 WTERMSIG 1
[2025-11-26T10:57:42.695] _job_complete: JobId=108596 done
3. Grep'ing for the jobid and for errors on dgx09:/var/log/slurmd returns nothing, i.e.
root@dgx09:~# grep -i error /var/log/slurmd
# no output
root@dgx09:~# grep -i 108596 /var/log/slurmd
# no output
Looking at journalctl:
root@dgx09:~# journalctl -fu slurmd.service
Nov 26 10:57:33 dgx09 slurmd[1751949]: slurmd: Resource spec: system cgroup memory limit set to 30017 MB
Nov 26 10:57:34 dgx09 slurmd[1751949]: slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
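If it would help, one more thing I can try from the head node is temporarily turning on gres debugging in slurmctld and resubmitting. Roughly (untested on our setup; the log path is our local one):

scontrol setdebugflags +Gres          # enable gres debug messages in slurmctld
srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:2 --pty bash   # reproduce the failure
grep -i gres /var/log/slurmctld       # look for the extra gres messages
scontrol setdebugflags -Gres          # turn the flag back off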
Best, Lee
On Tue, Nov 25, 2025 at 1:24 PM Russell Jones via slurm-users <slurm-users@lists.schedmd.com> wrote:
Can you give the output of "scontrol show node dgx09" ?
Are there any errors in your slurmctld.log?
Are there any errors in slurmd.log on dgx09 node?
On Tue, Nov 25, 2025 at 12:13 PM Lee <leewithemily@gmail.com> wrote:
Hello,
@Russell - good catch. No, I'm not actually missing the square bracket; it got lost during the copy/paste. I'll restate it below for clarity:

2. grep NodeName slurm.conf
root@h01:# grep NodeName slurm.conf
NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
NodeName=dgx[03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
@Keshav: It still doesn't work:
user@l01:~$ srun --reservation=g09_test --nodelist=dgx09 --gres=gpu:h100:2 --pty bash
srun: error: Unable to create step for job 107044: Invalid generic resource (gres) specification
Best, Lee
On Tue, Nov 25, 2025 at 12:49 PM Russell Jones via slurm-users <slurm-users@lists.schedmd.com> wrote:
NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2
CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
Just in case, that line shows you are missing a bracket in the node name. Are you *actually* missing the bracket?
On Tue, Nov 25, 2025 at 9:11 AM Lee via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hello,
Sorry for the delayed response; SC25 interfered with my schedule.
*Answers* :
- Yes, dgx09 and all the others boot the same software images.
- dgx09 and the other nodes mount a shared file system where Slurm is installed, so /cm/shared/apps/slurm/23.02.6/lib64/slurm/gpu_nvml.so is the same for every node. I assume the library that is used for autodetection lives there. I also found a shared library /usr/lib/x86_64-linux-gnu/libnvml_injection.so.1.0 (within the software image). I checked the md5sum and it is the same on both dgx09 and a non-broken node.
- `scontrol show config` is the same on dgx09 and a non-broken DGX (roughly how I ran this comparison and the md5sum check is sketched after this list).
- The only meaningful difference between `scontrol show node` for dgx09 and dgx08 (a working node) is:
  < Gres=gpu:*h100*:8(S:0-1)
  > Gres=gpu:*H100*:8(S:0-1)
- Yes, we've restarted slurmd and slurmctld several times; the behavior persists. Of note, when I run `scontrol reconfigure`, the phantom allocated GPUs (see AllocTRES in the original post) are cleared.
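For reference, the library and config comparisons above were run roughly like this from the head node (dgx08 standing in for a working node; assumes root ssh to the compute nodes):

# compare the NVML injection library between a working node and dgx09
ssh dgx08 md5sum /usr/lib/x86_64-linux-gnu/libnvml_injection.so.1.0
ssh dgx09 md5sum /usr/lib/x86_64-linux-gnu/libnvml_injection.so.1.0

# compare what each node reports as the running Slurm configuration
diff <(ssh dgx08 scontrol show config) <(ssh dgx09 scontrol show config)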
*Important Update :*
- We recently had another GPU tray replaced, and now that DGX is experiencing the same behavior as dgx09. This makes me more convinced that there is something subtle in how Slurm is detecting the hardware.
Best regards, Lee
On Mon, Nov 17, 2025 at 4:06 PM Timony, Mick <michael_timony@hms.harvard.edu> wrote:
Hi Lee,
I manage a BCM cluster as well. Does DGX09 have the same disk image and libraries in place? Could the NVIDIA NVML library, used to auto-detect the GPUs, be a different version and be causing the case difference?
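One quick way to compare might be to run something like this on DGX09 and a known-good node and diff the results (paths assume the standard Ubuntu/DGX OS layout):

ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*     # NVML library shipped with the driver
dpkg -l | grep -i -E 'nvml|nvidia-driver'            # packaged driver/NVML versions
cat /proc/driver/nvidia/version                      # kernel module / driver build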
If you compare the output of scontrol show node dgx09 and another DGX node, do they look the same? Does scontrol show config look different on DGX09 and other nodes?
Have you restarted the Slurm controllers (slurmctld) and restarted slurmd on the compute nodes?
Kind regards
--
Mick Timony
Senior DevOps Engineer
LASER, Longwood, & O2 Cluster Admin
Harvard Medical School
--
*From:* Lee via slurm-users <slurm-users@lists.schedmd.com>
*Sent:* Friday, November 14, 2025 7:17 AM
*To:* John Hearns <hearnsj@gmail.com>
*Cc:* slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
*Subject:* [slurm-users] Re: Invalid generic resource (gres) specification after RMA
Hello,
Thank you for the suggestion.
I ran lspci on dgx09 and a working DGX and the output was identical.
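For completeness, the comparison was along these lines, run from the head node (dgx08 used here only as the working reference; any working DGX would do):

diff <(ssh dgx08 "lspci | grep -i nvidia") <(ssh dgx09 "lspci | grep -i nvidia")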
nvidia-smi shows all 8 GPUs and looks the same as the output from a working DGX :
root@dgx09:~# nvidia-smi
Fri Nov 14 07:11:05 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1B:00.0 Off |                    0 |
| N/A   29C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:43:00.0 Off |                    0 |
| N/A   30C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                    0 |
| N/A   33C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:61:00.0 Off |                    0 |
| N/A   31C    P0              73W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:9D:00.0 Off |                    0 |
| N/A   29C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:C3:00.0 Off |                    0 |
| N/A   28C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:D1:00.0 Off |                    0 |
| N/A   30C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                    0 |
| N/A   32C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Best regards, Lee
On Fri, Nov 14, 2025 at 3:53 AM John Hearns <hearnsj@gmail.com> wrote:
I work for AMD... the diagnostics I would run are lspci and nvidia-smi.
On Thu, 13 Nov 2025 at 19:18, Lee via slurm-users <slurm-users@lists.schedmd.com> wrote:
Good afternoon,
I have a cluster managed by Base Command Manager (v10) with several NVIDIA DGXs. dgx09 is a problem child: the entire node was RMA'd, and it no longer behaves the same as my other DGXs. I believe the symptoms below are caused by a single underlying issue.
*Symptoms : *
- When I look at our 8 non-MIG DGXs via `scontrol show node=dgxXY | grep Gres`, 7/8 DGXs report "Gres=gpu:*H100*:8(S:0-1)" while dgx09 reports "Gres=gpu:*h100*:8(S:0-1)"
- When I submit a job to this node, I get :
$ srun --reservation=g09_test --gres=gpu:2 --pty bash
srun: error: Unable to create step for job 105035: Invalid generic resource (gres) specification
### No job is running on the node, yet AllocTRES shows consumed resources...
$ scontrol show node=dgx09 | grep -i AllocTRES
*AllocTRES=gres/gpu=2*
### dgx09: /var/log/slurmd contains no information
### slurmctld shows:
root@h01:# grep 105035 /var/log/slurmctld
[2025-11-13T07:44:56.380] sched: _slurm_rpc_allocate_resources JobId=105035 NodeList=dgx09 usec=3420
[2025-11-13T07:44:56.482] _job_complete: JobId=105035 WTERMSIG 1
[2025-11-13T07:44:56.483] _job_complete: JobId=105035 done
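In case it is useful, this is roughly how I confirm the node is empty while AllocTRES still shows GPUs consumed (squeue's -w flag filters by node):

$ squeue -w dgx09                                # no jobs listed for the node
$ scontrol show node=dgx09 | grep -i AllocTRES   # yet gres/gpu is still reported as allocated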
*Configuration : *
- gres.conf :
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
AutoDetect=NVML
NodeName=dgx[01,02] Name=gpu Type=1g.20gb Count=32 AutoDetect=NVML
NodeName=dgx[03-10] Name=gpu Type=h100 Count=8 AutoDetect=NVML
# END AUTOGENERATED SECTION -- DO NOT REMOVE
- grep NodeName slurm.conf
root@h01:# grep NodeName slurm.conf
NodeName=dgx[01,02] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:1g.20gb:32 Feature=location=local
NodeName=dgx03-10] RealMemory=2063937 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 MemSpecLimit=30017 Gres=gpu:H100:8 Feature=location=local
- What slurmd detects on dgx09
root@dgx09:~# slurmd -C
NodeName=dgx09 CPUs=224 Boards=1 SocketsPerBoard=2 CoresPerSocket=56 ThreadsPerCore=2 RealMemory=2063937
UpTime=8-00:39:10
root@dgx09:~# slurmd -G
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 8 GPU system device(s) detected
slurmd: Gres Name=gpu Type=h100 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Cores=0-55 CoreCnt=224 Links=-1,0,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Cores=0-55 CoreCnt=224 Links=0,-1,0,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=2 ID=7696487 File=/dev/nvidia2 Cores=0-55 CoreCnt=224 Links=0,0,-1,0,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=3 ID=7696487 File=/dev/nvidia3 Cores=0-55 CoreCnt=224 Links=0,0,0,-1,0,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=4 ID=7696487 File=/dev/nvidia4 Cores=56-111 CoreCnt=224 Links=0,0,0,0,-1,0,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=5 ID=7696487 File=/dev/nvidia5 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,-1,0,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=6 ID=7696487 File=/dev/nvidia6 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,-1,0 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
slurmd: Gres Name=gpu Type=h100 Count=1 Index=7 ID=7696487 File=/dev/nvidia7 Cores=56-111 CoreCnt=224 Links=0,0,0,0,0,0,0,-1 Flags=HAS_FILE,HAS_TYPE,ENV_NVML
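In case it is relevant to the case question, the raw device name that NVML reports can be compared between dgx09 and a working node with something like this (standard nvidia-smi query flags; dgx08 is just an example of a working node):

root@dgx09:~# nvidia-smi --query-gpu=name --format=csv,noheader | sort -u
root@dgx08:~# nvidia-smi --query-gpu=name --format=csv,noheader | sort -u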
*Questions : *
- As far as I can tell, dgx09 is identical to all my non-MIG DGX nodes in terms of configuration and hardware. Why does scontrol report it as having 'h100' with a lowercase 'h', unlike the other DGXs, which report an uppercase 'H'?
- Why is dgx09 not accepting GPU jobs, and why does it afterwards think that GPUs are allocated even though no jobs are on the node?
- Are there additional tests or configuration checks I can run to probe the differences between dgx09 and all my other nodes?
Best regards, Lee
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com