Hey Slurm-users list,

in the meantime I was able to find the following log entries:

[2026-03-23T12:58:16.105] debug:  gres/gpu: init: loaded
[2026-03-23T12:58:16.105] debug:  gpu/generic: init: init: GPU Generic plugin loaded
[2026-03-23T12:58:16.105] warning: Ignoring file-less GPU gpu:L4 from final GRES list
[2026-03-23T12:58:16.105] debug:  skipping GRES for NodeName=my_worker_node  Name=gpu File=/dev/nvidia0
[2026-03-23T12:58:16.105] debug:  skipping GRES for NodeName=my_worker_node  Name=gpu File=/dev/nvidia1
[2026-03-23T12:58:16.105] debug:  gres/gpu: init: loaded
[2026-03-23T12:58:16.106] debug:  skipping GRES for NodeName=my_worker_node  Name=gpu File=/dev/nvidia0
[2026-03-23T12:58:16.106] debug:  skipping GRES for NodeName=my_worker_node  Name=gpu File=/dev/nvidia1
[2026-03-23T12:58:16.106] debug:  gres/gpu: init: loaded

so I am wondering whether that is the issue. I also noticed that once a node has been powered up without requesting a GPU (which works), scheduling jobs onto it that do request a GPU is not a problem.

Best,
Xaver

On 3/20/26 17:21, Xaver Stiensmeier wrote:

Hey Slurm-users list,

while our regular GPU nodes are working fine, our on-demand GPU nodes have a weird issue. They power up, I can ssh onto them and run nvidia-smi without any problem, but they are marked invalid and slurmctld logs:

_node_config_validate: gres/gpu: Count changed on node (0 != 2)

however, scontrol show node shows that the GPUs are recognized, the gres.conf is stored on the worker nodes as expected, and the node entries in the slurm.conf are fine, too:

# slurm.conf
NodeName=my_worker_node SocketsPerBoard=16 CoresPerSocket=1 RealMemory=64075 MemSpecLimit=4000 State=CLOUD Gres=gpu:L4:2 # openstack

# gres.conf on my_worker_node
ubuntu@my_node:~$ cat /etc/slurm/gres.conf 
# GRES CONFIG
Name=gpu Type=L4 File=/dev/nvidia0
Name=gpu Type=L4 File=/dev/nvidia1
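
One thing that might be worth trying instead of listing the device files explicitly (just a sketch; it assumes slurmd on the worker nodes was built with NVML support) is letting slurmd autodetect the GPUs:

# gres.conf alternative (sketch; assumes slurmd built against NVML)
AutoDetect=nvml

With AutoDetect=nvml, slurmd queries the NVIDIA driver for the device files and GPU types itself, which can sidestep mismatches between a hand-written gres.conf and what slurmctld expects on cloud nodes.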

Thankful for any ideas and debugging tips.

Best,
Xaver

PS:
By executing:

sudo scontrol update NodeName=$(bibiname 0) Gres=
sudo scontrol reconfigure
sudo scontrol update NodeName=$(bibiname 0) state=RESUME reason=None

the node can be resumed. However, clearing the Gres definition like this is not a real solution.