_node_config_validate: gres/gpu: Count changed on node (0 != 2)
Hey Slurm-users list,

while our regular GPU nodes are working fine, our on-demand GPU nodes have a weird issue. They power up, I can SSH onto them and execute nvidia-smi without issue, but they are marked invalid and slurmctld logs:

_node_config_validate: gres/gpu: Count changed on node (0 != 2)

However, scontrol show node shows that the GPUs are recognized, the gres.conf is stored on the worker nodes as expected, and the node entries in slurm.conf are fine, too:

# slurm.conf
NodeName=my_worker_node SocketsPerBoard=16 CoresPerSocket=1 RealMemory=64075 MemSpecLimit=4000 State=CLOUD Gres=gpu:L4:2 # openstack

# gres.conf on my_worker_node
ubuntu@my_node:~$ cat /etc/slurm/gres.conf
# GRES CONFIG
Name=gpu Type=L4 File=/dev/nvidia0
Name=gpu Type=L4 File=/dev/nvidia1

Thankful for any ideas and debugging hints.

Best,
Xaver

PS: By executing:

sudo scontrol update NodeName=$(bibiname 0) Gres=
sudo scontrol reconfigure
sudo scontrol update NodeName=$(bibiname 0) state=RESUME reason=None

the node can be resumed. However, this is not a real solution.
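As a debugging aid (an editor's sketch, not from the thread): it can help to compare the GRES count slurmctld expects with what slurmd detects on the node. `scontrol show node` on the controller and `slurmd -G` on the worker show the two sides; the parsing below only illustrates extracting the count from a Gres= string, using the sample value from Xaver's slurm.conf.

```shell
# On the controller: what slurmctld thinks the node should have
#   scontrol show node my_worker_node | grep -o 'Gres=[^ ]*'
# On the worker node: what slurmd actually detects at startup
#   sudo slurmd -G
# Illustrative parse of the configured count (sample value from slurm.conf above):
line='Gres=gpu:L4:2'
count=${line##*:}   # strip everything up to the last ':' -> the GRES count
echo "configured GPU count: $count"
```

If the two counts differ (here: 2 configured vs. 0 detected), the node is flagged invalid exactly as in the log line above.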
Hey Slurm-users list,

meanwhile I was able to find:

[2026-03-23T12:58:16.105] debug: gres/gpu: init: loaded
[2026-03-23T12:58:16.105] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2026-03-23T12:58:16.105] warning: Ignoring file-less GPU gpu:L4 from final GRES list
[2026-03-23T12:58:16.105] debug: skipping GRES for NodeName=my_worker_node Name=gpu File=/dev/nvidia0
[2026-03-23T12:58:16.105] debug: skipping GRES for NodeName=my_worker_node Name=gpu File=/dev/nvidia1
[2026-03-23T12:58:16.105] debug: gres/gpu: init: loaded
[2026-03-23T12:58:16.106] debug: skipping GRES for NodeName=my_worker_node Name=gpu File=/dev/nvidia0
[2026-03-23T12:58:16.106] debug: skipping GRES for NodeName=my_worker_node Name=gpu File=/dev/nvidia1
[2026-03-23T12:58:16.106] debug: gres/gpu: init: loaded

so I am wondering whether that is the issue. I also noticed that after powering up the node without requesting a GPU (which works), subsequently scheduling a job to that node by requesting a GPU is not an issue.

Best,
Xaver

On 3/20/26 17:21, Xaver Stiensmeier wrote:
[...]
Hi everyone,

On 3/23/26 14:11, Xaver Stiensmeier via slurm-users wrote:
[...]
so I am wondering whether that is the issue. I also noticed that after powering up the node without requesting a gpu (works), scheduling to the node by requesting a GPU is not an issue.
[...]

We noticed this as well: after powering up a node, the GPU device files (/dev/nvidia*) are not created (immediately).

What we did: we changed the slurmd.service file and added

ExecStartPre=-/path/to/nvidia-smi -L

to the [Service] section. This creates the device files, and a failure (e.g. on non-GPU nodes) is ignored by systemd (due to the "-" before the command).

Maybe this helps?

Kind regards,
Hermann
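Hermann's tweak can also be applied as a systemd drop-in instead of editing the packaged unit file, so it survives package updates. A minimal sketch; the nvidia-smi path and drop-in filename are assumptions to adapt to your installation:

```
# /etc/systemd/system/slurmd.service.d/gpu-devices.conf
# (create e.g. via: sudo systemctl edit slurmd)
[Service]
# Leading "-": systemd ignores a non-zero exit status (e.g. on non-GPU nodes)
ExecStartPre=-/usr/bin/nvidia-smi -L
```

If you create the file by hand rather than via `systemctl edit`, run `sudo systemctl daemon-reload` before restarting slurmd.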
Hey,

I am not 100% sure yet, as that needs further testing (in case it is a race condition), but I think I was able to fix my issue by using the NodeName gres.conf format and supplying that file to every node, instead of placing node-specific gres.conf files only on the nodes that have GRES.

Best,
Xaver

On 3/23/26 15:13, Hermann Schwärzler via slurm-users wrote:
[...]
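For reference, the NodeName-format gres.conf Xaver describes might look like the following; a sketch assuming the node and device names from the thread (the same file is distributed to every node, GPU or not):

```
# /etc/slurm/gres.conf (identical copy on all nodes)
NodeName=my_worker_node Name=gpu Type=L4 File=/dev/nvidia[0-1]
```

Nodes not named in any line simply have no GRES, so non-GPU nodes can carry the same file unchanged.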
Participants (2): Hermann Schwärzler, Xaver Stiensmeier