[slurm-users] gres/gpu count reported lower than configured

Geleßus, Achim <A.Gelessus at jacobs-university.de>
Fri Oct 21 13:38:39 UTC 2022


Hello Slurm Admins,

I have set up Slurm for a GPU cluster. The basic installation without
gres/gpu works well. Now I am trying to add the GPUs to the Slurm
configuration. All attempts have failed so far, and sinfo -R always reports
the message

gres/gpu count reported lower than configured ( 0 < 2 )
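
In case it helps, I check the node state together with the reason roughly like
this (the format string is just what I happen to use):

sinfo -R -N -o "%N %t %E"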

nvidia-smi finds the GPUs, and running jobs on them works well.
I have tried to get rid of the above error by updating the node state to IDLE
with scontrol. That attempt also failed, with the error message

slurm_update error: Invalid node state specified
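
For completeness, the update I attempted looked roughly like this (hpc-node14
is the GPU node from my slurm.conf below):

scontrol update NodeName=hpc-node14 State=IDLE

and I check the node's current state and reason with

scontrol show node hpc-node14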

I ran slurmd on the GPU node at debug5 level. From slurmd.log I can see that
gres.conf is found and that gres_gpu.so / gpu_generic.so are loaded.
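
For reference, I raised the slurmd log level roughly like this (the log file
path is from my setup and may differ elsewhere):

SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurm/slurmd.log

or, alternatively, by running slurmd in the foreground and repeating -v to
increase verbosity:

slurmd -D -v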

My Slurm configuration is as follows:

slurm.conf:
GresTypes=gpu
NodeName=hpc-node14 CPUs=128 RealMemory=515815 Sockets=2 CoresPerSocket=64 ThreadsPerCore=1 Gres=gpu:2 State=UNKNOWN

gres.conf:
NodeName=hpc-node[01-14] Name=gpu File=/dev/nvidia[0-1]
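
For reference, the same entry written out per device (equivalent, as far as I
understand, to the bracket form above) for the GPU node would be:

NodeName=hpc-node14 Name=gpu File=/dev/nvidia0
NodeName=hpc-node14 Name=gpu File=/dev/nvidia1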

Does anyone know what is wrong here and how to fix it?
Thank you.


Best wishes
Achim

