[slurm-users] gres/gpu count reported lower than configured
Geleßus, Achim
A.Gelessus at jacobs-university.de
Fri Oct 21 13:38:39 UTC 2022
Hello Slurm Admins,
I have set up Slurm on a GPU cluster. The basic installation without
gres/gpu works well. Now I am trying to add the GPUs to the Slurm configuration.
All attempts have failed so far, and sinfo -R always reports
gres/gpu count reported lower than configured (0 < 2)
nvidia-smi detects the GPUs, and running jobs on them works well.
I have tried to get rid of the above error by updating the node state to
IDLE with scontrol. That attempt also failed, with the error message
slurm_update error: Invalid node state specified
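For reference, the scontrol attempt looked roughly like this (sketched from
memory, run on the controller, using the node name from slurm.conf below):

scontrol update NodeName=hpc-node14 State=IDLE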
I ran slurmd on the GPU node at debug5 level. From slurmd.log I can see that
gres.conf is found and that gres_gpu.so / gpu_generic.so are loaded.
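In case it matters, I started the daemon roughly like this (foreground with
extra verbosity; setting SlurmdDebug=debug5 in slurm.conf gives the same log level):

slurmd -D -vvvvv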
My Slurm configuration is as follows:
slurm.conf:
GresTypes=gpu
NodeName=hpc-node14 CPUs=128 RealMemory=515815 Sockets=2 CoresPerSocket=64 ThreadsPerCore=1 Gres=gpu:2 State=UNKNOWN
gres.conf:
NodeName=hpc-node[01-14] Name=gpu File=/dev/nvidia[0-1]
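In case it helps with diagnosis, these are the checks I can run (illustrative
commands, on hpc-node14 and on the controller respectively):

slurmd -C                          # prints the hardware configuration slurmd detects on the node
scontrol show node hpc-node14      # shows the Gres= line as seen by slurmctld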
Does anyone know what is wrong and how to fix that problem?
Thank you.
Best wishes
Achim