Hello,
I am trying to rewrite my gres.conf file.
Before the changes, the file looked like this:
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1 Cores=12-23
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1 Cores=12-23
NodeName=node-gpu-3 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-4 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-7
# Note that node-gpu-1 and node-gpu-2 have two GPUs each, whereas node-gpu-3 and node-gpu-4 have only one GPU each.
And my slurm.conf was this:
[...]
AccountingStorageTRES=gres/gpu
GresTypes=gpu
NodeName=node-gpu-1 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1
NodeName=node-gpu-2 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1
NodeName=node-gpu-3 CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=23000 Gres=gpu:GeForceRTX3080:1
NodeName=node-gpu-4 CPUs=8 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7800 Gres=gpu:GeForceRTX3080:1
NodeName=node-worker-[0-22] CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=47000
[...]
With this configuration, everything seems to work fine, except that slurmctld.log reports:
[...]
error: _node_config_validate: gres/gpu: invalid GRES core specification (0-11) on node node-gpu-3
error: _node_config_validate: gres/gpu: invalid GRES core specification (12-23) on node node-gpu-1
error: _node_config_validate: gres/gpu: invalid GRES core specification (12-23) on node node-gpu-2
error: _node_config_validate: gres/gpu: invalid GRES core specification (0-7) on node node-gpu-4
[...]
However, despite these errors, users can submit jobs and request GPU resources.
Now I have tried to reconfigure gres.conf and slurm.conf as follows:
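One possible reading of these errors, offered as an assumption rather than a confirmed diagnosis: Cores= in gres.conf takes core indices, not hardware-thread (CPU) numbers. With ThreadsPerCore=2, node-gpu-1 and node-gpu-2 have 2 x 6 = 12 cores (indices 0-11), node-gpu-3 has 6 (indices 0-5) and node-gpu-4 has 4 (indices 0-3), which would be consistent with the ranges that were rejected. A sketch of the same file with the ranges adjusted under that assumption (binding one socket to each GPU on the two-socket nodes is hypothetical, not something I have verified on the hardware):

```
# Assumption: Cores= counts cores, so every range stays below Sockets x CoresPerSocket.
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0 Cores=0-5
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1 Cores=6-11
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0 Cores=0-5
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1 Cores=6-11
NodeName=node-gpu-3 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-5
NodeName=node-gpu-4 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-3
```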
gres.conf:
Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0
Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1
Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0
Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1
Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
# there is no NodeName attribute
slurm.conf:
[...]
NodeName=node-gpu-1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1
NodeName=node-gpu-2 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1
NodeName=node-gpu-3 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=23000 Gres=gpu:GeForceRTX3080:1
NodeName=node-gpu-4 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7800 Gres=gpu:GeForceRTX3080:1
NodeName=node-worker-[0-22] SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=47000
# there is no CPUs attribute
[...]
With this new configuration, the GPU nodes start the slurmd.service daemon correctly, but the nodes without a GPU (node-worker-[0-22]) cannot start slurmd.service and return this error:
[...]
error: Waiting for gres.conf file /dev/nvidia0
fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory
[...]
It seems Slurm expects the node-worker nodes to have an NVIDIA GPU as well, but these nodes have no GPU. So, where is my configuration error?
I have read the syntax and examples at
https://slurm.schedmd.com/gres.conf.html but it seems I am doing something wrong.
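A minimal sketch of one arrangement that might avoid this, assuming the same gres.conf is distributed to every node: a line without NodeName applies to whichever node reads the file, so the worker nodes also look for /dev/nvidia0. Keeping NodeName on each line should limit the GPU entries to the GPU nodes (the Cores= attribute is left out here, as in my new configuration):

```
# Assumption: one shared gres.conf; NodeName scopes each line to a single node.
NodeName=node-gpu-1 Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0
NodeName=node-gpu-1 Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1
NodeName=node-gpu-2 Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0
NodeName=node-gpu-2 Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1
NodeName=node-gpu-3 Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
NodeName=node-gpu-4 Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
# No lines for node-worker-[0-22]: they define no Gres in slurm.conf.
```

The other option I can think of would be a separate gres.conf per node, with the worker nodes getting an empty file, but I have not tried either one yet.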
Thanks!!