Hello,
I am trying to rewrite my gres.conf file.
Before the changes, the file looked like this:

NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1 Cores=12-23
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1 Cores=12-23
NodeName=node-gpu-3 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-11
NodeName=node-gpu-4 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-7
# you can see that nodes node-gpu-1 and node-gpu-2 have two GPUs each, whereas nodes node-gpu-3 and node-gpu-4 have only one GPU each
And my slurmd.conf was this:

[...]
AccountingStorageTRES=gres/gpu
GresTypes=gpu
NodeName=node-gpu-1 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1
NodeName=node-gpu-2 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1
NodeName=node-gpu-3 CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=23000 Gres=gpu:GeForceRTX3080:1
NodeName=node-gpu-4 CPUs=8 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7800 Gres=gpu:GeForceRTX3080:1
NodeName=node-worker-[0-22] CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=47000
[...]
With this configuration, everything seems to work fine, except that slurmctld.log reports:

[...]
error: _node_config_validate: gres/gpu: invalid GRES core specification (0-11) on node node-gpu-3
error: _node_config_validate: gres/gpu: invalid GRES core specification (12-23) on node node-gpu-1
error: _node_config_validate: gres/gpu: invalid GRES core specification (12-23) on node node-gpu-2
error: _node_config_validate: gres/gpu: invalid GRES core specification (0-7) on node node-gpu-4
[...]
However, despite these errors, users can submit jobs and request GPU resources.
Now, I have tried to reconfigure gres.conf and slurmd.conf in this way:

gres.conf:

Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0
Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1
Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0
Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1
Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
# there is no NodeName attribute
slurmd.conf:

[...]
NodeName=node-gpu-1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1
NodeName=node-gpu-2 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1
NodeName=node-gpu-3 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=23000 Gres=gpu:GeForceRTX3080:1
NodeName=node-gpu-4 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7800 Gres=gpu:GeForceRTX3080:1
NodeName=node-worker-[0-22] SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=47000
# there is no CPUs attribute
[...]
With this new configuration, the GPU nodes start the slurmd.service daemon correctly, but the nodes without a GPU (node-worker-[0-22]) cannot start it and report this error:

[...]
error: Waiting for gres.conf file /dev/nvidia0
fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory
[...]
It seems SLURM expects the "node-worker" nodes to have an NVIDIA GPU as well, but these nodes have no GPU... So, where is my configuration error?
I have read about the syntax and the examples at https://slurm.schedmd.com/gres.conf.html, but it seems I'm doing something wrong.
Thanks!!