Hello,

 

I am trying to rewrite my gres.conf file.

 

Before the changes, the file looked like this:

NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0 Cores=0-11

NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1 Cores=12-23

NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0 Cores=0-11

NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1 Cores=12-23

NodeName=node-gpu-3 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-11

NodeName=node-gpu-4 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-7

# As you can see, node-gpu-1 and node-gpu-2 have two GPUs each, whereas node-gpu-3 and node-gpu-4 have only one GPU each.

 

 

And my slurmd.conf looked like this:

[...]

AccountingStorageTRES=gres/gpu

GresTypes=gpu

NodeName=node-gpu-1 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1

NodeName=node-gpu-2 CPUs=24 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1

NodeName=node-gpu-3 CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=23000 Gres=gpu:GeForceRTX3080:1

NodeName=node-gpu-4 CPUs=8 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7800 Gres=gpu:GeForceRTX3080:1

NodeName=node-worker-[0-22] CPUs=12 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=47000

[...]

 

With this configuration, everything seems to work fine, except that slurmctld.log reports:

[...]

error: _node_config_validate: gres/gpu: invalid GRES core specification (0-11) on node node-gpu-3

error: _node_config_validate: gres/gpu: invalid GRES core specification (12-23) on node node-gpu-1

error: _node_config_validate: gres/gpu: invalid GRES core specification (12-23) on node node-gpu-2

error: _node_config_validate: gres/gpu: invalid GRES core specification (0-7) on node node-gpu-4

[...]

 

However, despite these errors, users can submit jobs and request GPU resources.
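
A possible reading of these errors, sketched under the assumption that Cores in gres.conf takes logical core indexes rather than hardware-thread (CPU) numbers: node-gpu-1 and node-gpu-2 have 2 x 6 = 12 cores, so only indexes 0-11 exist; node-gpu-3 has 6 cores (0-5) and node-gpu-4 has 4 cores (0-3). That would explain why 0-11 on node-gpu-1 and node-gpu-2 is the only range that is not rejected. The GPU-to-socket mapping below is a guess (first GPU on the first socket, second GPU on the second) and would need checking on the real hardware, for example with nvidia-smi topo -m:

```conf
# Sketch only: Cores ranges rewritten as core indexes (Sockets x CoresPerSocket).
# The socket each GPU is attached to is an assumption, not taken from the hardware.
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0 Cores=0-5
NodeName=node-gpu-1 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1 Cores=6-11
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0 Cores=0-5
NodeName=node-gpu-2 AutoDetect=off Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1 Cores=6-11
NodeName=node-gpu-3 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-5
NodeName=node-gpu-4 AutoDetect=off Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0 Cores=0-3
```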

 

 

 

Now I have tried to reconfigure gres.conf and slurmd.conf as follows:

gres.conf:

Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0

Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1

Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0

Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1

Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0

Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0

# there is no NodeName attribute

 

slurmd.conf:

[...]

NodeName=node-gpu-1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceRTX2070:1,gpu:GeForceGTX1080Ti:1

NodeName=node-gpu-2 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96000 TmpDisk=47000 Gres=gpu:GeForceGTX1080Ti:1,gpu:GeForceGTX1080:1

NodeName=node-gpu-3 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=23000 Gres=gpu:GeForceRTX3080:1

NodeName=node-gpu-4 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7800 Gres=gpu:GeForceRTX3080:1

NodeName=node-worker-[0-22] SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=47000

# there is no CPUs attribute

[...]

 

 

With this new configuration, slurmd.service starts correctly on the nodes with GPUs, but on the nodes without a GPU (node-worker-[0-22]) it fails to start and reports this error:

[...]

error: Waiting for gres.conf file /dev/nvidia0

fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory

[...]

 

It seems SLURM expects the “node-worker” nodes to have an NVIDIA GPU as well, but they don’t: these nodes have no GPU. So, where is my configuration error?
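
One possible explanation, going by the NodeName description in the gres.conf documentation: a line without NodeName applies to every node that reads the file, so slurmd on node-worker-[0-22] also looks for /dev/nvidia0. A minimal sketch of a shared gres.conf that keeps NodeName on every line (assuming the same file is distributed to all nodes; Cores is left out here, and the node-range form on the last line is just shorthand for the two RTX 3080 nodes):

```conf
# Sketch only: every GPU line is scoped to the node(s) that actually have the device.
NodeName=node-gpu-1 Name=gpu Type=GeForceRTX2070 File=/dev/nvidia0
NodeName=node-gpu-1 Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia1
NodeName=node-gpu-2 Name=gpu Type=GeForceGTX1080Ti File=/dev/nvidia0
NodeName=node-gpu-2 Name=gpu Type=GeForceGTX1080 File=/dev/nvidia1
NodeName=node-gpu-[3-4] Name=gpu Type=GeForceRTX3080 File=/dev/nvidia0
```

With this layout the worker nodes match no line in gres.conf, so they should not try to open any /dev/nvidia* device.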

 

I have read about the syntax and examples at https://slurm.schedmd.com/gres.conf.html, but it seems I’m doing something wrong.

 

Thanks!!