Hi,
I'm using Slurm on a small 8-node cluster. I've recently added one GPU node with two Nvidia A100 GPUs, one with 40 GB of memory and one with 80 GB.
As usage of this GPU resource increases, I would like to manage it
with GRES to avoid usage conflicts. But at the moment my setup
does not work, as I can reach a GPU without reserving it:
srun -n 1 -p tenibre-gpu ./a.out
can use a GPU even though the reservation does not request this
resource (checked by running nvidia-smi on the node).
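What I would like is that a GPU is only usable when it is requested explicitly, e.g. with something like the following (just a sketch of the intended usage; the type names are the ones from my gres.conf below):
srun -n 1 -p tenibre-gpu --gres=gpu:1 ./a.out
srun -n 1 -p tenibre-gpu --gres=gpu:A100-80:1 ./a.out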
"tenibre-gpu" is a slurm partition with only this gpu node.
Following the documentation, I have created a gres.conf file; it has been propagated to all the nodes (9 compute nodes, 1 login node and the management node) and slurmd has been restarted.
gres.conf is:
## GPU setup on tenibre-gpu-0
NodeName=tenibre-gpu-0 Name=gpu Type=A100-40 File=/dev/nvidia0 Flags=nvidia_gpu_env
NodeName=tenibre-gpu-0 Name=gpu Type=A100-80 File=/dev/nvidia1 Flags=nvidia_gpu_env
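For what it's worth, this is how I would check which GRES the controller actually registers for the node (commands only, I have not pasted the output here):
scontrol show node tenibre-gpu-0 | grep -i gres
sinfo -N -p tenibre-gpu -o "%N %G"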
In slurm.conf I have checked that the following settings are present:
## Basic scheduling
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
## Generic resources
GresTypes=gpu
## Nodes list
....
Nodename=tenibre-gpu-0 RealMemory=257270 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
....
#partitions
PartitionName=tenibre-gpu MaxTime=48:00:00 DefaultTime=12:00:00 DefMemPerCPU=4096 MaxMemPerCPU=8192 Shared=YES State=UP Nodes=tenibre-gpu-0
...
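In case it is relevant, this is roughly how I check which devices a job step sees from inside an allocation (assuming the gres/gpu plugin sets CUDA_VISIBLE_DEVICES when a GPU is requested, which I believe it does):
srun -n 1 -p tenibre-gpu env | grep CUDA_VISIBLE_DEVICES
srun -n 1 -p tenibre-gpu --gres=gpu:1 env | grep CUDA_VISIBLE_DEVICES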
Maybe I've missed something? I'm running Slurm 20.11.7-1.
Thanks for your advice.
Patrick