Hi,
I'm using Slurm on a small 8-node cluster. I've recently added one GPU node with two NVIDIA A100 cards, one with 40 GB of memory and one with 80 GB.
As usage of this GPU resource increases, I would like to manage it with GRES to avoid usage conflicts. But at the moment my setup does not work, since I can reach a GPU without reserving it:
srun -n 1 -p tenibre-gpu ./a.out
can use a GPU even though the allocation does not request this resource (checked by running nvidia-smi on the node). "tenibre-gpu" is a Slurm partition containing only this GPU node.
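For comparison, what I expected is that a job would only get a GPU when it requests one explicitly, something along these lines (the type name is just how I declared it in gres.conf):

srun -n 1 -p tenibre-gpu --gres=gpu:1 ./a.out
srun -n 1 -p tenibre-gpu --gres=gpu:A100-40:1 ./a.out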
Following the documentation, I've created a gres.conf file; it is propagated to all the nodes (9 compute nodes, 1 login node and the management node) and slurmd has been restarted.
gres.conf is:

## GPU setup on tenibre-gpu-0
NodeName=tenibre-gpu-0 Name=gpu Type=A100-40 File=/dev/nvidia0 Flags=nvidia_gpu_env
NodeName=tenibre-gpu-0 Name=gpu Type=A100-80 File=/dev/nvidia1 Flags=nvidia_gpu_env
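If it is useful, this is how I can check what the controller has registered for the node (I'm not sure it's the best way to verify the GRES setup):

scontrol show node tenibre-gpu-0 | grep -i gres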
In slurm.conf I have checked these settings:
## Basic scheduling
SelectTypeParameters=CR_Core_Memory
SchedulerType=sched/backfill
SelectType=select/cons_tres

## Generic resources
GresTypes=gpu

## Nodes list
....
Nodename=tenibre-gpu-0 RealMemory=257270 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
....

#partitions
PartitionName=tenibre-gpu MaxTime=48:00:00 DefaultTime=12:00:00 DefMemPerCPU=4096 MaxMemPerCPU=8192 Shared=YES State=UP Nodes=tenibre-gpu-0
...
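To illustrate the behaviour I expected, this is the kind of quick test I have in mind (my understanding is that with Flags=nvidia_gpu_env Slurm sets CUDA_VISIBLE_DEVICES for jobs that request a GPU, so only the second job should see a card):

# no GPU requested: I would expect no access to the cards here
srun -n 1 -p tenibre-gpu bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; nvidia-smi -L'
# one GPU requested: this should be the only way to reach a card
srun -n 1 -p tenibre-gpu --gres=gpu:1 bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; nvidia-smi -L'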
Maybe I've missed something? I'm running Slurm 20.11.7-1.
Thanks for your advice.
Patrick