Hello Patrick,
On 11/13/24 12:01 PM, Patrick Begou via slurm-users wrote:
As usage of this GPU resource increases, I would like to manage it with GRES to avoid usage conflicts. But at the moment my setup does not work, as I can reach a GPU without reserving it:
srun -n 1 -p tenibre-gpu ./a.out
can use a GPU even though the job does not request this resource (checked by running nvidia-smi on the node). "tenibre-gpu" is a Slurm partition containing only this GPU node.
I think what you're looking for is the ConstrainDevices parameter in cgroup.conf.
See here:
- https://slurm.schedmd.com/archive/slurm-20.11.7/cgroup.conf.html
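For reference, enabling it looks something like the sketch below; ConstrainDevices is the relevant line, the other parameters are just illustrative of a typical cgroup setup, not taken from your configuration:

```
# /etc/slurm/cgroup.conf — minimal sketch; only ConstrainDevices
# is the point here, the rest are common companion settings
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
```

With ConstrainDevices=yes, the task/cgroup plugin should hide GPU device files from jobs that did not request them via --gres.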
Best,
Hi Roberto,
thanks for pointing to this parameter. I set it, updated all the
nodes, and restarted slurmd everywhere, but it does not change the
behavior.
However, when looking in the slurmd log on the GPU node I noticed
the following:
[2024-11-13T16:41:08.434] debug: CPUs:32 Boards:1 Sockets:8 CoresPerSocket:4 ThreadsPerCore:1
[2024-11-13T16:41:08.434] debug: gres/gpu: init: loaded
[2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES gpu:A100-40 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
[2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES gpu:A100-80 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
[2024-11-13T16:41:08.434] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2024-11-13T16:41:08.434] topology/none: init: topology NONE plugin loaded
[2024-11-13T16:41:08.434] route/default: init: route default plugin loaded
[2024-11-13T16:41:08.434] CPU frequency setting not configured for this node
[2024-11-13T16:41:08.434] debug: Resource spec: No specialized cores configured by default on this node
[2024-11-13T16:41:08.434] debug: Resource spec: Reserved system memory limit not configured for this node
[2024-11-13T16:41:08.434] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2024-11-13T16:41:08.434] error: MaxSwapPercent value (0.0%) is not a valid number
[2024-11-13T16:41:08.436] debug: task/cgroup: init: core enforcement enabled
[2024-11-13T16:41:08.437] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257281M allowed:100%(enforced), swap:0%(enforced), max:100%(257281M) max+swap:100%(514562M) min:30M kmem:100%(257281M permissive) min:30M swappiness:0(unset)
[2024-11-13T16:41:08.437] debug: task/cgroup: init: memory enforcement enabled
[2024-11-13T16:41:08.438] debug: task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2024-11-13T16:41:08.438] debug: task/cgroup: init: device enforcement enabled
[2024-11-13T16:41:08.438] debug: task/cgroup: init: task/cgroup: loaded
[2024-11-13T16:41:08.438] debug: auth/munge: init: Munge authentication plugin loaded
So I think something is wrong in my gres.conf file, perhaps because I try to
configure two different devices on the node?
## GPU setup on tenibre-gpu-0
NodeName=tenibre-gpu-0 Name=gpu Type=A100-40 File=/dev/nvidia0 Flags=nvidia_gpu_env
NodeName=tenibre-gpu-0 Name=gpu Type=A100-80 File=/dev/nvidia1 Flags=nvidia_gpu_env
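If I understand the "1 more configured than expected in slurm.conf" warnings, the node's Gres= declaration in slurm.conf may not match these counts. I imagine the matching line would look something like this (the CPU/socket/memory values are taken from the slurmd log above, so they may need adjusting):

```
# slurm.conf — sketch; the Gres= part is what the warnings point at
NodeName=tenibre-gpu-0 CPUs=32 Boards=1 SocketsPerBoard=8 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=257281 Gres=gpu:A100-40:1,gpu:A100-80:1
```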
Patrick