On 13/11/2024 at 15:45, Roberto Polverelli Monti via slurm-users wrote:
Hello Patrick,
On 11/13/24 12:01 PM, Patrick Begou via slurm-users wrote:
As usage of this GPU resource increases, I would like to manage it with GRES to avoid usage conflicts. But at this time my setup does not work, as I can reach a GPU without reserving it:
srun -n 1 -p tenibre-gpu ./a.out
can use a GPU even if the reservation does not specify this resource (checked by running nvidia-smi on the node). "tenibre-gpu" is a Slurm partition containing only this GPU node.
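(What I would expect is that users have to request the GPU explicitly, something like

    srun -n 1 -p tenibre-gpu --gres=gpu:1 ./a.out

and that a plain srun without --gres gets no access to the devices. Instead, the plain srun above can still use them.)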
I think what you're looking for is the ConstrainDevices parameter in cgroup.conf.
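For example, a minimal cgroup.conf along these lines (ConstrainDevices is the relevant line; the other settings are just common companions, adapt to your site):

    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes

With ConstrainDevices=yes, slurmd should only whitelist the GPU device files that a job actually allocated.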
See here:
Best,
Hi Roberto,
thanks for pointing out this parameter. I set it, updated all the nodes, and restarted slurmd everywhere, but it does not change the behavior. However, looking at the slurmd log on the GPU node I noticed this:
[2024-11-13T16:41:08.434] debug: CPUs:32 Boards:1 Sockets:8 CoresPerSocket:4 ThreadsPerCore:1
[2024-11-13T16:41:08.434] debug: gres/gpu: init: loaded
[2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES gpu:A100-40 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
[2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES gpu:A100-80 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
[2024-11-13T16:41:08.434] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2024-11-13T16:41:08.434] topology/none: init: topology NONE plugin loaded
[2024-11-13T16:41:08.434] route/default: init: route default plugin loaded
[2024-11-13T16:41:08.434] CPU frequency setting not configured for this node
[2024-11-13T16:41:08.434] debug: Resource spec: No specialized cores configured by default on this node
[2024-11-13T16:41:08.434] debug: Resource spec: Reserved system memory limit not configured for this node
[2024-11-13T16:41:08.434] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2024-11-13T16:41:08.434] error: MaxSwapPercent value (0.0%) is not a valid number
[2024-11-13T16:41:08.436] debug: task/cgroup: init: core enforcement enabled
[2024-11-13T16:41:08.437] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257281M allowed:100%(enforced), swap:0%(enforced), max:100%(257281M) max+swap:100%(514562M) min:30M kmem:100%(257281M permissive) min:30M swappiness:0(unset)
[2024-11-13T16:41:08.437] debug: task/cgroup: init: memory enforcement enabled
[2024-11-13T16:41:08.438] debug: task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2024-11-13T16:41:08.438] debug: task/cgroup: init: device enforcement enabled
[2024-11-13T16:41:08.438] debug: task/cgroup: init: task/cgroup: loaded
[2024-11-13T16:41:08.438] debug: auth/munge: init: Munge authentication plugin loaded
So I think something is wrong in my gres.conf file, maybe because I try to configure 2 different devices on the same node?
## GPU setup on tenibre-gpu-0
NodeName=tenibre-gpu-0 Name=gpu Type=A100-40 File=/dev/nvidia0 Flags=nvidia_gpu_env
NodeName=tenibre-gpu-0 Name=gpu Type=A100-80 File=/dev/nvidia1 Flags=nvidia_gpu_env
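If I read the warning correctly, the NodeName line in my slurm.conf must also declare both GPU types so the counts match gres.conf; maybe something like this (a guess on my side, the exact Type strings must match gres.conf):

    NodeName=tenibre-gpu-0 Gres=gpu:A100-40:1,gpu:A100-80:1

Is that the expected way to declare two different GPU models on one node?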
Patrick