Hi, I am currently encountering an issue with Slurm's GPU resource limits. I have attempted to restrict the number of GPUs a user can use by executing the following command:

sacctmgr modify user lyz set MaxTRES=gres/gpu=2

This command is intended to limit user 'lyz' to a maximum of 2 GPUs. However, when the user submits a job with srun and specifies CUDA devices 0, 1, 2, and 3 in the job script, for example os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3", the job still uses all 4 GPUs during execution. This indicates that the GPU usage limit is not being enforced as expected. How can I resolve this?
On 4/14/25 6:27 am, lyz--- via slurm-users wrote:
This command is intended to limit user 'lyz' to a maximum of 2 GPUs. However, when the user submits a job with srun and specifies CUDA devices 0, 1, 2, and 3 in the job script, for example os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3", the job still uses all 4 GPUs during execution. This indicates that the GPU usage limit is not being enforced as expected. How can I resolve this?
You need to make sure you're using cgroups to control access to devices for tasks; a starting point for reading up on this is here:
https://slurm.schedmd.com/cgroups.html
Good luck!
All the best, Chris
Hi, Christopher. Thank you for your reply.
I have already modified the cgroup.conf configuration file in Slurm as follows:
vim /etc/slurm/cgroup.conf

#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
Then I edited slurm.conf:
vim /etc/slurm/slurm.conf

PrologFlags=CONTAIN
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

I restarted both the slurmctld service on the head node and the slurmd service on the compute nodes.
I also set resource limits for the user:

[root@head1 ~]# sacctmgr show assoc format=cluster,account%35,user%35,partition,maxtres%35,GrpCPUs,GrpMem
   Cluster    Account       User  Partition    MaxTRES  GrpCPUs   GrpMem
---------- ---------- ---------- ---------- ---------- -------- --------
   cluster        lyz
   cluster        lyz        lyz                 gpus=2       80
However, when I specify CUDA device numbers in my .py script, for example:
import os
import time

import torch

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

def test_gpu():
    if torch.cuda.is_available():
        torch.cuda.set_device(4)
        print("CUDA is available. PyTorch can use GPU.")

        num_gpus = torch.cuda.device_count()
        print(f"Number of GPUs available: {num_gpus}")

        current_device = torch.cuda.current_device()
        print(f"Current GPU device: {current_device}")

        device_name = torch.cuda.get_device_name(current_device)
        print(f"Name of the current GPU device: {device_name}")

        x = torch.rand(5, 5).cuda()
        print("Random tensor on GPU:")
        print(x)
    else:
        print("CUDA is not available. PyTorch will use CPU.")
    # keep the process alive for a while
    time.sleep(1000)

if __name__ == "__main__":
    test_gpu()
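For reference, here is a minimal check that reports what the job actually sees instead of overriding CUDA_VISIBLE_DEVICES (a sketch, assuming PyTorch is installed; which SLURM_* variables are present depends on the Slurm version and the --gres request):

import os

import torch

# CUDA_VISIBLE_DEVICES as set by Slurm for this step (may be unset if
# no GPUs were requested with --gres).
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))

# GPUs recorded by Slurm for the job, if the variable is exported.
print("SLURM_JOB_GPUS:", os.environ.get("SLURM_JOB_GPUS"))

# Number of devices PyTorch can actually reach; with ConstrainDevices=yes
# this should match the --gres request even if the variable is edited.
print("torch.cuda.device_count():", torch.cuda.device_count())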
When I run this script, it still bypasses the resource restrictions set by cgroup.
Are there any other ways to solve this problem?
You need to add
ConstrainDevices=yes
to your cgroup.conf and restart slurmd on your nodes. This is the setting that gives access to only the GRES you request in your jobs.
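With that line added, the cgroup.conf ends up looking like this (a sketch based on the settings already shown earlier in this thread):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes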
Sean
Hi, Sean. I followed your instructions and added ConstrainDevices=yes to the /etc/slurm/cgroup.conf file on the server node, and then restarted the relevant services on both the server and the client. However, I still can't enforce the restriction in the Python program.
It seems like the restriction applies to the physical GPU hardware, but it doesn't take effect for CUDA.
What version of Slurm are you running and what's the contents of your gres.conf file?
Sean
On 4/15/25 12:55 pm, Sean Crosby via slurm-users wrote:
What version of Slurm are you running and what's the contents of your gres.conf file?
Also what does this say?
systemctl cat slurmd | fgrep Delegate
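(If that prints nothing, delegation can usually be enabled with a systemd drop-in, e.g. created via "systemctl edit slurmd" -- a sketch, the exact unit layout depends on how slurmd was packaged:)

[Service]
Delegate=yes

followed by "systemctl daemon-reload" and a restart of slurmd.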
Hi, Chris. Thank you for continuing to pay attention to this issue. I followed your instruction, and this is the output:
[root@head1 ~]# systemctl cat slurmd | fgrep Delegate
Delegate=yes
lyz
Hiya,
On 4/15/25 7:03 pm, lyz--- via slurm-users wrote:
Hi, Chris. Thank you for continuing to pay attention to this issue. I followed your instruction, and this is the output:

[root@head1 ~]# systemctl cat slurmd | fgrep Delegate
Delegate=yes
That looks good to me, thanks for sharing that!
Hi, Sean. It's the latest Slurm version.

[root@head1 ~]# sinfo --version
slurm 22.05.3
And this is the content of the gres.conf on the GPU node:

# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
Name=gpu File=/dev/nvidia4
Name=gpu File=/dev/nvidia5
Name=gpu File=/dev/nvidia6
Name=gpu File=/dev/nvidia7
# END AUTOGENERATED SECTION -- DO NOT REMOVE
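(As an aside, a sketch not specific to this cluster: if the Slurm build includes NVML support, the same per-device mapping can be generated automatically with a one-line gres.conf, letting slurmd discover the /dev/nvidia* devices itself:)

AutoDetect=nvml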
On 4/15/25 6:57 pm, lyz--- via slurm-users wrote:
Hi, Sean. It's the latest Slurm version.

[root@head1 ~]# sinfo --version
slurm 22.05.3
That's quite old (and no longer supported); the oldest still-supported version is 23.11.10, and 24.11.4 came out recently.
What does the cgroup.conf file on one of your compute nodes look like?
All the best, Chris
Hi, Chris. The cgroup.conf on my GPU node is the same as on the head node. The contents are as follows:

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
I'll try a newer version of Slurm.
Hi, Chris! Thank you again for your instructions.
I've tried version 23.11.10. It does work.
When I ran the script with the following command, it successfully restricted usage to the allocated CUDA devices:

srun -p gpu --gres=gpu:2 --nodelist=node11 python test.py

And when I checked the GPUs with this command, I saw the expected number of GPUs:

srun -p gpu --gres=gpu:2 --nodelist=node11 --pty nvidia-smi
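For reference, the same request as a batch script (a sketch using the partition and node names from this thread):

#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:2
#SBATCH --nodelist=node11

python test.py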
Thank you very much for your guidance.
Best, lyz
Hiya!
On 16/4/25 12:56 am, lyz--- via slurm-users wrote:
I've tried version 23.11.10. It does work.
Oh that's wonderful, so glad it helped! It did seem quite odd that it wasn't working for you before then. I wonder if this was a cgroups v1 vs cgroups v2 thing?
All the best, Chris
Hi Chris!
I didn't modify the cgroup configuration file; I only upgraded the Slurm version. After that, the limitations worked successfully.
It's quite odd.
lyz