Hi, I am currently encountering an issue with Slurm's GPU resource limits. I have attempted to restrict the number of GPUs a user can use by executing the following command:

sacctmgr modify user lyz set MaxTRES=gres/gpu=2

This command is intended to limit user 'lyz' to a maximum of 2 GPUs. However, when the user submits a job with srun and specifies CUDA devices 0, 1, 2, and 3 in the job script, for example os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3", the job still uses all 4 GPUs during execution. This indicates that the GPU usage limit is not being enforced as expected. How can I resolve this?
On 4/14/25 6:27 am, lyz--- via slurm-users wrote:
This command is intended to limit user 'lyz' to a maximum of 2 GPUs. However, when the user submits a job with srun and specifies CUDA devices 0, 1, 2, and 3 in the job script, for example os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3", the job still uses all 4 GPUs during execution. This indicates that the GPU usage limit is not being enforced as expected. How can I resolve this?
You need to make sure you're using cgroups to control access to devices for tasks; a starting point for reading up on this is here:
https://slurm.schedmd.com/cgroups.html
Good luck!
All the best, Chris
Hi, Christopher. Thank you for your reply.
I have already modified the cgroup.conf configuration file in Slurm as follows:
vim /etc/slurm/cgroup.conf

#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
Then I edited slurm.conf:
vim /etc/slurm/slurm.conf

PrologFlags=CONTAIN
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

I restarted both the slurmctld service on the head node and the slurmd service on the compute nodes.
I also set resource limits for the user:

[root@head1 ~]# sacctmgr show assoc format=cluster,account%35,user%35,partition,maxtres%35,GrpCPUs,GrpMem
   Cluster    Account       User  Partition    MaxTRES  GrpCPUs   GrpMem
---------- ---------- ---------- ---------- ---------- -------- --------
   cluster        lyz
   cluster        lyz        lyz                 gpus=2       80
However, when I specify CUDA device numbers in my .py script, for example:
import os
import time

import torch

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

def test_gpu():
    if torch.cuda.is_available():
        torch.cuda.set_device(4)
        print("CUDA is available. PyTorch can use GPU.")

        num_gpus = torch.cuda.device_count()
        print(f"Number of GPUs available: {num_gpus}")

        current_device = torch.cuda.current_device()
        print(f"Current GPU device: {current_device}")

        device_name = torch.cuda.get_device_name(current_device)
        print(f"Name of the current GPU device: {device_name}")

        x = torch.rand(5, 5).cuda()
        print("Random tensor on GPU:")
        print(x)
    else:
        print("CUDA is not available. PyTorch will use CPU.")
    # keep the process alive for a while
    time.sleep(1000)

if __name__ == "__main__":
    test_gpu()
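For reference, here is a minimal check that reports what the job actually sees instead of overriding CUDA_VISIBLE_DEVICES (a sketch, assuming PyTorch is installed; which SLURM_* variables are present depends on the Slurm version and the --gres request):

import os

import torch

# CUDA_VISIBLE_DEVICES as set by Slurm for this step (may be unset if
# no GPUs were requested with --gres).
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))

# GPUs recorded by Slurm for the job, if the variable is exported.
print("SLURM_JOB_GPUS:", os.environ.get("SLURM_JOB_GPUS"))

# Number of devices PyTorch can actually reach; with ConstrainDevices=yes
# this should match the --gres request even if the variable is edited.
print("torch.cuda.device_count():", torch.cuda.device_count())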
When I run this script, it still bypasses the resource restrictions set by cgroup.
Are there any other ways to solve this problem?
You need to add
ConstrainDevices=yes
to your cgroup.conf and restart slurmd on your nodes. This is the setting that gives access to only the GRES you request in your jobs.
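With that line added, the cgroup.conf ends up looking like this (a sketch based on the settings already shown earlier in this thread):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes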
Sean
Hi, Sean. I followed your instructions and added ConstrainDevices=yes to the /etc/slurm/cgroup.conf file on the server node, and then restarted the relevant services on both the server and the client. However, I still can't enforce the restriction in the Python program.
It seems like the restriction applies to the physical GPU hardware, but it doesn't take effect for CUDA.
What version of Slurm are you running and what's the contents of your gres.conf file?
Sean
On 4/15/25 12:55 pm, Sean Crosby via slurm-users wrote:
What version of Slurm are you running and what's the contents of your gres.conf file?
Also what does this say?
systemctl cat slurmd | fgrep Delegate
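(If that prints nothing, delegation can usually be enabled with a systemd drop-in, e.g. created via "systemctl edit slurmd" -- a sketch, the exact unit layout depends on how slurmd was packaged:)

[Service]
Delegate=yes

followed by "systemctl daemon-reload" and a restart of slurmd.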
Hi, Chris. Thank you for continuing to pay attention to this issue. I followed your instruction, and this is the output:
[root@head1 ~]# systemctl cat slurmd | fgrep Delegate
Delegate=yes
lyz
Hiya,
On 4/15/25 7:03 pm, lyz--- via slurm-users wrote:
Hi, Chris. Thank you for continuing to pay attention to this issue. I followed your instruction, and this is the output:

[root@head1 ~]# systemctl cat slurmd | fgrep Delegate
Delegate=yes
That looks good to me, thanks for sharing that!
Hi, Sean. It's the latest Slurm version.

[root@head1 ~]# sinfo --version
slurm 22.05.3
And this is the content of the gres.conf on the GPU node:

# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
Name=gpu File=/dev/nvidia4
Name=gpu File=/dev/nvidia5
Name=gpu File=/dev/nvidia6
Name=gpu File=/dev/nvidia7
# END AUTOGENERATED SECTION -- DO NOT REMOVE
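(As an aside, a sketch not specific to this cluster: if the Slurm build includes NVML support, the same per-device mapping can be generated automatically with a one-line gres.conf, letting slurmd discover the /dev/nvidia* devices itself:)

AutoDetect=nvml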
On 4/15/25 6:57 pm, lyz--- via slurm-users wrote:
Hi, Sean. It's the latest Slurm version.

[root@head1 ~]# sinfo --version
slurm 22.05.3
That's quite old (and no longer supported); the oldest still-supported version is 23.11.10, and 24.11.4 came out recently.
What does the cgroup.conf file on one of your compute nodes look like?
All the best, Chris
Hi, Chris. The cgroup.conf on my GPU node is the same as on the head node. The contents are as follows:

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
I'll try a newer version of Slurm.
Hi, Chris! Thank you again for your instructions.
I've tried version 23.11.10. It does work.
When I ran the script with the following command, it successfully restricted usage to the allocated CUDA devices:

srun -p gpu --gres=gpu:2 --nodelist=node11 python test.py

And when I checked the GPUs with this command, I saw the expected number of GPUs:

srun -p gpu --gres=gpu:2 --nodelist=node11 --pty nvidia-smi
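For reference, the same request as a batch script (a sketch using the partition and node names from this thread):

#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:2
#SBATCH --nodelist=node11

python test.py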
Thank you very much for your guidance.
Best, lyz
Hiya!
On 16/4/25 12:56 am, lyz--- via slurm-users wrote:
I've tried version 23.11.10. It does work.
Oh that's wonderful, so glad it helped! It did seem quite odd that it wasn't working for you before then. I wonder if this was a cgroups v1 vs cgroups v2 thing?
All the best, Chris
Hi Chris!
I didn't modify the cgroup configuration file; I only upgraded the Slurm version. After that, the limitations worked successfully.
It's quite odd.
lyz