Hello,
We've noticed a recent change in how MaxRSS is reported on our cluster. Specifically, the MaxRSS value for many jobs now often matches the allocated memory, which was not the case previously. It appears this change is due to how Slurm accounts for memory when copying large files, likely as a result of moving from cgroup v1 to cgroup v2.
Here’s a simple example:
copy_file.sh:

    #!/bin/bash
    cp /distributed/filesystem/file5G /tmp
    cp /tmp/file5G ~
Two jobs with different memory allocations:
Job 1:

    sbatch -c 1 --mem=1G copy_file.sh
    seff <jobid>
      Memory Utilized: 1021.87 MB
      Memory Efficiency: 99.79% of 1.00 GB

Job 2:

    sbatch -c 1 --mem=10G copy_file.sh
    seff <jobid>
      Memory Utilized: 4.02 GB
      Memory Efficiency: 40.21% of 10.00 GB
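For reference, the raw figures that seff summarizes can also be pulled straight from sacct (the job ID is a placeholder):

    sacct -j <jobid> --format=JobID,ReqMem,MaxRSS,Elapsed
    # MaxRSS is reported on the job steps (batch/extern), not the top-level job line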
With cgroup v1, this script typically showed minimal memory usage. Now, under cgroup v2, memory usage appears inflated and depends on the allocated memory, which seems wrong.
I believe this behavior aligns with similar issues raised by the Kubernetes community [1], and is consistent with how memory.current behaves in cgroup v2 [3].
According to Slurm’s documentation about cgroup v2, "this plugin provides cgroup's memory.current value from the memory interface, which is not equal to the RSS value provided by procfs. Nevertheless it is the same value that the kernel uses in its OOM killer logic." [2]
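To see what memory.current is actually counting for a step, something like the following can be run from inside the job. This is only a sketch; it assumes a pure cgroup v2 (unified) hierarchy mounted at /sys/fs/cgroup, so the step's cgroup path can be read from /proc/self/cgroup:

    #!/bin/bash
    # Resolve this step's cgroup v2 directory (the "0::<path>" line in /proc/self/cgroup)
    cg=/sys/fs/cgroup$(sed -n 's/^0:://p' /proc/self/cgroup)
    echo "cgroup: $cg"
    cat "$cg/memory.current"
    # memory.stat breaks the charge down: "anon" is closest to a classic RSS,
    # "file" is the page cache pulled in by the copies
    grep -E '^(anon|file) ' "$cg/memory.stat"

During the copies the "file" line should dominate memory.current, and that page-cache charge is what ends up reported as MaxRSS here.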
While technically correct, this is a significant change in what MaxRSS and "Memory Efficiency" actually measure, and it makes those metrics nearly useless for judging a job's real memory footprint.
Our configuration:

    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity
Question: Is there a way to restore more realistic MaxRSS values — specifically, ones that exclude file-backed page cache — while still using cgroup v2?
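For comparison only, a rough wrapper along these lines (the script name, the 1-second sampling interval, and the temp-file handling are all arbitrary choices) can sample the cgroup's "anon" counter while the payload runs and report its peak, which is much closer to the old RSS-style figure:

    #!/bin/bash
    # track_anon_peak.sh <command...>  -- hypothetical wrapper, run inside the job
    cg=/sys/fs/cgroup$(sed -n 's/^0:://p' /proc/self/cgroup)
    out=$(mktemp)
    echo 0 > "$out"
    ( peak=0
      while sleep 1; do
        anon=$(awk '/^anon /{print $2}' "$cg/memory.stat")
        if [ "$anon" -gt "$peak" ]; then peak=$anon; echo "$peak" > "$out"; fi
      done ) &
    sampler=$!
    "$@"                      # run the real payload, e.g. the two cp commands
    kill "$sampler"
    echo "peak anon bytes: $(cat "$out")"
    rm -f "$out"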
Thanks, Guillaume
References:
[1] https://github.com/kubernetes/kubernetes/issues/118916
[2] https://slurm.schedmd.com/cgroup_v2.html#limitations
[3] https://facebookmicrosites.github.io/cgroup2/docs/memory-controller.html
Hi Guillaume,
Is anything else different between the v1 and v2 setups? (Perhaps /tmp is a tmpfs on the v2 setup?)
Stijn
Hi,
No changes. My example used /tmp, but the behaviour is the same for copies between any filesystems (e.g. from a distributed fs to another distributed fs).
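(For reference, that can be checked with a variant of the test script; the source and destination paths below are placeholders:)

    #!/bin/bash
    # copy_file_anyfs.sh -- same test, but between two distributed filesystems;
    # print the cgroup's page-cache charge after each copy
    cg=/sys/fs/cgroup$(sed -n 's/^0:://p' /proc/self/cgroup)
    cp /distributed/fs1/file5G /distributed/fs2/
    grep '^file ' "$cg/memory.stat"
    cp /distributed/fs2/file5G ~
    grep '^file ' "$cg/memory.stat"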
Guillaume
What kernel are you using? I had a similar issue with an older RHEL 9 kernel, which has since been fixed.
Sean
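(For anyone comparing nodes, the kernel version and cgroup mode can be confirmed with, e.g.:)

    uname -r
    stat -fc %T /sys/fs/cgroup   # prints "cgroup2fs" when the unified v2 hierarchy is mounted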