[slurm-users] ConstrainRAMSpace=yes and page cache?

Sam Gallop (NBI) sam.gallop at nbi.ac.uk
Fri Jun 14 11:21:24 UTC 2019


Hi Jürgen,

I'm not aware of a Slurm-onic way of doing this. As you've said, this is the behaviour of cgroups, which Slurm is employing. As I understand it, page cache allocated on a process's behalf is accounted within that process's cgroup, and I'm not aware of a way to stop the memory resource controller from charging the page cache against the cgroup.
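
For what it's worth, you can watch this accounting from inside a job by reading memory.stat for the job's memory cgroup. A rough sketch, assuming a cgroup v1 hierarchy and the slurm/uid_*/job_* path layout used by the task/cgroup plugin (the exact path may differ on your system):

# Path is an assumption; adjust for your distribution / Slurm setup.
CG=/sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}

# "cache" is page cache charged to the job, "rss" is anonymous memory;
# both count towards the limit when ConstrainRAMSpace=yes.
grep -E '^(total_)?(cache|rss) ' ${CG}/memory.stat
cat ${CG}/memory.limit_in_bytes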

I would also advise caution with ConstrainKmemSpace=yes. We recently turned it off (ConstrainKmemSpace=no) because we were experiencing a number of job failures caused by memory leaks (see https://bugzilla.redhat.com/show_bug.cgi?id=1507149, https://bugs.schedmd.com/show_bug.cgi?id=3694 and https://slurm.schedmd.com/archive/slurm-18.08-latest/news.html - search for ConstrainKmemSpace). In Slurm 18.08 it is disabled by default.
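
For reference, this is roughly what the constraint-related part of our cgroup configuration looks like now (a sketch only; the values are illustrative rather than a recommendation):

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainKmemSpace=no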

As an aside - to see exactly how many pages of a file are in the page cache, you may consider using vmtouch (https://hoytech.com/vmtouch/) and/or linux-ftools (https://code.google.com/archive/p/linux-ftools/). Both use mincore/fadvise to query, load or evict the page cache for specific files or directories.
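
For example, with vmtouch (a sketch; /path/to/testfile is a placeholder):

# Show how much of the file is currently resident in the page cache
vmtouch -v /path/to/testfile

# Evict the file's pages from the page cache
vmtouch -e /path/to/testfile

# Touch (load) the file into the page cache
vmtouch -t /path/to/testfile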

---
Sam Gallop

-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Juergen Salk
Sent: 14 June 2019 09:14
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] ConstrainRAMSpace=yes and page cache?

Dear Kilian,

thanks for pointing this out. I should have mentioned that I had already browsed the cgroups.conf man page up and down but did not find any specific hints on how to achieve the desired behavior. Maybe I am still missing something obvious?

Also, the kernel cgroups documentation indicates that page cache and anonymous memory are both accounted as userland memory [1]:

--- snip ---
While not completely water-tight, all major memory usages by a given cgroup are tracked so that the total memory consumption can be accounted and controlled to a reasonable extent. Currently, the following types of memory usages are tracked.

    Userland memory - page cache and anonymous memory.
    Kernel data structures such as dentries and inodes.
    TCP socket buffers.
--- snip ---

That's why I'm somewhat unsure whether KmemSpace options in cgroups.conf can address this issue.

I guess my question simply boils down to whether there is a Slurm-ish way to prevent active page caches from being counted against memory constraints when ConstrainRAMSpace=yes is set?

Best regards
Jürgen

[1] https://www.kernel.org/doc/html/v4.18/admin-guide/cgroup-v2.html

--
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz) Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471



* Kilian Cavalotti <kilian.cavalotti.work at gmail.com> [190613 17:27]:
> Hi Jürgen,
> 
> I would take a look at the various *KmemSpace options in
> cgroups.conf, they can certainly help with this.
> 
> Cheers,
> --
> Kilian
> 
> On Thu, Jun 13, 2019 at 2:41 PM Juergen Salk
> <juergen.salk at uni-ulm.de> wrote:
> >
> > Dear all,
> >
> > I'm just starting to get used to Slurm and play around with it in
> > a small test environment within our old cluster.
> >
> > For our next system we will probably have to abandon our current
> > exclusive user node access policy in favor of a shared user
> > policy, i.e. jobs from different users will then run side by side
> > on the same node at the same time. In order to prevent the jobs
> > from interfering with each other, I have set both
> > ConstrainCores=yes and ConstrainRAMSpace=yes in cgroups.conf,
> > which works as expected for limiting the memory of the processes
> > to the value requested at job submission (e.g. by --mem=...
> > option).
> >
> > However, I've noticed that ConstrainRAMSpace=yes does also cap the
> > available page cache for which the Linux kernel normally exploits
> > any unused areas of the memory in a flexible way. This may result
> > in a significant performance impact as we do have quite a number
> > of IO demanding applications (predominated by read operations)
> > that are known to benefit a lot from page caching.
> >
> > Here comes a small example to illustrate this issue. The job
> > writes a 16 GB file to a local scratch file system, measures the
> > amount of data cached in memory and then reads the file previously
> > written.
> >
> > $ cat job.slurm
> > #!/bin/bash
> > #SBATCH --partition=standard
> > #SBATCH --nodes=1
> > #SBATCH --ntasks-per-node=1
> > #SBATCH --time=00:10:00
> >
> > # Get amount of data cached in memory before writing the file
> > cached1=`awk '$1=="Cached:" {print $2}' /proc/meminfo`
> >
> > # Write 16 GB file to local scratch SSD
> > dd if=/dev/zero of=$SCRATCH/testfile count=16 bs=1024M
> >
> > # Get amount of data cached in memory after writing the file
> > cached2=`awk '$1=="Cached:" {print $2}' /proc/meminfo`
> >
> > # Print difference of data cached in memory
> > echo -e "\nIncreased cached data by $(((cached2-cached1)/1000000)) GB\n"
> >
> > # Read the file previously written
> > dd if=$SCRATCH/testfile of=/dev/null count=16 bs=1024M
> >
> > $
> >
> > For reference, this is the result *without* ConstrainRAMSpace=yes set in
> > cgroups.conf and submitted with `sbatch --mem=2G --gres=scratch:16 job.slurm´
> >
> > --- snip ---
> > 16+0 records in
> > 16+0 records out
> > 17179869184 bytes (17 GB) copied, 10.9839 s, 1.6 GB/s
> >
> > Increased cached data by 16 GB
> >
> > 16+0 records in
> > 16+0 records out
> > 17179869184 bytes (17 GB) copied, 5.03225 s, 3.4 GB/s
> > --- snip ---
> >
> > Note that there is 16 GB of data cached and the read performance
> > is 3.4 GB/s as the data is actually read from page cache.
> >
> > And this is the result *with* ConstrainRAMSpace=yes set in cgroups.conf
> > and submitted with the very same command:
> >
> > --- snip ---
> > 16+0 records in
> > 16+0 records out
> > 17179869184 bytes (17 GB) copied, 13.3163 s, 1.3 GB/s
> >
> > Increased cached data by 1 GB
> >
> > 16+0 records in
> > 16+0 records out
> > 17179869184 bytes (17 GB) copied, 11.1098 s, 1.5 GB/s
> > --- snip ---
> >
> > Now only 1 GB of data has been cached (roughly the 2 GB requested for
> > the job minus the 1 GB allocated for the dd buffer), resulting in a read
> > performance degradation to 1.5 GB/s (compared to 3.4 GB/s above).
> >
> > Finally, this is the result *with* ConstrainRAMSpace=yes set in cgroups.conf
> > and the job submitted with `sbatch --mem=18G --gres=scratch:16 job.slurm´:
> >
> > --- snip ---
> > 16+0 records in
> > 16+0 records out
> > 17179869184 bytes (17 GB) copied, 11.0601 s, 1.6 GB/s
> >
> > Increased cached data by 16 GB
> >
> > 16+0 records in
> > 16+0 records out
> > 17179869184 bytes (17 GB) copied, 5.01643 s, 3.4 GB/s
> > --- snip ---
> >
> > This is almost the same result as in the unconstrained case (i.e.
> > without ConstrainRAMSpace=yes set in cgroups.conf) as the amount
> > of memory requested for the job (18 GB) is large enough to allow
> > the file to be fully cached in memory.
> >
> > I do not think this is an issue with Slurm itself but how cgroups
> > are supposed to work. However, I wonder how others cope with this.
> >
> > Maybe we have to teach our users to also consider page cache when
> > requesting a certain amount of memory for their jobs?
> >
> > Any comment or idea would be highly appreciated.
> >
> > Thank you in advance.
> >
> > Best regards
> > Jürgen
> >
> > --
> > Jürgen Salk
> > Scientific Software & Compute Services (SSCS)
> > Kommunikations- und Informationszentrum (kiz) Universität Ulm
> > Telefon: +49 (0)731 50-22478
> > Telefax: +49 (0)731 50-22471
> >
> 
> 
> -- Kilian
> 

-- 
GPG A997BA7A | 87FC DA31 5F00 C885 0DC3  E28F BD0D 4B33 A997 BA7A
