[slurm-users] ConstrainRAMSpace=yes and page cache?

Juergen Salk juergen.salk at uni-ulm.de
Fri Jun 21 14:25:55 UTC 2019


Dear Sam,

thanks for your response. And sorry if I'm overstressing the subject
a bit.

I've experimented a bit more with various configuration options.

As an alternative approach, I have now dropped `ConstrainRAMSpace=yes´
from cgroups.conf but added `JobAcctGatherParams=OverMemoryKill´ in
slurm.conf, along with `JobAcctGatherType=jobacct_gather/linux´ (which
is supposed to be the recommended job accounting mechanism) and
`JobAcctGatherFrequency=30´ (which is the default sampling interval
anyway, and we do not expect jobs with more than 10,000 tasks).
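
For reference, this is roughly what the relevant lines look like on
our test system now (just a sketch of the excerpts; node and partition
definitions and all other settings are omitted and values may of
course differ per site):

--- snip ---
# slurm.conf (excerpt)
# poll job memory usage via /proc and kill jobs exceeding their request
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
JobAcctGatherParams=OverMemoryKill

# cgroups.conf (excerpt)
# keep CPU confinement, but no longer cap RAM (and thus page cache)
ConstrainCores=yes
ConstrainRAMSpace=no
--- snip ---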

I am aware that this is probably more of a workaround than a
watertight solution for constraining the memory consumption of a job,
as it suffers from the sampling interval latency and, maybe even
worse, can also be bypassed by means of --acctg-freq=0.
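
Just to illustrate that second point: as far as I can see, a user
could simply switch off the sampling for his or her own job, e.g.
with something like

--- snip ---
# accounting sampling disabled for this job, so OverMemoryKill never
# gets to see the job's memory usage (hypothetical example)
sbatch --acctg-freq=0 --mem=2G job.slurm
--- snip ---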

On the other hand, this approach seems to allow the operating system
to use any unused memory flexibly for the page cache, regardless of
the amount of memory the user requested for the job, while still
providing some protection against jobs that exceed the memory
requested at submission.

Maybe this is at least some kind of trade-off between overly strict
confinement of memory consumption (in the sense of totally sacrificing
the benefits of the page cache, as with ConstrainRAMSpace=yes set in
cgroups.conf) and no memory constraints at all.

Does anyone use a similar approach in a production environment and
would be willing to share their practical experience? Any thoughts
are highly appreciated.

Have a nice weekend. 

Best regards
Jürgen Salk

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471


* Sam Gallop (NBI) <sam.gallop at nbi.ac.uk> [190614 11:21]:
> Hi Jürgen,
> 
> I'm not aware of a Slurm-onic way of doing this. As you've said, this
> is the behaviour of cgroups, which Slurm is employing. As I
> understand it, page cache pages are charged to the cgroup of the
> process that caused their allocation, and I'm not aware of a way to
> prevent the memory resource controller from accounting for the page
> cache.
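> 
> One way to see this in action (a rough sketch, assuming the usual
> cgroup v1 hierarchy that Slurm creates; the exact path may well
> differ on your systems) is to look at memory.stat inside the job's
> memory cgroup and compare the "cache" and "rss" counters:
> 
> --- snip ---
> # from within a running job; "cache" is page cache, "rss" is
> # anonymous memory, and both are charged against the job's limit
> grep -E '^(cache|rss) ' \
>     /sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/memory.stat
> --- snip ---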
> 
> I would also advise caution with using ConstrainKmemSpace=yes. We've
> recently turned it off (ConstrainKmemSpace=no) as we were
> experiencing a number of job failures due to memory leaks (see
> https://bugzilla.redhat.com/show_bug.cgi?id=1507149,
> https://bugs.schedmd.com/show_bug.cgi?id=3694,
> https://slurm.schedmd.com/archive/slurm-18.08-latest/news.html -
> search for ConstrainKmemSpace). In Slurm v18.08 it's disabled by
> default.
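> 
> In case it helps, the relevant line in our cgroups.conf now simply
> reads (a minimal excerpt, the rest of our settings left out):
> 
> --- snip ---
> # disabled because of the kmem accounting leaks referenced above;
> # this is also the default from Slurm 18.08 onwards
> ConstrainKmemSpace=no
> --- snip ---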
> 
> As an aside - to see exactly how many pages of a file are currently
> resident in the page cache, you may consider using vmtouch
> (https://hoytech.com/vmtouch/) and/or linux-ftools
> (https://code.google.com/archive/p/linux-ftools/). Both use
> mincore/fadvise to query, populate or evict the page cache for
> specific files or directories.
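> 
> For example (just a sketch, the path below is a placeholder):
> 
> --- snip ---
> # show how many pages of the file are currently resident in page cache
> vmtouch -v /scratch/testfile
> 
> # evict the file from page cache again
> vmtouch -e /scratch/testfile
> --- snip ---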
> 
> ---
> Sam Gallop
> 
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Juergen Salk
> Sent: 14 June 2019 09:14
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] ConstrainRAMSpace=yes and page cache?
> 
> Dear Kilian,
> 
> thanks for pointing this out. I should have mentioned that I had
> already browsed the cgroups.conf man page up and down but did not
> find any specific hints on how to achieve the desired behavior.
> Maybe I am still missing something obvious?
> 
> Also, the kernel cgroups documentation indicates that page cache and
> anonymous memory are both accounted as userland memory [1]:
> 
> --- snip ---
> While not completely water-tight, all major memory usages by a given
> cgroup are tracked so that the total memory consumption can be
> accounted and controlled to a reasonable extent. Currently, the
> following types of memory usages are tracked.
> 
>     Userland memory - page cache and anonymous memory.
>     Kernel data structures such as dentries and inodes.
>     TCP socket buffers.
> --- snip ---
> 
> That's why I'm somewhat unsure whether KmemSpace options in
> cgroups.conf can address this issue.
> 
> I guess my question simply boils down to this: is there a Slurm-ish
> way to prevent the page cache from being counted against the memory
> constraints when ConstrainRAMSpace=yes is set?
> 
> Best regards
> Jürgen
> 
> [1] https://www.kernel.org/doc/html/v4.18/admin-guide/cgroup-v2.html
> 
> --
> Jürgen Salk
> Scientific Software & Compute Services (SSCS)
> Kommunikations- und Informationszentrum (kiz)
> Universität Ulm
> Telefon: +49 (0)731 50-22478
> Telefax: +49 (0)731 50-22471
> 
> 
> 
> * Kilian Cavalotti <kilian.cavalotti.work at gmail.com> [190613 17:27]:
> > Hi Jürgen,
> > 
> > I would take a look at the various *KmemSpace options in
> > cgroups.conf, they can certainly help with this.
> > 
> > Cheers,
> > --
> > Kilian
> > 
> > On Thu, Jun 13, 2019 at 2:41 PM Juergen Salk
> > <juergen.salk at uni-ulm.de> wrote:
> > >
> > > Dear all,
> > >
> > > I'm just starting to get used to Slurm and play around with it in
> > > a small test environment within our old cluster.
> > >
> > > For our next system we will probably have to abandon our current
> > > exclusive user node access policy in favor of a shared user
> > > policy, i.e. jobs from different users will then run side by side
> > > on the same node at the same time. In order to prevent the jobs
> > > from interfering with each other, I have set both
> > > ConstrainCores=yes and ConstrainRAMSpace=yes in cgroups.conf,
> > > which works as expected for limiting the memory of the processes
> > > to the value requested at job submission (e.g. by --mem=...
> > > option).
> > >
> > > However, I've noticed that ConstrainRAMSpace=yes also caps the
> > > page cache, for which the Linux kernel normally exploits any
> > > unused areas of memory in a flexible way. This can have a
> > > significant performance impact, as we have quite a number of
> > > I/O-demanding applications (dominated by read operations) that
> > > are known to benefit a lot from page caching.
> > >
> > > Here comes a small example to illustrate this issue. The job
> > > writes a 16 GB file to a local scratch file system, measures the
> > > amount of data cached in memory and then reads the file previously
> > > written.
> > >
> > > $ cat job.slurm
> > > #!/bin/bash
> > > #SBATCH --partition=standard
> > > #SBATCH --nodes=1
> > > #SBATCH --ntasks-per-node=1
> > > #SBATCH --time=00:10:00
> > >
> > > # Get amount of data cached in memory before writing the file
> > > cached1=`awk '$1=="Cached:" {print $2}' /proc/meminfo`
> > >
> > > # Write 16 GB file to local scratch SSD
> > > dd if=/dev/zero of=$SCRATCH/testfile count=16 bs=1024M
> > >
> > > # Get amount of data cached in memory after writing the file
> > > cached2=`awk '$1=="Cached:" {print $2}' /proc/meminfo`
> > >
> > > # Print difference of data cached in memory
> > > echo -e "\nIncreased cached data by $(((cached2-cached1)/1000000)) GB\n"
> > >
> > > # Read the file previously written
> > > dd if=$SCRATCH/testfile of=/dev/null count=16 bs=1024M
> > > $
> > >
> > > For reference, this is the result *without* ConstrainRAMSpace=yes
> > > set in cgroups.conf and submitted with `sbatch --mem=2G
> > > --gres=scratch:16 job.slurm´
> > >
> > > --- snip ---
> > > 16+0 records in
> > > 16+0 records out
> > > 17179869184 bytes (17 GB) copied, 10.9839 s, 1.6 GB/s
> > >
> > > Increased cached data by 16 GB
> > >
> > > 16+0 records in
> > > 16+0 records out
> > > 17179869184 bytes (17 GB) copied, 5.03225 s, 3.4 GB/s
> > > --- snip ---
> > >
> > > Note that there is 16 GB of data cached and the read performance
> > > is 3.4 GB/s as the data is actually read from page cache.
> > >
> > > And this is the result *with* ConstrainRAMSpace=yes set in
> > > cgroups.conf and submitted with the very same command:
> > >
> > > --- snip ---
> > > 16+0 records in
> > > 16+0 records out
> > > 17179869184 bytes (17 GB) copied, 13.3163 s, 1.3 GB/s
> > >
> > > Increased cached data by 1 GB
> > >
> > > 16+0 records in
> > > 16+0 records out
> > > 17179869184 bytes (17 GB) copied, 11.1098 s, 1.5 GB/s
> > > --- snip ---
> > >
> > > Now only 1 GB of data has been cached (roughly the 2 GB requested
> > > for the job minus the 1 GB allocated for the dd buffer), and the
> > > read performance degrades to 1.5 GB/s (compared to 3.4 GB/s
> > > above).
> > >
> > > Finally, this is the result *with* ConstrainRAMSpace=yes set in
> > > cgroups.conf and the job submitted with `sbatch --mem=18G
> > > --gres=scratch:16 job.slurm´:
> > >
> > > --- snip ---
> > > 16+0 records in
> > > 16+0 records out
> > > 17179869184 bytes (17 GB) copied, 11.0601 s, 1.6 GB/s
> > >
> > > Increased cached data by 16 GB
> > >
> > > 16+0 records in
> > > 16+0 records out
> > > 17179869184 bytes (17 GB) copied, 5.01643 s, 3.4 GB/s
> > > --- snip ---
> > >
> > > This is almost the same result as in the unconstrained case (i.e.
> > > without ConstrainRAMSpace=yes set in cgroups.conf) as the amount
> > > of memory requested for the job (18 GB) is large enough to allow
> > > the file to be fully cached in memory.
> > >
> > > I do not think this is an issue with Slurm itself but rather with
> > > the way cgroups are supposed to work. However, I wonder how others
> > > cope with this.
> > >
> > > Maybe we have to teach our users to also consider page cache when
> > > requesting a certain amount of memory for their jobs?
> > >
> > > Any comment or idea would be highly appreciated.
> > >
> > > Thank you in advance.
> > >
> > > Best regards
> > > Jürgen
> > >
> > > --
> > > Jürgen Salk
> > > Scientific Software & Compute Services (SSCS)
> > > Kommunikations- und Informationszentrum (kiz)
> > > Universität Ulm
> > > Telefon: +49 (0)731 50-22478
> > > Telefax: +49 (0)731 50-22471
> > >
> > 


