[slurm-users] slurm, memory accounting and memory mapping

Sergey Koposov skoposov at cmu.edu
Thu Jan 31 20:22:09 UTC 2019


Hi, 

Thanks again for all the suggestions. 
It turns out that on our cluster we can't use cgroups because of the old kernel, 
but setting 
    JobAcctGatherParams=UsePSS
resolved the problem.

Regards,
         Sergey
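
For context on the setting above: in slurm.conf, JobAcctGatherParams sits alongside the accounting-gather plugin selection. A minimal sketch might look like the lines below; the jobacct_gather/linux line is an assumption about a typical non-cgroup setup, not something stated in the thread.

    JobAcctGatherType=jobacct_gather/linux   # assumed: non-cgroup gather plugin
    JobAcctGatherParams=UsePSS               # account PSS rather than per-process RSS

PSS splits each shared page among the processes that map it, so a large read-only mmap sums to roughly the file size across the whole job instead of a multiple of it.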

On Fri, 2019-01-11 at 10:37 +0200, Janne Blomqvist wrote:
> On 11/01/2019 08.29, Sergey Koposov wrote:
> > Hi,
> > 
> > I've recently migrated to slurm from pbs on our cluster. Because of that, job memory limits are now
> > strictly enforced, and that causes my code to get killed.
> > The trick is that my code uses memory mapping (i.e. mmap) of a single large file (~12 GB) in each thread on each node.
> > In the past, even though the file is mmap'ed (read-only) in, say, 16 threads, the actual memory footprint with this technique was still ~12 GB.
> > However, when I now do this under slurm, it thinks that each thread (or process) takes 12 GB and kills my processes.
> > 
> > Does anyone have a way around this problem, other than to stop using Memory as a consumable resource, or to fake that each node has more memory?
> > 
> > Here is an example slurm script that I'm running
> > #!/bin/bash
> > #SBATCH -N 1 # number of nodes
> > #SBATCH --cpus-per-task=10 # number of cores
> > #SBATCH --ntasks-per-node=1
> > #SBATCH --mem=125GB
> > #SBATCH --array=0-4
> > 
> > sh script1.sh $SLURM_ARRAY_TASK_ID 5
> > 
> > The script1 essentially starts python, which in turn creates 10 multiprocessing processes, each of which will mmap the large file.
> > ------
> > In this case I'm forced to limit myself to using only 10 threads, instead of 16 (our machines have 16 cores) to avoid being killed by slurm.
> > ---
> > Thanks in advance for any suggestions.
> >          
> >             Sergey
> > 
> 
> What is your memory limit configuration in slurm? Anyway, a few things to check:
> 
> - Make sure you're not limiting RLIMIT_AS in any way (e.g. run "ulimit -v" in your batch script and ensure it's unlimited; in the slurm config, ensure
> VSizeFactor=0).
> - Are you using task/cgroup for limiting memory? In that case the problem might be that cgroup memory limits work with RSS, and as you're running multiple
> processes the shared mmap'ed file will be counted multiple times. There's no really good way around this, but with, say, something like
> 
> ConstrainRAMSpace=no
> ConstrainSwapSpace=yes
> AllowedRAMSpace=100
> AllowedSwapSpace=1600
> you'll get a setup where the cgroup soft limit will be set to the amount your job allocates, but the hard limit (where the job will be killed) will be set to
> 1600% of that.
> - If you're using cgroups for memory limits, you should also set JobAcctGatherParams=NoOverMemoryKill
> - If you're NOT using cgroups for memory limits, try setting JobAcctGatherParams=UsePSS, which should avoid counting the shared mappings multiple times.
> 
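
For readers skimming the archive, the multiprocessing/mmap pattern Sergey describes looks roughly like the sketch below. This is not his actual code: the file name, worker count and the single-byte access are hypothetical stand-ins, since script1.sh and the data file are not shown in the thread.

    # Sketch of the pattern under discussion: N worker processes each mmap
    # the same large read-only file.  Names and sizes here are hypothetical.
    import mmap
    import multiprocessing

    DATA_FILE = "big_catalog.dat"   # stand-in for the ~12 GB mapped file
    N_WORKERS = 10

    def worker(idx):
        # Each process opens and maps the file read-only.  The physical pages
        # are shared between the processes, so summed PSS charges them roughly
        # once for the job, while summed per-process RSS charges them N times.
        with open(DATA_FILE, "rb") as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                return mm[idx]   # touch one byte just to fault a page in
            finally:
                mm.close()

    if __name__ == "__main__":
        with multiprocessing.Pool(N_WORKERS) as pool:
            print(pool.map(worker, range(N_WORKERS)))

Under per-process RSS accounting the ten workers above would appear to use roughly 10 x 12 GB for the one mapped file, which is what tripped the limit; summed PSS attributes those shared pages only about once per job, which is why JobAcctGatherParams=UsePSS resolves it.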

