[slurm-users] slurm, memory accounting and memory mapping

Janne Blomqvist janne.blomqvist at aalto.fi
Fri Jan 11 08:37:33 UTC 2019


On 11/01/2019 08.29, Sergey Koposov wrote:
> Hi,
>
> I've recently migrated to slurm from pbs on our cluster. Because of that, now the job memory limits are
> strictly enforced and that causes my code to get killed.
> The trick is that my code uses memory mapping (i.e. mmap) of a single large file (~12 GB) in each thread on each node.
> In the past, with this technique, even though the file is (read-only) mmap'ed in, say, 16 threads, the actual memory footprint was still ~12 GB.
> However, when I now do this in slurm, it thinks that each thread (or process) takes 12 GB and kills my processes.
>
> Does anyone have a way around this problem, other than no longer using Memory as a consumable resource, or faking that each node has more memory?
>
> Here is an example slurm script that I'm running
> #!/bin/bash
> #SBATCH -N 1 # number of nodes
> #SBATCH --cpus-per-task=10 # number of cores
> #SBATCH --ntasks-per-node=1
> #SBATCH --mem=125GB
> #SBATCH --array=0-4
>
> sh script1.sh $SLURM_ARRAY_TASK_ID 5
>
> script1 essentially starts python, which in turn creates 10 multiprocessing processes, each of which mmaps the large file.
> ------
> In this case I'm forced to limit myself to only 10 threads instead of 16 (our machines have 16 cores) to avoid being killed by slurm.
> ---
> Thanks in advance for any suggestions.
>          
>             Sergey
>
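
For context, the pattern described above is roughly the following; this is just a minimal sketch with a placeholder file name and worker count, not the actual script:

import mmap
import multiprocessing as mp

DATA_FILE = "catalog.dat"  # placeholder name for the ~12 GB file

def worker(offset):
    # Each process maps the same file read-only; the kernel backs all the
    # mappings with one shared copy in the page cache, so physical usage
    # stays around one file's worth. Per-process RSS, however, counts the
    # touched pages in every process, which is what the accounting sees.
    with open(DATA_FILE, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        try:
            return mm[offset]  # touch some data
        finally:
            mm.close()

if __name__ == "__main__":
    with mp.Pool(processes=10) as pool:
        pool.map(worker, range(10))
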
What is your memory limit configuration in slurm? Anyway, a few things to check:

- Make sure you're not limiting RLIMIT_AS in any way: run "ulimit -v" in your batch script and check that it reports "unlimited", and in the slurm config make sure VSizeFactor=0.
- Are you using task/cgroup for limiting memory? In that case the problem might be that cgroup memory limits work with RSS, and as you're running multiple processes the shared mmap'ed file will be counted multiple times. There's no really good way around this, but with, say, something like

ConstrainRAMSpace=no
ConstrainSwapSpace=yes
AllowedRAMSpace=100
AllowedSwapSpace=1600
you'll get a setup where the cgroup soft limit will be set to the amount your job allocates, but the hard limit (where the job will be killed) will be set to 1600% of that.
- If you're using cgroups for memory limits, you should also set JobAcctGatherParams=NoOverMemoryKill.
- If you're NOT using cgroups for memory limits, try setting JobAcctGatherParams=UsePSS, which should avoid counting the shared mappings multiple times (see the sketch after this list for where each of these settings lives).
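
To be concrete, here is a rough sketch of where the settings above live; the values are just the example figures from this thread, adjust for your site:

# cgroup.conf (if task/cgroup enforces memory limits)
ConstrainRAMSpace=no
ConstrainSwapSpace=yes
AllowedRAMSpace=100
AllowedSwapSpace=1600

# slurm.conf
VSizeFactor=0
# with cgroup memory limits:
JobAcctGatherParams=NoOverMemoryKill
# or, without cgroup memory limits:
# JobAcctGatherParams=UsePSS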

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqvist at aalto.fi



