[slurm-users] Spurious OOM-kills with cgroups on 20.11.8?

Sean Caron scaron at umich.edu
Tue Aug 10 21:00:50 UTC 2021


Hi all,

Has anyone else observed jobs that ran fine on previous versions (like 20.10)
getting OOM-killed under cgroups on 20.11.8?

I've had a few reports from users since upgrading maybe six weeks ago that
their jobs are getting OOM-killed even though they haven't changed anything and
the same job ran to completion in the past with the same memory specification.

The most recent report I received today involved a job running a "cp"
command getting OOM-killed. I have a hard time believing "cp" uses very
much memory...

These machines are running various 5.4.x or 5.3.x Linux kernels.

I've had really good luck with the cgroups OOM-killer over the last few years
at keeping my nodes from getting overwhelmed by runaway jobs. I'd hate to have
to disable it just to clean up these weird issues.

My cgroup.conf file looks like the following:

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRamSpace=100
AllowedSwapSpace=0
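
In case it helps anyone reproduce this, here's a rough sketch of how one could
check, on a node, what limit a job's memory cgroup actually got and whether the
kernel ever hit it. It assumes the cgroup v1 layout slurmd creates under
/sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>; the job ID is a placeholder.

#!/usr/bin/env python3
# Sketch: dump the memory cgroup numbers for a given job on this node.
# Assumes cgroup v1 and Slurm's /sys/fs/cgroup/memory/slurm/uid_*/job_* layout.
import glob
import os
import sys

def dump_job_memcg(jobid):
    for jobdir in glob.glob(f"/sys/fs/cgroup/memory/slurm/uid_*/job_{jobid}"):
        print(jobdir)
        for name in ("memory.limit_in_bytes",        # RAM limit set by slurmd
                     "memory.memsw.limit_in_bytes",  # RAM+swap limit (if present)
                     "memory.max_usage_in_bytes",    # high-water mark
                     "memory.failcnt"):              # times the limit was hit
            path = os.path.join(jobdir, name)
            if os.path.exists(path):
                with open(path) as f:
                    print(f"  {name}: {f.read().strip()}")

if __name__ == "__main__":
    dump_job_memcg(sys.argv[1])   # e.g.: dump_memcg.py 123456

memory.failcnt and memory.max_usage_in_bytes should show whether the limit was
actually reached and how close the job came to it.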

Should I maybe bump AllowedRamSpace? I don't see how that's any different from
just asking the user to re-run the job with a larger memory allocation request.
And it doesn't explain why jobs suddenly need more memory than they used to
before getting OOM-killed.
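
To make the arithmetic concrete (the 4 GiB request below is just an example,
and the percentage handling is my reading of the cgroup.conf man page):

# Example only: how AllowedRamSpace scales the per-job memory cgroup limit
# relative to the job's memory request. The 4 GiB request is hypothetical.
def ram_limit_gib(requested_gib, allowed_ram_pct):
    # Limit is AllowedRamSpace percent of the memory the job asked for.
    return requested_gib * allowed_ram_pct / 100

print(ram_limit_gib(4, 100))   # 4.0 -> limit equals the request (current setting)
print(ram_limit_gib(4, 110))   # 4.4 -> 10% headroom without touching the job script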

Thanks,

Sean