[slurm-users] Spurious OOM-kills with cgroups on 20.11.8?
Sean Caron
scaron at umich.edu
Tue Aug 10 21:00:50 UTC 2021
Hi all,
Has anyone else observed jobs that ran fine under previous versions like
20.10 getting OOM-killed under 20.11.8 with cgroups enabled?
Since upgrading, maybe six weeks ago, I've had a few reports from users that
their jobs are getting OOM-killed even though they haven't changed anything
and the same jobs ran to completion in the past with the same memory
specification.
The most recent report I received today involved a job running a "cp"
command getting OOM-killed. I have a hard time believing "cp" uses very
much memory...
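In case it helps with diagnosis, for the next affected job I'm planning to
grab something along these lines (the job ID is just a placeholder):

dmesg -T | grep -i oom
sacct -j <jobid> --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,MaxVMSize

i.e. the kernel's OOM message on the node, plus what the accounting thinks
the peak RSS was versus the requested memory.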
These machines are running various 5.4.x or 5.3.x Linux kernels.
I've had really good luck with the cgroup OOM killer over the last few years
at keeping my nodes from getting overwhelmed by runaway jobs. I'd hate to
have to disable it just to clean up these weird issues.
My cgroup.conf file looks like the following:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRamSpace=100
AllowedSwapSpace=0
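(For what it's worth, the per-job limit Slurm applies shows up on the nodes
under the memory cgroup, roughly like the following, assuming cgroup v1 and
the default task/cgroup hierarchy; uid and job ID are placeholders:)

cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.max_usage_in_bytes
grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.stat

The memory.stat output is what I'd check while one of these jobs is running,
to see how much of the charged usage is page cache versus RSS.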
Should I maybe bump AllowedRamSpace? I don't see how that's any different
from just asking the user to re-run the job with a larger memory request.
And it doesn't explain why jobs would suddenly need more memory than they
used to before getting OOM-killed.
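(For concreteness, the change I'm debating would just be bumping that one
line to something like the following; 110 is an arbitrary illustrative
value, i.e. roughly 10% headroom over the requested memory:)

AllowedRamSpace=110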
Thanks,
Sean