[slurm-users] Spurious OOM-kills with cgroups on 20.11.8?

Sean Caron scaron at umich.edu
Tue Aug 10 22:58:16 UTC 2021


Hi Roger,

Thanks for the response. I am pretty sure the job is actually getting
killed. I don't see it running in the process table, and the local Slurm log
just shows:

[2021-08-10T16:31:36.139] [6628753.batch] error: Detected 1 oom-kill
event(s) in StepId=6628753.batch cgroup. Some of your processes may have
been killed by the cgroup out-of-memory handler.
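
As a cross-check (assuming the accounting database has the record), something
along these lines should show whether the batch step actually finished or was
killed:

sacct -j 6628753 --format=JobID,State,ExitCode,MaxRSS,ReqMem

A State of OUT_OF_MEMORY rather than COMPLETED for the batch step would confirm
the kill, and comparing MaxRSS against ReqMem should show how close the step
came to its limit.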

Best,

Sean


On Tue, Aug 10, 2021 at 5:13 PM Roger Moye <rmoye at quantlab.com> wrote:

> Do you know if the job is actually being killed? We had an issue on an
> older version of Slurm where we got OOM errors but the tasks actually
> completed. The OOM event came when the job exited and was a false error.
>
>
>
> Also, there are several bug reports open right now about an issue similar
> to what you have described.   You can go to bugs.schedmd.com to look at
> those bug reports.
>
>
>
> -Roger
>
>
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Caron
> Sent: Tuesday, August 10, 2021 4:01 PM
> To: Slurm User Community List <slurm-users at lists.schedmd.com>; Sean Caron <scaron at umich.edu>
> Subject: [slurm-users] Spurious OOM-kills with cgroups on 20.11.8?
>
>
>
> Hi all,
>
>
>
> Has anyone else observed jobs getting OOM-killed under 20.11.8 with cgroups
> when the same jobs ran fine in previous versions like 20.10?
>
>
>
> Since we upgraded maybe six weeks ago, I've had a few reports from users that
> their jobs are getting OOM-killed even though they haven't changed anything
> and the same jobs ran to completion in the past with the same memory
> specification.
>
>
>
> The most recent report I received today involved a job running a "cp"
> command getting OOM-killed. I have a hard time believing "cp" uses very
> much memory...
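>
> (For what it's worth, here is a sketch of how I plan to inspect the limit the
> cgroup plugin actually sets for a step on the node, assuming cgroup v1 and the
> usual slurm/uid_*/job_*/step_* hierarchy; exact paths may differ on your
> systems. <uid> and <jobid> are placeholders.)
>
> # hard limit the OOM killer enforces for the batch step
> cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/step_batch/memory.limit_in_bytes
> # peak usage actually charged to that step
> cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/step_batch/memory.max_usage_in_bytes
>
> If the limit matches what the user requested, the plugin is applying the
> configuration as expected, and the peak-usage number should show what actually
> hit the ceiling.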
>
>
>
> These machines are running various 5.4.x or 5.3.x Linux kernels.
>
>
>
> I've had really good luck with the cgroups OOM-killer over the last few years
> keeping my nodes from getting overwhelmed by runaway jobs. I'd hate to have
> to disable it just to clean up these weird issues.
>
>
>
> My cgroup.conf file looks like the following:
>
>
>
> CgroupAutomount=yes
>
> ConstrainCores=yes
>
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
>
> AllowedRamSpace=100
> AllowedSwapSpace=0
>
>
>
> Should I maybe bump AllowedRamSpace? I don't see how that is any different
> from just asking the user to re-run the job with a larger memory allocation
> request, and it doesn't explain why jobs suddenly need more memory than they
> used to before getting OOM-killed.
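>
> (For concreteness, my understanding of the arithmetic, using a hypothetical
> job submitted with --mem=8G as an example:)
>
>     cgroup limit = requested memory x AllowedRamSpace / 100
>     AllowedRamSpace=100  ->  8 GB limit (the current setting)
>     AllowedRamSpace=110  ->  8.8 GB limit, i.e. the same effect as asking the
>                              user to request about 10% more memory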
>
>
>
> Thanks,
>
>
>
> Sean