[slurm-users] Jobs escaping cgroup device controls after some amount of time.

Mon Apr 30 15:47:29 MDT 2018

> On 30 Apr 2018, at 22:37, Nate Coraor <nate at bx.psu.edu> wrote:
> 
> Hi Shawn,
> 
> I'm wondering if you're still seeing this. I've recently enabled task/cgroup on 17.11.5 running on CentOS 7 and just discovered that jobs are escaping their cgroups. For me this is resulting in a lot of jobs ending in OUT_OF_MEMORY that shouldn't, because it appears slurmd thinks the oom-killer has triggered when it hasn't. I'm not using GRES or devices, only:

I am not sure that you are drawing the correct conclusion here.

There is a known cgroups issue, due to

https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

Relevant part:

The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh[2]
in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
RSS controller. At OLS, at the resource management BoF, everyone suggested
that we handle both page cache and RSS together. Another request was raised
to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].

Are the jobs killed prematurely? If not, then you ran into the above.
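
One way to check is to look at the memory controller's own counters for the job's cgroup. Below is a minimal sketch, not anything taken from Slurm: it reads memory.limit_in_bytes, memory.max_usage_in_bytes and memory.failcnt, which are files documented in the cgroup-v1 memory.txt linked above. The function names are hypothetical, and the directory you pass in is an assumption about where your system mounts the job's memory cgroup. Note that a nonzero memory.failcnt only says the limit was hit; it does not by itself mean the oom-killer fired, since the kernel may have reclaimed page cache instead.

```python
import os
import sys

# cgroup v1 memory controller files, as named in cgroup-v1/memory.txt.
COUNTER_FILES = (
    "memory.limit_in_bytes",
    "memory.max_usage_in_bytes",
    "memory.failcnt",
)


def read_cgroup_memory_counters(cgroup_dir):
    """Return the integer value of each counter file found in cgroup_dir.

    cgroup_dir is the job's memory cgroup directory; its location is
    site-specific and is an assumption of this sketch.
    """
    counters = {}
    for name in COUNTER_FILES:
        with open(os.path.join(cgroup_dir, name)) as fh:
            counters[name] = int(fh.read().strip())
    return counters


def limit_was_hit(counters):
    """True if usage reached the limit at least once (failcnt > 0).

    This is not proof of an OOM kill: reclaim may have succeeded.
    """
    return counters["memory.failcnt"] > 0


if __name__ == "__main__" and len(sys.argv) > 1:
    values = read_cgroup_memory_counters(sys.argv[1])
    for key in COUNTER_FILES:
        print(key, values[key])
    print("limit hit at least once:", limit_was_hit(values))
```

If max usage stays well under the limit and failcnt is 0 while the job is still reported as OUT_OF_MEMORY, that would point away from the kernel actually having hit the limit.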

Kind regards.
— Andy