<div dir="ltr">Hi Shawn,<div><br></div><div>I'm wondering if you're still seeing this. I've recently enabled task/cgroup on 17.11.5 running on CentOS 7 and just discovered that jobs are escaping their cgroups. For me this is resulting in a lot of jobs ending in OUT_OF_MEMORY that shouldn't, because it appears slurmd thinks the oom-killer has triggered when it hasn't. I'm not using GRES or devices, only:</div><div><br></div><div>cgroup.conf:</div><div><br></div><div><div>CgroupAutomount=yes</div><div>ConstrainCores=yes</div><div>ConstrainRAMSpace=yes</div><div>ConstrainSwapSpace=yes</div></div><div><br></div><div>slurm.conf:</div><div><br></div><div><div>JobAcctGatherType=jobacct_gather/cgroup</div><div>JobAcctGatherFrequency=task=15</div><div>ProctrackType=proctrack/cgroup</div><div>TaskPlugin=task/cgroup</div></div><div><br></div><div>The only thing that seems to maybe correspond are the log messages:</div><div><br></div><div><div>[JOB_ID.batch] debug: Handling REQUEST_STATE</div></div><div>debug: _fill_registration_msg: found apparently running job JOB_ID<br></div><div><br></div><div>Thanks,</div><div>--nate</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Apr 23, 2018 at 4:41 PM, Kevin Manalo <span dir="ltr"><<a href="mailto:kmanalo@jhu.edu" target="_blank">kmanalo@jhu.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div lang="EN-US" link="blue" vlink="purple">
<div class="m_8834471019499354164WordSection1">
<p class="MsoNormal">Shawn,<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">Just to give you a compare and contrast:<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">We have for related entries slurm.conf<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal"><span style="font-family:Courier">JobAcctGatherType=jobacct_<wbr>gather/linux # will migrate to cgroup eventually<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier">JobAcctGatherFrequency=30<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier">ProctrackType=proctrack/cgroup<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier">TaskPlugin=task/affinity,task/<wbr>cgroup<u></u><u></u></span></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">cgroup_allowed_devices_file.<wbr>conf:<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal"><span style="font-family:Courier">/dev/null<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier">/dev/urandom<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier">/dev/zero<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier">/dev/sda*<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier">/dev/cpu/*/*<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier">/dev/pts/*<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier">/dev/nvidia*<u></u><u></u></span></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">gres.conf (4 K80s on node with 24 core haswell):<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal"><span style="font-family:Courier">Name=gpu File=/dev/nvidia0 CPUs=0-5<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier">Name=gpu File=/dev/nvidia1 CPUs=12-17<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier">Name=gpu File=/dev/nvidia2 CPUs=6-11<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier">Name=gpu File=/dev/nvidia3 CPUs=18-23<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-family:Courier"><u></u> <u></u></span></p>
<p class="MsoNormal">I also looked for multi-tenant jobs on our MARCC cluster with jobs > 1 day and they are still inside of cgroups, but again this is on CentOS6 clusters.<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">Are you still seeing cgroup escapes now, specifically for jobs > 1 day?<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">Thanks,<u></u><u></u></p>
<p class="MsoNormal">Kevin<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<div style="border:none;border-top:solid #b5c4df 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">slurm-users <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.<wbr>schedmd.com</a>> on behalf of Shawn Bobbin <<a href="mailto:sabobbin@umiacs.umd.edu" target="_blank">sabobbin@umiacs.umd.edu</a>><span class=""><br>
<b>Reply-To: </b>Slurm User Community List <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a><wbr>><br>
</span><b>Date: </b>Monday, April 23, 2018 at 2:45 PM<span class=""><br>
<b>To: </b>Slurm User Community List <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a><wbr>><br>
</span><b>Subject: </b>Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.<u></u><u></u></span></p>
</div><div><div class="h5">
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<p class="MsoNormal">Hi, <u></u><u></u></p>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">I attached our cgroup.conf and gres.conf. <u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>

As for the cgroup_allowed_devices.conf file, I have it stubbed out but empty. In 17.02, Slurm started fine without this file (as far as I remember), and its being empty doesn't appear to actually impact anything… device availability remains the same. Based on the behavior explained in [0], I don't expect this file to affect containment of specific GPUs.
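For context, and only as a sketch of the knobs involved rather than the contents of the attached cgroup.conf, device containment under task/cgroup is governed by these cgroup.conf parameters, with AllowedDevicesFile pointing at the whitelist file discussed above:

    # cgroup.conf parameters relevant to device containment (sketch only):
    ConstrainDevices=yes
    AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf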

TaskPlugin = task/cgroup
ProctrackType = proctrack/cgroup
JobAcctGatherType = jobacct_gather/cgroup

[0] https://bugs.schedmd.com/show_bug.cgi?id=4122