[slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

Fri Dec 7 14:26:52 MST 2018

This may be only loosely relevant, but the symptoms look similar. Ours is not a scheduler environment: we have an interactive server where ps would hang on certain tasks (top -bn1 is a way around that, BTW, if it's hard to even find out what the process is). For us, the cause appeared to be a process using a lot of memory that khugepaged was attempting to manipulate.

https://access.redhat.com/solutions/46111
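
On the ps hang mentioned above: a minimal sketch of another way to find the stuck process without ps. It assumes the hang comes from reading /proc/<pid>/cmdline (as the blog post linked in the quoted message describes) and that reading /proc/<pid>/status does not block; that second assumption is untested on a wedged node. The function name list_d_state is made up for illustration.

```shell
# Print "PID NAME" for every process in uninterruptible sleep (state D),
# using only /proc/<pid>/status. The optional argument is the proc root
# (default /proc); it exists only so the function can be tried on a copy.
list_d_state() {
    proc_root="${1:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        # Name: only the first word is kept if the name contains spaces.
        # A process may exit between the glob and the read, so errors
        # from awk are ignored.
        awk '/^Name:/  {name = $2}
             /^State:/ {state = $2}
             /^Pid:/   {pid = $2}
             END       {if (state == "D") print pid, name}' \
            "$status" 2>/dev/null || :
    done
}
```

Running list_d_state with no argument scans the live /proc.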

I have never seen this happen on 7.x that I can recall. On our 6.x machine where we’ve seen it happen, all we did was this:

echo "madvise" > /sys/kernel/mm/redhat_transparent_hugepage/defrag

…in /etc/rc.local (which I hate, but I'm not sure where else that can go; maybe on the boot command line). This eliminated nearly all of our problems.
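
A slightly more defensive version of that one-liner, as a sketch only. It assumes the file uses the usual bracket notation for the active value (for example "[always] madvise never"); the function names thp_active_mode and thp_set_madvise are made up for illustration.

```shell
# Print the currently active mode from a THP-style sysfs file.
# The kernel marks the active value with square brackets,
# e.g. "always [madvise] never" -> "madvise".
thp_active_mode() {
    sed -n 's/.*\[\([a-z+]*\)].*/\1/p' "$1"
}

# Write "madvise" to the given file unless it is already the active mode.
# Returns non-zero if the file is missing or not writable.
thp_set_madvise() {
    [ -w "$1" ] || return 1
    [ "$(thp_active_mode "$1")" = "madvise" ] || echo madvise > "$1"
}
```

From rc.local this would be called as thp_set_madvise /sys/kernel/mm/redhat_transparent_hugepage/defrag. On kernels that use the upstream path, the file is /sys/kernel/mm/transparent_hugepage/defrag instead.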

No idea if that has anything to do with your situation.

> On Nov 29, 2018, at 1:27 PM, Christopher Benjamin Coffey <Chris.Coffey at nau.edu> wrote:
> 
> Hi,
> 
> We've been noticing an issue where nodes from time to time become "wedged", or unusable. This is a state where ps and w hang. We've been looking into this for a while when we get time and finally put some more effort into it yesterday. We came across this blog post, which describes almost the exact scenario:
> 
> https://rachelbythebay.com/w/2014/10/27/ps/
> 
> It has nothing to do with Slurm, but it does have to do with cgroups which we have enabled. It appears that processes that have hit their ceiling for memory and should be killed by oom-killer, and are in D state at the same time, cause the system to become wedged. For each node wedged, I've found a job out in:
> 
> /cgroup/memory/slurm/uid_3665/job_15363106/step_batch
> - memory.max_usage_in_bytes
> - memory.limit_in_bytes
> 
> The two files contain the same value, which I'd think would make the job a candidate for oom-killer. But memory.oom_control says:
> 
> oom_kill_disable 0
> under_oom 0
> 
> My feeling is that the process was in D state, an attempt was made to invoke the oom-killer, but the kill never happened and the system became wedged.
> 
> Has anyone run into this? If so, what's the fix? Apologies if this has been discussed before; I haven't noticed it on the group.
> 
> I wonder if it's a bug in the oom-killer? Maybe it's been patched in a more recent kernel, but looking at the kernels in the 6.10 series, it doesn't look like a newer one would have a patch for an oom-killer bug.
> 
> Our setup is:
> 
> Centos 6.10
> 2.6.32-642.6.2.el6.x86_64
> Slurm 17.11.12
> 
> And /etc/slurm/cgroup.conf
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> 
> Cheers,
> Chris
> 
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
> 
> 

--
____
|| \\UTGERS,  	 |---------------------------*O*---------------------------
||_// the State	 |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ	 | Office of Advanced Research Computing - MSB C630, Newark
     `'