[slurm-users] oom-kill events for no good reason

Christopher Samuel chris at csamuel.org
Thu Nov 7 18:14:17 UTC 2019


On 11/7/19 8:36 AM, David Baker wrote:

> We are dealing with a weird issue on our shared nodes where jobs 
> appear to be stalling for some reason. I was advised that this issue 
> might be related to the oom-killer process. We do see a lot of these 
> events. In fact, when I started to take a closer look this afternoon I 
> noticed that all jobs on all nodes (not just the shared nodes) are 
> "firing" the oom-killer for some reason when they finish.

You should see the reason the OOM killer fired in "dmesg".
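For example, on the affected node something along these lines should show 
any genuine OOM killer activity (the exact message wording varies a bit 
between kernel versions):

    # human-readable timestamps; match the usual OOM killer messages
    dmesg -T | grep -i -E 'out of memory|oom-kill'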

Do note, though, that it's not the main job step that's reporting that; 
it's the extern step.
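One way to see which step is being flagged is sacct; with a placeholder 
job ID of 12345, an OUT_OF_MEMORY state will show up against the individual 
steps (e.g. the 12345.extern line) rather than the job as a whole:

    # 12345 is a placeholder job ID - substitute your own
    sacct -j 12345 --format=JobID,JobName,State,ExitCode,MaxRSS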

If there's nothing there about the OOM killer, then the message you see 
is likely spurious. From memory, Slurm watches a file descriptor for 
notifications of OOM killer events, so its OOM event count should only 
increment when the kernel actually reports something on it.
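If you want to check what the kernel itself has recorded for a job's 
memory cgroup, something like the following should work, assuming cgroup 
v1 and memory enforcement via the task/cgroup plugin; the path is 
hypothetical and depends on your uid and job IDs, and on reasonably 
recent kernels memory.oom_control also carries an oom_kill counter:

    # hypothetical path - substitute your own uid and job ID
    cat /sys/fs/cgroup/memory/slurm/uid_1234/job_12345/memory.oom_control
    # typical fields: oom_kill_disable, under_oom, oom_kill (newer kernels)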

We're seeing something similar here, but only for the external step 
(which seems to be what you're seeing too).

All the best,
Chris
-- 
   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


