[slurm-users] Jobs killed by OOM-killer only on certain nodes.

Chris Samuel chris at csamuel.org
Thu Jul 2 20:23:48 UTC 2020


On Thursday, 2 July 2020 6:52:15 AM PDT Prentice Bisbal wrote:

> [2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill
> event count: 1

We get that line for pretty much every job, so I don't think it reflects the 
OOM killer being invoked on something in the extern step.

OOM killer invocations should be recorded in the kernel logs on the node. 
Check with "dmesg -T" to see if it's being invoked (or check whether they 
were logged via syslog, in case they've been dropped from the ring buffer by 
later messages).
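
As a rough sketch of that check, something like the following could be run on
the affected node. The function names (oom_lines, check_node_oom), the search
pattern, and the syslog file paths are assumptions for illustration; none of
them come from Slurm, and the paths vary by distribution.

```shell
# Keep only lines from stdin that usually accompany an OOM kill.
# The pattern is an assumption based on common kernel message wording.
oom_lines() {
    grep -Ei 'invoked oom-killer|out of memory|killed process'
}

# Look for OOM killer activity in the places the kernel log can end up.
check_node_oom() {
    # Kernel ring buffer with readable timestamps (may need root).
    dmesg -T | oom_lines

    # Persistent kernel messages, assuming the node uses systemd-journald.
    if command -v journalctl >/dev/null 2>&1; then
        journalctl -k --no-pager | oom_lines
    fi

    # Traditional syslog files (hypothetical locations, distro dependent).
    for f in /var/log/messages /var/log/kern.log /var/log/syslog; do
        if [ -r "$f" ]; then
            oom_lines < "$f"
        fi
    done
    return 0
}
```

Calling check_node_oom with no output from any of the three sources would
suggest the kernel OOM killer was not invoked on that node, at least within
the period the logs still cover.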

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA





