[slurm-users] Jobs killed by OOM-killer only on certain nodes.
Chris Samuel
chris at csamuel.org
Thu Jul 2 20:23:48 UTC 2020
On Thursday, 2 July 2020 6:52:15 AM PDT Prentice Bisbal wrote:
> [2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill
> event count: 1
We get that line for pretty much every job; I don't think it reflects the OOM
killer being invoked on something in the extern step.
OOM killer invocations should be recorded in the kernel logs on the node.
Check with "dmesg -T" to see if it's being invoked (or check whether they are
being logged via syslog, in case they've been dropped from the ring buffer by
later messages).
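
If you'd rather script that check across nodes, something like the following
rough sketch might do. It assumes Python 3.7+ is available on the node and
that you have permission to read the kernel ring buffer. The function names
and the match pattern are my own illustration, not anything Slurm provides,
and the exact wording of the kernel's OOM messages varies between kernel
versions, so adjust the pattern to what your nodes actually log.

```python
import re
import subprocess

# Substrings the Linux kernel commonly emits around an OOM kill.
# Assumption: the exact wording differs between kernel versions.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|out of memory|killed process", re.IGNORECASE
)


def find_oom_lines(log_text):
    """Return the lines of log_text that look like OOM-killer messages."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]


def read_kernel_log():
    """Return the kernel ring buffer with readable timestamps (may need root)."""
    result = subprocess.run(
        ["dmesg", "-T"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    for line in find_oom_lines(read_kernel_log()):
        print(line)
```

No output means nothing matching is left in the ring buffer, which is not the
same as no OOM kill having happened, so the syslog files are still worth a
look. find_oom_lines() takes plain text, so you can feed it the contents of
one of those files as well.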
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA