[slurm-users] oom-kill events for no good reason
Marcus Wagner
wagner at itc.rwth-aachen.de
Fri Nov 8 13:00:54 UTC 2019
Hi David,
yes, I see these messages as well, and I also think this is most likely a
spurious message. If a job really has been killed by the OOM killer, you
can see that with sacct, e.g.
$> sacct -j 10816098
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10816098       VASP_MPI       c18m    default         12 OUT_OF_ME+    0:125
10816098.ba+      batch               default         12 OUT_OF_ME+    0:125
10816098.ex+     extern               default         12  COMPLETED      0:0
10816098.0     vasp_mpi               default         12 OUT_OF_ME+    0:125
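To double-check whether a job really ran into its memory limit, it can also
help to pull the memory fields out of the accounting data, for example
(ReqMem, MaxRSS and MaxVMSize are standard sacct format fields):

$> sacct -j 10816098 --format=JobID,State,ExitCode,ReqMem,MaxRSS,MaxVMSize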
Best
Marcus
On 11/7/19 5:36 PM, David Baker wrote:
> Hello,
>
> We are dealing with a weird issue on our shared nodes where jobs
> appear to be stalling for some reason. I was advised that this issue
> might be related to the oom-killer process. We do see a lot of these
> events. In fact when I started to take a closer look this afternoon I
> noticed that all jobs on all nodes (not just the shared nodes) are
> "firing" the oom-killer for some reason when they finish.
>
> As a demo I launched a very simple (low memory usage) test job on a
> shared node and then after a few minutes cancelled it to show the
> behaviour. Looking in the slurmd.log -- see below -- we see the
> oom-killer being fired for no good reason. This "feels" vaguely
> similar to this bug --
> https://bugs.schedmd.com/show_bug.cgi?id=5121 which I understand was
> patched back in SLURM v17 (we are using v18*).
>
> Has anyone else seen this behaviour? Or more to the point does anyone
> understand this behaviour and know how to squash it, please?
>
> Best regards,
> David
>
> [2019-11-07T16:14:52.551] Launching batch job 164978 for UID 57337
> [2019-11-07T16:14:52.559] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
> [2019-11-07T16:14:52.560] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
> [2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
> [2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
> [2019-11-07T16:14:52.960] [164977.batch] task_p_pre_launch: Using sched_affinity for tasks
> [2019-11-07T16:14:52.960] [164978.batch] task_p_pre_launch: Using sched_affinity for tasks
> [2019-11-07T16:16:05.859] [164977.batch] error: *** JOB 164977 ON gold57 CANCELLED AT 2019-11-07T16:16:05 ***
> [2019-11-07T16:16:05.882] [164977.extern] _oom_event_monitor: oom-kill event count: 1
> [2019-11-07T16:16:05.886] [164977.extern] done with job
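
By the way, the oom-kill event in your log is reported by the extern step's
_oom_event_monitor, not by the batch step that ran your work. If you want to
verify on the node whether the job's memory cgroup ever really hit its limit,
you can look at the cgroup counters while the job is running, roughly like
this (this assumes cgroup v1 and the default mountpoint under /sys/fs/cgroup,
with the uid/job id taken from your log):

$> cat /sys/fs/cgroup/memory/slurm/uid_57337/job_164977/memory.failcnt
$> cat /sys/fs/cgroup/memory/slurm/uid_57337/job_164977/memory.oom_control

memory.failcnt counts how often the limit was reached, and memory.oom_control
shows under_oom (and, on newer kernels, an oom_kill count).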
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de