Hi David,

yes, I see these messages as well. I also think this is most likely a
spurious message. If a job has been cancelled by the OOM killer, you can
see this with sacct, e.g.

$> sacct -j 10816098
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10816098       VASP_MPI       c18m    default         12 OUT_OF_ME+    0:125
10816098.ba+      batch               default         12 OUT_OF_ME+    0:125
10816098.ex+     extern               default         12  COMPLETED      0:0
10816098.0     vasp_mpi               default         12 OUT_OF_ME+    0:125
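
By the way, sacct can also show how much memory a job actually used
compared with what it requested; the exact fields depend on your
accounting setup, but something along these lines

$> sacct -j 10816098 --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS

should make it obvious whether a job really ran into its memory limit.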

Best
Marcus
<div class="moz-cite-prefix">On 11/7/19 5:36 PM, David Baker wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CWXP265MB0376B6560CB7E9537CAF19E3FE780@CWXP265MB0376.GBRP265.PROD.OUTLOOK.COM">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
Hello, </div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">

We are dealing with a weird issue on our shared nodes where jobs appear
to be stalling for some reason. I was advised that this issue might be
related to the oom-killer process. We do see a lot of these events. In
fact, when I started to take a closer look this afternoon, I noticed that
all jobs on all nodes (not just the shared nodes) are "firing" the
oom-killer for some reason when they finish.
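
(As a cross-check: a genuine kernel OOM kill also leaves a trace in the
node's kernel log, so, assuming you can read the kernel log on an affected
node, something like

$> dmesg -T | grep -i "out of memory"
$> journalctl -k | grep -i oom

should show whether the kernel really killed anything around the time a
job finished. If nothing shows up there, the slurmd.log entries are
presumably just the event counter firing.)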
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
As a demo I launched a very simple (low memory usage) test jobs
on a shared node and then after a few minutes cancelled it to
show the behaviour. Looking in the slurmd.log -- see below -- we
see the oom-killer being fired for no good reason. This "feels"
vaguely similar to this bug -- <a
href="https://bugs.schedmd.com/show_bug.cgi?id=5121"
moz-do-not-send="true">https://bugs.schedmd.com/show_bug.cgi?id=5121</a> which
I understand was patched back in SLURM v17 (we are using v18*). </div>
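
(For reference, the exact patch level running on a node can be confirmed
with the usual version flags, e.g.

$> scontrol --version
$> slurmd -V

which makes it easier to compare against the fix version mentioned in the
bug report.)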
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
Has anyone else seen this behaviour? Or more to the point does
anyone understand this behaviour and know how to squash it,
please?</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
Best regards,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
David</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">

[2019-11-07T16:14:52.551] Launching batch job 164978 for UID 57337
[2019-11-07T16:14:52.559] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.560] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.960] [164977.batch] task_p_pre_launch: Using sched_affinity for tasks
[2019-11-07T16:14:52.960] [164978.batch] task_p_pre_launch: Using sched_affinity for tasks
[2019-11-07T16:16:05.859] [164977.batch] error: *** JOB 164977 ON gold57 CANCELLED AT 2019-11-07T16:16:05 ***
[2019-11-07T16:16:05.882] [164977.extern] _oom_event_monitor: oom-kill event count: 1
[2019-11-07T16:16:05.886] [164977.extern] done with job
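
(A related check, assuming cgroup v1 with the memory controller mounted at
the usual /sys/fs/cgroup/memory: while a job is still running, the cgroup
path from the log above can be inspected directly to see whether its
memory limit was ever actually hit, e.g.

$> cat /sys/fs/cgroup/memory/slurm/uid_57337/job_164977/memory.failcnt
$> cat /sys/fs/cgroup/memory/slurm/uid_57337/job_164977/memory.max_usage_in_bytes

A failcnt of 0 and a high-water mark well below mem.limit would support
the idea that the oom-kill event reported at job end is spurious.)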
<pre class="moz-signature" cols="72">--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a class="moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de">wagner@itc.rwth-aachen.de</a>
<a class="moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de">www.itc.rwth-aachen.de</a>
</pre>