Hi David,

yes, I see these messages as well. I also think this is most likely a
spurious message. If a job has been cancelled by the OOM killer, you can
see this with sacct, e.g.

$> sacct -j 10816098
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10816098       VASP_MPI       c18m    default         12 OUT_OF_ME+    0:125
10816098.ba+      batch               default         12 OUT_OF_ME+    0:125
10816098.ex+     extern               default         12  COMPLETED      0:0
10816098.0     vasp_mpi               default         12 OUT_OF_ME+    0:125
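
By the way, sacct can also show how much memory a job actually used
compared with what it requested; the exact fields depend on your
accounting setup, but something along these lines

$> sacct -j 10816098 --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS

should make it obvious whether a job really ran into its memory limit.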

Best
Marcus
<div class="moz-cite-prefix">On 11/7/19 5:36 PM, David Baker wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CWXP265MB0376B6560CB7E9537CAF19E3FE780@CWXP265MB0376.GBRP265.PROD.OUTLOOK.COM">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
Hello, </div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">

We are dealing with a weird issue on our shared nodes where jobs appear
to be stalling for some reason. I was advised that this issue might be
related to the oom-killer process. We do see a lot of these events. In
fact, when I started to take a closer look this afternoon, I noticed that
all jobs on all nodes (not just the shared nodes) are "firing" the
oom-killer for some reason when they finish.
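
(As a cross-check: a genuine kernel OOM kill also leaves a trace in the
node's kernel log, so, assuming you can read the kernel log on an affected
node, something like

$> dmesg -T | grep -i "out of memory"
$> journalctl -k | grep -i oom

should show whether the kernel really killed anything around the time a
job finished. If nothing shows up there, the slurmd.log entries are
presumably just the event counter firing.)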
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
As a demo I launched a very simple (low memory usage) test jobs
on a shared node and then after a few minutes cancelled it to
show the behaviour. Looking in the slurmd.log -- see below -- we
see the oom-killer being fired for no good reason. This "feels"
vaguely similar to this bug -- <a
href="https://bugs.schedmd.com/show_bug.cgi?id=5121"
moz-do-not-send="true">https://bugs.schedmd.com/show_bug.cgi?id=5121</a> which
I understand was patched back in SLURM v17 (we are using v18*). </div>
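
(For reference, the exact patch level running on a node can be confirmed
with the usual version flags, e.g.

$> scontrol --version
$> slurmd -V

which makes it easier to compare against the fix version mentioned in the
bug report.)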
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
Has anyone else seen this behaviour? Or more to the point does
anyone understand this behaviour and know how to squash it,
please?</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
Best regards,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
David</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">

[2019-11-07T16:14:52.551] Launching batch job 164978 for UID 57337
[2019-11-07T16:14:52.559] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.560] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.960] [164977.batch] task_p_pre_launch: Using sched_affinity for tasks
[2019-11-07T16:14:52.960] [164978.batch] task_p_pre_launch: Using sched_affinity for tasks
[2019-11-07T16:16:05.859] [164977.batch] error: *** JOB 164977 ON gold57 CANCELLED AT 2019-11-07T16:16:05 ***
[2019-11-07T16:16:05.882] [164977.extern] _oom_event_monitor: oom-kill event count: 1
[2019-11-07T16:16:05.886] [164977.extern] done with job
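
(A related check, assuming cgroup v1 with the memory controller mounted at
the usual /sys/fs/cgroup/memory: while a job is still running, the cgroup
path from the log above can be inspected directly to see whether its
memory limit was ever actually hit, e.g.

$> cat /sys/fs/cgroup/memory/slurm/uid_57337/job_164977/memory.failcnt
$> cat /sys/fs/cgroup/memory/slurm/uid_57337/job_164977/memory.max_usage_in_bytes

A failcnt of 0 and a high-water mark well below mem.limit would support
the idea that the oom-kill event reported at job end is spurious.)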
<pre class="moz-signature" cols="72">--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a class="moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de">wagner@itc.rwth-aachen.de</a>
<a class="moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de">www.itc.rwth-aachen.de</a>
</pre>