<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Not 100% sure, which is why I'm asking here. I searched the log
      files, and that line appears after only a handful of jobs,
      including the ones I'm investigating, so it's not something that
      happens after (or to) every job. However, these nodes have plenty
      of RAM, so if the OOM killer is being invoked, something odd is
      definitely going on. <br>
    </p>
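    <p>For anyone who wants to repeat that log search, here is a minimal
      sketch. It assumes each oom-kill report sits on a single line in
      the same format as the line quoted below. The function name
      find_oom_events and the sample lines are only illustrative; they
      are not part of Slurm.</p>
    <pre>
```python
import re
from collections import defaultdict

# Matches log lines such as:
# [2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill event count: 1
# Groups: timestamp, job ID, step name, event count.
OOM_LINE = re.compile(
    r"^\[([^\]]+)\] \[(\d+)\.(\w+)\] _oom_event_monitor: oom-kill event count: (\d+)"
)


def find_oom_events(lines):
    """Return {job_id: [(timestamp, step, count), ...]} for every oom-kill line."""
    events = defaultdict(list)
    for line in lines:
        match = OOM_LINE.match(line.strip())
        if match:
            timestamp, job_id, step, count = match.groups()
            events[int(job_id)].append((timestamp, step, int(count)))
    return dict(events)


# Illustrative input: three lines in the format of the log excerpt in this thread.
sample_lines = [
    "[2020-07-01T16:19:19.389] [801777.batch] done with job",
    "[2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill event count: 1",
    "[2020-07-01T16:19:19.508] [801777.extern] done with job",
]

for job_id, hits in sorted(find_oom_events(sample_lines).items()):
    for timestamp, step, count in hits:
        print(job_id, step, timestamp, count)
```
    </pre>
    <p>An open file object can be passed instead of the sample list,
      since the function only iterates over lines. Jobs that never
      logged an oom-kill event simply do not appear in the result.</p>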
    <div class="moz-cite-prefix">On 7/2/20 10:20 AM, Ryan Novosielski
      wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:79D9B595-6DAB-4770-9718-820CD8DF1BCD@rutgers.edu">
      Are you sure that the OOM killer is involved? I can get you
      specifics later, but if it’s that one line about OOM events, you
      may see it after successful jobs too. I just had a SLURM bug where
      this came up. <br>
      <br>
      <div dir="ltr"><span style="background-color: rgba(255, 255, 255,
          0);">--<br>
          ____<br>
          || \\UTGERS,      
          |---------------------------*O*---------------------------<br>
          ||_// the State     |         Ryan Novosielski - <a
            href="mailto:novosirj@rutgers.edu" dir="ltr"
            x-apple-data-detectors="true"
            x-apple-data-detectors-type="link"
            x-apple-data-detectors-result="1" moz-do-not-send="true">novosirj@rutgers.edu</a><br>
          || \\ University | Sr. Technologist - 973/972.0922 (2x0922)
          ~*~ RBHS Campus<br>
          ||  \\    of NJ     | Office of Advanced Research Computing -
          MSB C630, Newark<br>
              `'</span></div>
      <div dir="ltr"><br>
        <blockquote type="cite">On Jul 2, 2020, at 09:53, Prentice
          Bisbal <a class="moz-txt-link-rfc2396E" href="mailto:pbisbal@pppl.gov">&lt;pbisbal@pppl.gov&gt;</a> wrote:<br>
          <br>
        </blockquote>
      </div>
      <blockquote type="cite">
        <div dir="ltr"><span>I maintain a very heterogeneous cluster
            (different processors, different amounts of RAM, etc.), and
            a user is reporting the following problem.</span><br>
          <span></span><br>
          <span>He's running the same job multiple times with different
            input parameters. The jobs run fine unless they land on
            specific nodes. He's specifying --mem=2G in his sbatch
            files. On the nodes where the jobs fail, I see that the OOM
            killer is invoked, so I asked him to request more RAM. He
            set --mem=4G, and the jobs still fail on these two nodes.
            However, they run just fine on other nodes with
            --mem=2G.</span><br>
          <span></span><br>
          <span>When I look at the Slurm log file on those nodes, I see
            something like this for a failing job (in this case,
            --mem=4G was set):</span><br>
          <span></span><br>
          <span>[2020-07-01T16:19:06.222] _run_prolog: prolog with lock
            for job 801777 ran for 0 seconds</span><br>
          <span>[2020-07-01T16:19:06.479] [801777.extern] task/cgroup:
            /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB
            memsw.limit=unlimited</span><br>
          <span>[2020-07-01T16:19:06.483] [801777.extern] task/cgroup:
            /slurm/uid_40324/job_801777/step_extern: alloc=4096MB
            mem.limit=4096MB memsw.limit=unlimited</span><br>
          <span>[2020-07-01T16:19:06.506] Launching batch job 801777 for
            UID 40324</span><br>
          <span>[2020-07-01T16:19:06.621] [801777.batch] task/cgroup:
            /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB
            memsw.limit=unlimited</span><br>
          <span>[2020-07-01T16:19:06.623] [801777.batch] task/cgroup:
            /slurm/uid_40324/job_801777/step_batch: alloc=4096MB
            mem.limit=4096MB memsw.limit=unlimited</span><br>
          <span>[2020-07-01T16:19:19.385] [801777.batch] sending
            REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0</span><br>
          <span>[2020-07-01T16:19:19.389] [801777.batch] done with job</span><br>
          <span>[2020-07-01T16:19:19.463] [801777.extern]
            _oom_event_monitor: oom-kill event count: 1</span><br>
          <span>[2020-07-01T16:19:19.508] [801777.extern] done with job</span><br>
          <span></span><br>
          <span>Any ideas why the jobs are failing on just these two
            nodes, while they run just fine on many other nodes?</span><br>
          <span></span><br>
          <span>For now, the user is excluding these two nodes using the
            -x option to sbatch, but I'd really like to understand
            what's going on here.</span><br>
          <span></span><br>
          <span>-- </span><br>
          <span></span><br>
          <span>Prentice</span><br>
          <span></span><br>
          <span></span><br>
        </div>
      </blockquote>
    </blockquote>
    <pre class="moz-signature" cols="72">-- 
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
<a class="moz-txt-link-freetext" href="http://www.pppl.gov">http://www.pppl.gov</a></pre>
  </body>
</html>