<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Not 100% sure, which is why I'm asking here. I searched the log files
and that line was only present after a handful of jobs, including
the ones I'm investigating, so it's not something happening
after/to every job. However, this is happening on nodes with
plenty of RAM, so if the OOM Killer is being invoked, something
odd is definitely going on. <br>
</p>
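<p>For anyone who wants to repeat that search, here is a minimal
sketch of how I'd pull the affected job IDs out of a slurmd log.
It assumes each log entry sits on a single line in the format
shown in my original message below; the function name
find_oom_events and the sample list are only for illustration,
not part of Slurm.</p>

```python
import re

# Matches slurmd log lines such as:
# [2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill event count: 1
# Groups: timestamp, job ID, step name, event count.
OOM_LINE = re.compile(
    r"\[([^\]]+)\] \[(\d+)\.(\w+)\] _oom_event_monitor: oom-kill event count: (\d+)"
)


def find_oom_events(lines):
    """Return (timestamp, job_id, step, count) for every oom-kill line."""
    events = []
    for line in lines:
        match = OOM_LINE.search(line)
        if match:
            timestamp, job_id, step, count = match.groups()
            events.append((timestamp, int(job_id), step, int(count)))
    return events


# Sample lines copied from the log excerpt in this thread.
sample = [
    "[2020-07-01T16:19:19.389] [801777.batch] done with job",
    "[2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill event count: 1",
    "[2020-07-01T16:19:19.508] [801777.extern] done with job",
]

for event in find_oom_events(sample):
    print(event)
```

<p>The function accepts any iterable of lines, so an open file
handle for the node's slurmd log should work as well as a list.
The log location varies by site, so I've left the path out.</p>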
<div class="moz-cite-prefix">On 7/2/20 10:20 AM, Ryan Novosielski
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:79D9B595-6DAB-4770-9718-820CD8DF1BCD@rutgers.edu">
Are you sure that the OOM killer is involved? I can get you
specifics later, but if it’s that one line about OOM events, you
may see it after successful jobs too. I just had a SLURM bug where
this came up. <br>
<br>
<div dir="ltr"><span style="background-color: rgba(255, 255, 255,
0);">--<br>
____<br>
|| \\UTGERS,
|---------------------------*O*---------------------------<br>
||_// the State | Ryan Novosielski - <a
href="mailto:novosirj@rutgers.edu" dir="ltr"
x-apple-data-detectors="true"
x-apple-data-detectors-type="link"
x-apple-data-detectors-result="1" moz-do-not-send="true">novosirj@rutgers.edu</a><br>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922)
~*~ RBHS Campus<br>
|| \\ of NJ | Office of Advanced Research Computing -
MSB C630, Newark<br>
`'</span></div>
<div dir="ltr"><br>
<blockquote type="cite">On Jul 2, 2020, at 09:53, Prentice
Bisbal <a class="moz-txt-link-rfc2396E" href="mailto:pbisbal@pppl.gov">&lt;pbisbal@pppl.gov&gt;</a> wrote:<br>
<br>
</blockquote>
</div>
<blockquote type="cite">
<div dir="ltr"><span>I maintain a very heterogeneous cluster
(different processors, different amounts of RAM, etc.). I
have a user reporting the following problem.</span><br>
<span></span><br>
<span>He's running the same job multiple times with different
input parameters. The jobs run fine unless they land on
specific nodes. He's specifying --mem=2G in his sbatch
files. On the nodes where the jobs fail, I see that the OOM
killer is invoked, so I asked him to specify more RAM, so he
did. He set --mem=4G, and still the jobs fail on these 2
nodes. However, they run just fine on other nodes with
--mem=2G.</span><br>
<span></span><br>
<span>When I look at the Slurm log file on the nodes, I see
something like this for a failing job (in this case,
--mem=4G was set):</span><br>
<span></span><br>
<span>[2020-07-01T16:19:06.222] _run_prolog: prolog with lock
for job 801777 ran for 0 seconds</span><br>
<span>[2020-07-01T16:19:06.479] [801777.extern] task/cgroup:
/slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB
memsw.limit=unlimited</span><br>
<span>[2020-07-01T16:19:06.483] [801777.extern] task/cgroup:
/slurm/uid_40324/job_801777/step_extern: alloc=4096MB
mem.limit=4096MB memsw.limit=unlimited</span><br>
<span>[2020-07-01T16:19:06.506] Launching batch job 801777 for
UID 40324</span><br>
<span>[2020-07-01T16:19:06.621] [801777.batch] task/cgroup:
/slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB
memsw.limit=unlimited</span><br>
<span>[2020-07-01T16:19:06.623] [801777.batch] task/cgroup:
/slurm/uid_40324/job_801777/step_batch: alloc=4096MB
mem.limit=4096MB memsw.limit=unlimited</span><br>
<span>[2020-07-01T16:19:19.385] [801777.batch] sending
REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0</span><br>
<span>[2020-07-01T16:19:19.389] [801777.batch] done with job</span><br>
<span>[2020-07-01T16:19:19.463] [801777.extern]
_oom_event_monitor: oom-kill event count: 1</span><br>
<span>[2020-07-01T16:19:19.508] [801777.extern] done with job</span><br>
<span></span><br>
<span>Any ideas why the jobs are failing on just these two
nodes, while they run just fine on many other nodes?</span><br>
<span></span><br>
<span>For now, the user is excluding these two nodes using the
-x option to sbatch, but I'd really like to understand
what's going on here.</span><br>
<span></span><br>
<span>-- </span><br>
<span></span><br>
<span>Prentice</span><br>
<span></span><br>
<span></span><br>
</div>
</blockquote>
</blockquote>
<pre class="moz-signature" cols="72">--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
<a class="moz-txt-link-freetext" href="http://www.pppl.gov">http://www.pppl.gov</a></pre>
</body>
</html>