[slurm-users] [External] Re: Jobs killed by OOM-killer only on certain nodes.
Prentice Bisbal
pbisbal at pppl.gov
Thu Jul 2 15:04:19 UTC 2020
Not 100%, which is why I'm asking here. I searched the log files, and that
line was only present after a handful of jobs, including the ones I'm
investigating, so it's not something that happens after every job.
However, this is happening on nodes with plenty of RAM, so if the OOM
killer is being invoked, something odd is definitely going on.
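
For anyone else chasing something similar, the checks I'm planning to run
on the affected nodes are roughly the following (this assumes the default
cgroup v1 memory controller mount, and sacct field names can vary a bit by
Slurm version, so treat it as a sketch rather than a recipe):

    # Did the kernel OOM killer actually fire, and was it a system-wide
    # OOM or a "Memory cgroup out of memory" kill against the job's cgroup?
    dmesg -T | grep -i 'out of memory'

    # What does Slurm accounting say the job requested, used, and returned?
    sacct -j 801777 --format=JobID,State,ExitCode,ReqMem,MaxRSS,NodeList

    # For a job that's still running, the cgroup counters themselves:
    cat /sys/fs/cgroup/memory/slurm/uid_40324/job_801777/memory.max_usage_in_bytes
    cat /sys/fs/cgroup/memory/slurm/uid_40324/job_801777/memory.oom_control
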
On 7/2/20 10:20 AM, Ryan Novosielski wrote:
> Are you sure that the OOM killer is involved? I can get you specifics
> later, but if it’s that one line about OOM events, you may see it
> after successful jobs too. I just had a SLURM bug where this came up.
>
> --
> ____
> || \\UTGERS,     |---------------------------*O*---------------------------
> ||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>      `'
>
>> On Jul 2, 2020, at 09:53, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>>
>> I maintain a very heterogeneous cluster (different processors,
>> different amounts of RAM, etc.). I have a user reporting the following
>> problem.
>>
>> He's running the same job multiple times with different input
>> parameters. The jobs run fine unless they land on specific nodes.
>> He's specifying --mem=2G in his sbatch files. On the nodes where the
>> jobs fail, I see that the OOM killer is invoked, so I asked him to
>> request more RAM, which he did. With --mem=4G the jobs still fail on
>> those two nodes, yet they run just fine on other nodes with --mem=2G.
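>>
>> Roughly, his submissions look like the following (the script contents,
>> task layout, and program name here are placeholders, not his actual
>> files):
>>
>>     #!/bin/bash
>>     #SBATCH --mem=2G        # bumped to --mem=4G in the retries
>>     #SBATCH --ntasks=1
>>     srun ./model input_001.dat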
>>
>> When I look at the slurmd log file on the nodes, I see something like
>> this for a failing job (in this case, --mem=4G was set):
>>
>> [2020-07-01T16:19:06.222] _run_prolog: prolog with lock for job 801777 ran for 0 seconds
>> [2020-07-01T16:19:06.479] [801777.extern] task/cgroup: /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
>> [2020-07-01T16:19:06.483] [801777.extern] task/cgroup: /slurm/uid_40324/job_801777/step_extern: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
>> [2020-07-01T16:19:06.506] Launching batch job 801777 for UID 40324
>> [2020-07-01T16:19:06.621] [801777.batch] task/cgroup: /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
>> [2020-07-01T16:19:06.623] [801777.batch] task/cgroup: /slurm/uid_40324/job_801777/step_batch: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
>> [2020-07-01T16:19:19.385] [801777.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
>> [2020-07-01T16:19:19.389] [801777.batch] done with job
>> [2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill event count: 1
>> [2020-07-01T16:19:19.508] [801777.extern] done with job
>>
>> Any ideas why the jobs are failing on just these two nodes, while
>> they run just fine on many other nodes?
>>
>> For now, the user is excluding these two nodes using the -x option to
>> sbatch, but I'd really like to understand what's going on here.
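>>
>> Concretely that's just something like the following, with placeholder
>> node names standing in for the two problem nodes:
>>
>>     sbatch -x node01,node02 job.slurm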
>>
>> --
>>
>> Prentice
>>
>>
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov