[slurm-users] [External] Re: Jobs killed by OOM-killer only on certain nodes.
Prentice Bisbal
pbisbal at pppl.gov
Thu Jul 2 15:04:19 UTC 2020
Not 100%, which is why I'm asking here. I searched the log files, and that
line was only present after a handful of jobs, including the ones I'm
investigating, so it's not something that happens after every job.
However, this is happening on nodes with plenty of RAM, so if the OOM
killer is being invoked, something odd is definitely going on.
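
For anyone else chasing something similar, the checks I'm planning to run
on the affected nodes are roughly the following (this assumes the default
cgroup v1 memory controller mount, and sacct field names can vary a bit by
Slurm version, so treat it as a sketch rather than a recipe):

    # Did the kernel OOM killer actually fire, and was it a system-wide
    # OOM or a "Memory cgroup out of memory" kill against the job's cgroup?
    dmesg -T | grep -i 'out of memory'

    # What does Slurm accounting say the job requested, used, and returned?
    sacct -j 801777 --format=JobID,State,ExitCode,ReqMem,MaxRSS,NodeList

    # For a job that's still running, the cgroup counters themselves:
    cat /sys/fs/cgroup/memory/slurm/uid_40324/job_801777/memory.max_usage_in_bytes
    cat /sys/fs/cgroup/memory/slurm/uid_40324/job_801777/memory.oom_control
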
On 7/2/20 10:20 AM, Ryan Novosielski wrote:
> Are you sure that the OOM killer is involved? I can get you specifics
> later, but if it’s that one line about OOM events, you may see it
> after successful jobs too. I just had a SLURM bug where this came up.
>
> --
> ____
> || \\UTGERS,     |---------------------------*O*---------------------------
> ||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>      `'
>
>> On Jul 2, 2020, at 09:53, Prentice Bisbal <pbisbal at pppl.gov> wrote:
>>
>> I maintain a very heterogeneous cluster (different processors,
>> different amounts of RAM, etc.). I have a user reporting the following
>> problem.
>>
>> He's running the same job multiple times with different input
>> parameters. The jobs run fine unless they land on specific nodes.
>> He's specifying --mem=2G in his sbatch files. On the nodes where the
>> jobs fail, I see that the OOM killer is invoked, so I asked him to
>> request more RAM, which he did. With --mem=4G the jobs still fail on
>> those two nodes, yet they run just fine on other nodes with --mem=2G.
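>>
>> Roughly, his submissions look like the following (the script contents,
>> task layout, and program name here are placeholders, not his actual
>> files):
>>
>>     #!/bin/bash
>>     #SBATCH --mem=2G        # bumped to --mem=4G in the retries
>>     #SBATCH --ntasks=1
>>     srun ./model input_001.dat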
>>
>> When I look at the slurmd log file on the nodes, I see something like
>> this for a failing job (in this case, --mem=4G was set):
>>
>> [2020-07-01T16:19:06.222] _run_prolog: prolog with lock for job 801777 ran for 0 seconds
>> [2020-07-01T16:19:06.479] [801777.extern] task/cgroup: /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
>> [2020-07-01T16:19:06.483] [801777.extern] task/cgroup: /slurm/uid_40324/job_801777/step_extern: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
>> [2020-07-01T16:19:06.506] Launching batch job 801777 for UID 40324
>> [2020-07-01T16:19:06.621] [801777.batch] task/cgroup: /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
>> [2020-07-01T16:19:06.623] [801777.batch] task/cgroup: /slurm/uid_40324/job_801777/step_batch: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
>> [2020-07-01T16:19:19.385] [801777.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
>> [2020-07-01T16:19:19.389] [801777.batch] done with job
>> [2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill event count: 1
>> [2020-07-01T16:19:19.508] [801777.extern] done with job
>>
>> Any ideas why the jobs are failing on just these two nodes, while
>> they run just fine on many other nodes?
>>
>> For now, the user is excluding these two nodes using the -x option to
>> sbatch, but I'd really like to understand what's going on here.
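>>
>> Concretely that's just something like the following, with placeholder
>> node names standing in for the two problem nodes:
>>
>>     sbatch -x node01,node02 job.slurm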
>>
>> --
>>
>> Prentice
>>
>>
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov