[slurm-users] Jobs killed by OOM-killer only on certain nodes.
Prentice Bisbal
pbisbal at pppl.gov
Thu Jul 2 13:52:15 UTC 2020
I maintain a very heterogeneous cluster (different processors, different
amounts of RAM, etc.). I have a user reporting the following problem.

He's running the same job multiple times with different input
parameters. The jobs run fine unless they land on specific nodes. He's
specifying --mem=2G in his sbatch files. On the nodes where the jobs
fail, I see that the OOM killer is invoked, so I asked him to request
more RAM. He set --mem=4G, and the jobs still fail on those two nodes.
However, they run just fine on other nodes with --mem=2G.
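For context, his submission script looks roughly like this (a minimal
sketch; the job name, output pattern, executable, and input file are
placeholders, not his actual files):

#!/bin/bash
#SBATCH --job-name=sweep          # placeholder job name
#SBATCH --ntasks=1
#SBATCH --mem=4G                  # originally 2G; raised to 4G after the OOM kills
#SBATCH --time=01:00:00
#SBATCH --output=%x-%j.out

./model_run params.in             # placeholder executable and input file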
When I look at the slurm log file on the nodes, I see something like
this for a failing job (in this case, --mem=4G was set):

[2020-07-01T16:19:06.222] _run_prolog: prolog with lock for job 801777 ran for 0 seconds
[2020-07-01T16:19:06.479] [801777.extern] task/cgroup: /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.483] [801777.extern] task/cgroup: /slurm/uid_40324/job_801777/step_extern: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.506] Launching batch job 801777 for UID 40324
[2020-07-01T16:19:06.621] [801777.batch] task/cgroup: /slurm/uid_40324/job_801777: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:06.623] [801777.batch] task/cgroup: /slurm/uid_40324/job_801777/step_batch: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-07-01T16:19:19.385] [801777.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2020-07-01T16:19:19.389] [801777.batch] done with job
[2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill event count: 1
[2020-07-01T16:19:19.508] [801777.extern] done with job
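For what it's worth, while one of these jobs is running I can also read
the limits and the OOM counter straight out of the cgroup on the node,
and compare what Slurm thinks about memory on a failing node versus a
good one. A rough sketch, assuming the cgroup v1 memory hierarchy is
mounted at /sys/fs/cgroup/memory (matching the paths in the log above)
and with placeholder hostnames:

# On a node while job 801777 is running (cgroup v1 memory controller):
cat /sys/fs/cgroup/memory/slurm/uid_40324/job_801777/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/slurm/uid_40324/job_801777/memory.max_usage_in_bytes
cat /sys/fs/cgroup/memory/slurm/uid_40324/job_801777/memory.oom_control   # oom_kill_disable / under_oom flags

# Compare memory accounting on a bad node vs. a good node:
scontrol show node badnode01 | grep -E 'RealMemory|AllocMem|FreeMem'
scontrol show node goodnode01 | grep -E 'RealMemory|AllocMem|FreeMem'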
Any ideas why the jobs are failing on just these two nodes, while they
run just fine on many other nodes?
For now, the user is excluding these two nodes using the -x option to
sbatch, but I'd really like to understand what's going on here.
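For reference, the workaround is just the exclude flag, either in the
script or on the command line (the hostnames here are placeholders for
our two problem nodes):

#SBATCH --exclude=badnode01,badnode02    # placeholder hostnames
# or equivalently at submission time:
# sbatch --exclude=badnode01,badnode02 job.sh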
--
Prentice