[slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

Thu Feb 10 13:33:33 UTC 2022

Hello all:

We upgraded from 20.11.8 to 21.08.5 (CentOS 7.9, Slurm built without
pmix support) recently.  After that, we found that in many cases,
'mpirun' was causing multi-node MPI jobs to have all MPI ranks within
a node run on the same core.  We've moved on to 'srun'.
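
One way to check where ranks actually land is to have each task print
its own CPU affinity.  The following is only a minimal sketch, assuming
Python 3 is available on the compute nodes and that the tasks are
started by srun (so that SLURM_PROCID and SLURMD_NODENAME are set in
the task environment).  The helper name describe_binding is
hypothetical, not part of Slurm:

```python
import os


def describe_binding(env, cpus):
    """Format one line describing which CPUs a task is allowed to run on."""
    # SLURM_PROCID / SLURMD_NODENAME are set by srun for each task;
    # fall back to "?" when the script runs outside of a job step.
    rank = env.get("SLURM_PROCID", "?")
    node = env.get("SLURMD_NODENAME", "?")
    cpu_list = ",".join(str(c) for c in sorted(cpus))
    return f"rank {rank} on {node}: cpus {cpu_list}"


if __name__ == "__main__":
    # os.sched_getaffinity is Linux-only; pid 0 means the calling process.
    print(describe_binding(os.environ, os.sched_getaffinity(0)))
```

Launched once per task inside an allocation, every rank on a node
reporting the same single CPU would match the symptom described above.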

Now we see a problem in which the OOM killer is in some cases
predictably killing job steps that don't seem to deserve it.  In some
cases these are job scripts and input files which ran fine before our
Slurm upgrade.  More details follow, but that's the issue in a
nutshell.

Other than the version, our one Slurm config change was to remove the
deprecated 'TaskPluginParam=Sched' from slurm.conf, giving it its
default 'null' value.  Our TaskPlugin remains
'task/affinity,task/cgroup'.

We've had apparently correct cgroup-based mem limit enforcement in
place for a long time, so the OOM-killing of the jobs I’m referencing is a
change in behavior.

Below are some of our support team's findings.  I haven't finished
trying to correlate the anomalous job events with specific OOM
complaints, or recorded job resource usage at those times.  I'm just
throwing out this message in case what we've seen so far, or the
painfully obvious thing I’m missing, looks familiar to anyone.  Thanks!

Application: VASP 6.1.2 launched with srun
MPI libraries: intel/2019b
Observations:

Test 1. QDR-fabric Intel nodes (20 nodes x 10 cores/node)
         outcome: job failed right away, no output generated
         error text: 20 occurrences of the form
         "[13:ra8-10] unexpected reject event from 9:ra8-9"

Test 2. EDR-fabric Intel nodes (20 nodes x 10 cores/node)
         outcome: job ran for 12 minutes, generated some output data that look fine
         error text: no error messages, job failed.

Test 3. AMD Rome (20 nodes x 10 cores/node)
         outcome: job completed successfully after 31 minutes, user
         confirmed the results are fine

Application: Quantum Espresso 6.5 launched with srun
MPI libraries: intel/2019b
Observations:

- Works correctly when using: 1 node x 64 cores (64 MPI processes),
   1x128 (128 MPI processes) (other QE parameters -nk 1 -nt 4,
   mem-per-cpu=1500mb)

- A few processes get OOM killed after a while when using: 4 nodes x 32
   cores (128 MPI processes), 4 nodes x 64 cores (256 MPI processes)

- Job fails within seconds when using: 16 nodes x 8 cores
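
For reference, the per-node memory ceiling implied by each layout can
be worked out as below.  This is only a sketch under two assumptions
that are not confirmed above: that mem-per-cpu=1500mb applies to every
layout listed (it is stated only alongside the working runs), and that
the cgroup limit on a node is simply mem-per-cpu times the CPUs
allocated on that node.  "1x128" is read as 1 node x 128 cores.

```python
def per_node_limit_mb(cores_per_node, mem_per_cpu_mb=1500):
    """Assumed per-node memory ceiling: CPUs on the node x mem-per-cpu."""
    return cores_per_node * mem_per_cpu_mb


# (nodes, cores per node) for the layouts listed above
layouts = [(1, 64), (1, 128), (4, 32), (4, 64), (16, 8)]

if __name__ == "__main__":
    for nodes, cores in layouts:
        limit = per_node_limit_mb(cores)
        print(f"{nodes:>2} nodes x {cores:>3} cores: "
              f"{nodes * cores:>3} ranks, {limit} MB per node")
```

Under those assumptions the per-rank share is 1500 MB in every layout,
while the per-node ceiling ranges from 12000 MB (16 nodes x 8 cores) to
192000 MB (1x128).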

--
Paul Brunk, system administrator
Georgia Advanced Resource Computing Center
Enterprise IT Svcs, the University of Georgia
