[slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

Ward Poelmans ward.poelmans at vub.be
Thu Feb 10 13:59:28 UTC 2022


Hi Paul,

On 10/02/2022 14:33, Paul Brunk wrote:
> Now we see a problem in which the OOM killer is in some cases
> predictably killing job steps that don't seem to deserve it.  In some
> cases these are job scripts and input files which ran fine before our
> Slurm upgrade.  More details follow, but that's the issue in a
> nutshell.

I'm not sure if this is the case, but it might help to keep in mind the difference between mpirun and srun.

With srun, you let Slurm create the tasks with the appropriate memory/CPU limits, and the MPI ranks run directly inside those tasks.

With mpirun, your MPI distribution usually starts one task per node, which spawns the MPI process manager, which in turn starts the actual MPI ranks.
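To make that concrete, here is a minimal batch-script sketch of the two launch styles; the node count, --mem-per-cpu value and program name are purely illustrative, not taken from Paul's job:

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16
    #SBATCH --mem-per-cpu=4G

    # srun: Slurm starts one task per MPI rank, so each rank runs inside
    # the task Slurm sized for it.
    srun ./my_mpi_program

    # mpirun: the MPI process manager typically gets one task per node and
    # forks all local ranks under it, so they share that single task's limits.
    #mpirun -np 32 ./my_mpi_program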

You might very well end up with different memory limits per process, which could be the cause of your OOM issue, especially if not all MPI ranks use the same amount of memory.
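One way to see whether that is happening is to compare requested versus recorded memory per step with sacct (the job ID below is a placeholder, and exactly what MaxRSS accounts for depends on the jobacct_gather plugin in use):

    sacct -j 123456 --format=JobID,JobName,ReqMem,MaxRSS,State

If the step that mpirun started reports a MaxRSS close to the sum of all local ranks while the limit was sized for a single task, that pattern would fit the mismatch described above.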

Ward