[slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5
pbrunk at uga.edu
Mon Feb 14 14:17:55 UTC 2022
Thanks for your feedback guys :).
We continue to find srun behaving properly re: core placement.
BTW, we've further established that only MVAPICH (and therefore also Intel MPI) jobs are encountering the OOM issue.
Paul Brunk, system administrator
Georgia Advanced Resource Computing Center
Enterprise IT Svcs, the University of Georgia
Paul Edmon wrote:
We also noticed the same thing with 21.08.5. In the 21.08 series SchedMD changed the way they handle cgroups to set the stage for cgroups v2 (see: https://slurm.schedmd.com/SLUG21/Roadmap.pdf). Release 21.08.5 introduced a bug fix which then caused mpirun to not pin processes properly, particularly with older versions of MPI (see: https://github.com/SchedMD/slurm/blob/slurm-21-08-5-1/NEWS). What we've recommended to users who have hit this is to switch from mpirun to srun, after which the situation clears up.
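For anyone wanting to try the mpirun-to-srun swap described above, a minimal sketch of a batch script follows. The binary name ./my_mpi_app and the resource requests are placeholders, not taken from anyone's actual jobs. It also assumes the MPI library was built with a PMI interface that this Slurm installation supports; "srun --mpi=list" shows which plugin types are available on a given cluster.

```shell
#!/bin/bash
#SBATCH --ntasks=8            # illustrative task count (assumption)
#SBATCH --mem-per-cpu=2G      # illustrative memory request (assumption)

# Before: mpirun starts the MPI ranks itself.
# mpirun -np "$SLURM_NTASKS" ./my_mpi_app

# After: srun launches each rank as a Slurm task, so Slurm itself
# handles core binding and cgroup placement for every rank.
# ./my_mpi_app is a hypothetical binary name.
srun ./my_mpi_app
```

If srun does not pick up the right PMI type by default, it can be selected explicitly, e.g. "srun --mpi=pmi2 ./my_mpi_app", provided pmi2 appears in the "srun --mpi=list" output.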
On 2/10/2022 8:59 AM, Ward Poelmans wrote:
I'm not sure if this is the case but it might help to keep in mind the difference between mpirun and srun.