[slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5
Paul Brunk
pbrunk at uga.edu
Mon Feb 14 14:17:55 UTC 2022
Hi:
Thanks for your feedback, guys :).
We continue to find srun behaving properly re: core placement.
BTW, we've further established that only MVAPICH (and therefore also Intel MPI) jobs are encountering the OOM issue.
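For anyone who wants to spot-check placement themselves, a minimal sketch (the task count and ./my_mpi_app binary are just placeholders) is to have srun report the binding it applies:

  # Launch 4 tasks and print the CPU mask each task is bound to.
  srun --ntasks=4 --cpu-bind=verbose ./my_mpi_app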
==
Paul Brunk, system administrator
Georgia Advanced Resource Computing Center
Enterprise IT Svcs, the University of Georgia
Paul Edmon wrote:
We also noticed the same thing with 21.08.5. In the 21.08 series SchedMD changed the way they handle cgroups to set the stage for cgroups v2 (see: https://slurm.schedmd.com/SLUG21/Roadmap.pdf). The 21.08.5 release introduced a bug fix which in turn caused mpirun not to pin properly (particularly for older versions of MPI): https://github.com/SchedMD/slurm/blob/slurm-21-08-5-1/NEWS

What we've recommended to users who have hit this is to swap over to using srun instead of mpirun, and the situation clears up.
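As a rough sketch of that swap (the module name, memory request, and ./my_mpi_app binary are placeholders, and --mpi=pmi2 assumes an MVAPICH2 build with PMI2 support), a job script would change along these lines:

  #!/bin/bash
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=8
  #SBATCH --mem-per-cpu=2G

  module load mvapich2              # placeholder module name

  # Old launch line, where the MPI library places processes itself:
  #   mpirun -np $SLURM_NTASKS ./my_mpi_app
  # New launch line, letting Slurm start the tasks inside the job's cgroup:
  srun --mpi=pmi2 ./my_mpi_app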
-Paul Edmon-
On 2/10/2022 8:59 AM, Ward Poelmans wrote:
I'm not sure if this is the case, but it might help to keep in mind the difference between mpirun and srun.