[slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?
Juergen Salk
juergen.salk at uni-ulm.de
Tue Oct 8 08:36:15 UTC 2019
* Bjørn-Helge Mevik <b.h.mevik at usit.uio.no> [191008 08:34]:
> Jean-mathieu CHANTREIN <jean-mathieu.chantrein at univ-angers.fr> writes:
>
> > I tried using, in slurm.conf
> > TaskPlugin=task/affinity,task/cgroup
> > SelectTypeParameters=CR_CPU_Memory
> > MemLimitEnforce=yes
> >
> > and in cgroup.conf:
> > CgroupAutomount=yes
> > ConstrainCores=yes
> > ConstrainRAMSpace=yes
> > ConstrainSwapSpace=yes
> > MaxSwapPercent=10
> > TaskAffinity=no
>
> We have a very similar setup, the biggest difference being that we have
> MemLimitEnforce=no, and leave the killing to the kernel's cgroup. For
> us, jobs are killed as they should. [...]
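(For reference, the variant Bjørn-Helge describes would amount to roughly the following in slurm.conf -- a sketch only, assuming the same cgroup.conf settings as quoted above:)

```
# slurm.conf (sketch; cgroup.conf as quoted above)
TaskPlugin=task/affinity,task/cgroup
SelectTypeParameters=CR_CPU_Memory
MemLimitEnforce=no    # leave enforcement to the kernel's cgroup OOM killer
```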
Hello Bjørn-Helge,
that is interesting. We have a very similar setup as well. However, in
our Slurm test cluster I have noticed that it is not the *job* that
gets killed. Instead, the OOM killer terminates one (or more)
*processes* but keeps the job itself running in a potentially
unhealthy state.
Is there a way to tell Slurm to terminate the whole job as soon as
the first OOM kill event takes place during execution?
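To illustrate the distinction I mean (this is only a sketch, plain Python with no Slurm involved): when the OOM killer SIGKILLs a single process, a job whose remaining tasks simply keep running survives in a degraded state. A wrapper that treats the first killed child as fatal for all tasks approximates the "terminate the whole job on the first OOM kill event" behaviour I am after. (Newer Slurm releases at least flag such jobs, e.g. sacct reporting an OUT_OF_MEMORY state, but that by itself does not stop the surviving processes.)

```python
import os
import signal
import time

# Hypothetical two-task "job": each child just sleeps, standing in
# for real work.
pids = []
for _ in range(2):
    pid = os.fork()
    if pid == 0:          # child process
        time.sleep(60)
        os._exit(0)
    pids.append(pid)

# Stand-in for the kernel's OOM killer terminating a single process.
os.kill(pids[0], signal.SIGKILL)

# Without intervention the surviving child would run for the full
# 60 s -- the "job" lives on in an unhealthy state.  Treat the first
# SIGKILLed child as fatal and take down the remaining tasks:
dead, status = os.wait()
if os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL:
    for p in pids:
        if p != dead:
            os.kill(p, signal.SIGKILL)
            os.waitpid(p, 0)   # reap the remaining child
    print("whole job terminated after first kill event")
```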
Best regards
Jürgen
--
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Phone: +49 (0)731 50-22478
Fax: +49 (0)731 50-22471