[slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

Tue Oct 8 08:36:15 UTC 2019

* Bjørn-Helge Mevik <b.h.mevik at usit.uio.no> [191008 08:34]:
> Jean-mathieu CHANTREIN <jean-mathieu.chantrein at univ-angers.fr> writes:
> 
> > I tried using, in slurm.conf 
> > TaskPlugin=task/affinity, task/cgroup 
> > SelectTypeParameters=CR_CPU_Memory 
> > MemLimitEnforce=yes 
> >
> > and in cgroup.conf: 
> > CgroupAutomount=yes 
> > ConstrainCores=yes 
> > ConstrainRAMSpace=yes 
> > ConstrainSwapSpace=yes 
> > MaxSwapPercent=10 
> > TaskAffinity=no 
> 
> We have a very similar setup, the biggest difference being that we have
> MemLimitEnforce=no, and leave the killing to the kernel's cgroup.  For
> us, jobs are killed as they should. [...] 

Hello Bjørn-Helge,

That is interesting. We have a very similar setup as well. However, in
our Slurm test cluster I have noticed that it is not the *job* that
gets killed. Instead, the OOM killer terminates one (or more)
*processes* but keeps the job itself running in a potentially 
unhealthy state.

Is there a way to tell Slurm to terminate the whole job as soon as 
the first OOM kill event takes place during execution? 
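
To make the observation concrete, here is a minimal sketch of how such
OOM kill events can be detected from the job's memory cgroup. It only
illustrates detection, not a way to make Slurm act on it. It assumes
the kernel exposes an "oom_kill" counter, which is the case in
memory.events on cgroup v2 and in memory.oom_control on cgroup v1 with
a reasonably recent kernel. The function names are hypothetical, and
the path to the job's cgroup file is deliberately left as a parameter,
since it depends on the cgroup version and the local Slurm setup.

```python
def count_oom_kills(events_text: str) -> int:
    """Return the oom_kill counter found in cgroup memory event text.

    Accepts the contents of memory.events (cgroup v2) or
    memory.oom_control (cgroup v1). Returns 0 if no counter is present.
    """
    for line in events_text.splitlines():
        fields = line.split()
        # Match the key exactly so "oom_kill_disable" is not mistaken for it.
        if len(fields) == 2 and fields[0] == "oom_kill":
            return int(fields[1])
    return 0


def job_had_oom_kill(cgroup_file: str) -> bool:
    """Report whether the cgroup file records at least one OOM kill.

    cgroup_file is an assumption: pass whatever path your site uses
    for the job's memory cgroup.
    """
    with open(cgroup_file) as fh:
        return count_oom_kills(fh.read()) > 0
```

A non-zero counter while the job is still listed as running would
correspond to the state described above: individual processes were
killed, but the job itself was not.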

Best regards
Jürgen

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471