[slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

Tue Oct 8 10:43:04 UTC 2019

> On 19-10-08 10:36, Juergen Salk wrote:
> > * Bjørn-Helge Mevik <b.h.mevik at usit.uio.no> [191008 08:34]:
> > > Jean-mathieu CHANTREIN <jean-mathieu.chantrein at univ-angers.fr> writes:
> > > 
> > > > I tried using, in slurm.conf 
> > > > TaskPlugin=task/affinity, task/cgroup 
> > > > SelectTypeParameters=CR_CPU_Memory 
> > > > MemLimitEnforce=yes 
> > > >
> > > > and in cgroup.conf: 
> > > > CgroupAutomount=yes 
> > > > ConstrainCores=yes 
> > > > ConstrainRAMSpace=yes 
> > > > ConstrainSwapSpace=yes 
> > > > MaxSwapPercent=10 
> > > > TaskAffinity=no 
> > > 
> > > We have a very similar setup, the biggest difference being that we have
> > > MemLimitEnforce=no, and leave the killing to the kernel's cgroup.  For
> > > us, jobs are killed as they should. [...] 
> > 
> > that is interesting. We have a very similar setup as well. However, in
> > our Slurm test cluster I have noticed that it is not the *job* that
> > gets killed. Instead, the OOM killer terminates one (or more)
> > *processes* but keeps the job itself running in a potentially 
> > unhealthy state.
> > 
> > Is there a way to tell Slurm to terminate the whole job as soon as 
> > the first OOM kill event takes place during execution? 

* Marcus Boden <mboden at gwdg.de> [191008 10:46]:
> 
> you're looking for KillOnBadExit in the slurm.conf:
> KillOnBadExit
>
>     If set to 1, a step will be terminated immediately if any task
>     is crashed or aborted, as indicated by a non-zero exit code.
>     With the default value of 0, if one of the processes is crashed
>     or aborted the other processes will continue to run while the
>     crashed or aborted process waits. The user can override this
>     configuration parameter by using srun's -K, --kill-on-bad-exit.
> 
> this should terminate the job if a step or a process gets oom-killed.

Hi Marcus,

Thank you. I had not considered `KillOnBadExit=1` so far.

It seems this does indeed kill the current job step if it hits the
memory limit, but the batch script then happily proceeds with the
next step.

I've also noticed that, in order to work as described above, this 
requires all the processes to be launched via srun from within the 
batch script. Right?
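
A minimal sketch of how the batch script itself could stop at the first
failed step instead of proceeding with the next one. It assumes that srun
returns a non-zero exit status when its step is killed (for example after
an OOM kill with KillOnBadExit=1). The helper name run_step and the
programs ./first_step and ./second_step are hypothetical and only serve
as illustration:

```shell
#!/bin/bash
# Sketch only: run_step is a hypothetical helper, not a Slurm feature.

# Run one command; if it exits non-zero, report it and abort the
# whole batch script with the same exit code.
run_step() {
    "$@" && return 0
    local rc=$?
    echo "step '$*' failed with exit code $rc, aborting job" >&2
    exit "$rc"
}

# Intended use inside a batch script (left as comments so that the
# helper can be tried out without a Slurm installation):
# run_step srun ./first_step
# run_step srun ./second_step   # not reached if ./first_step fails
```

A plain `set -e` at the top of the batch script would have a broadly
similar effect, but it aborts on any failing command, not only on the
steps wrapped this way.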

Admittedly, I am also somewhat scared about potential side effects
with `KillOnBadExit=1` set in a production environment that needs to
cope with all sorts of batch scripts. A non-zero exit code of some
process may or may not harm the batch job, whereas processes that get
oom-killed most probably affect the job as a whole. Is
`KillOnBadExit=1` commonly used?

Thanks again.

Best regards
Jürgen

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471