[slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

Bjørn-Helge Mevik b.h.mevik at usit.uio.no
Tue Oct 8 11:10:53 UTC 2019


Marcus Boden <mboden at gwdg.de> writes:

> you're looking for KillOnBadExit in the slurm.conf:
> KillOnBadExit

[...]

> this should terminate the job if a step or a process gets oom-killed.

That is a good tip!

But as I read the documentation (I haven't tested it), it will only kill
the job step itself, not the whole job.  Also, it only takes effect for
things started with srun, mpirun or similar.  However, in combination
with "set -o errexit" in the job script, I believe most OOM kills would
get the job itself terminated.
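
A minimal sketch of the errexit part, runnable without Slurm.  It makes
some assumptions: the OOM kill is simulated by a child process sending
itself SIGKILL (the signal the kernel OOM killer uses), the "#SBATCH"
line and its 1G value are only illustrative, and the temporary script
stands in for a real job script where the failing command would be an
srun step or a plain program.

```shell
#!/bin/bash
# Write a stand-in job script to a temporary file (hypothetical content).
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --mem-per-cpu=1G
# Abort the script as soon as any command returns a non-zero status.
set -o errexit

echo "step 1"
# Stand-in for an OOM-killed command: the child kills itself with SIGKILL,
# so the job script sees exit status 128 + 9 = 137.
bash -c 'kill -KILL $$'
# With errexit set, this line is never reached.
echo "step 2"
EOF

# Run the stand-in script and capture its output and exit status.
status=0
output=$(bash "$script" 2>/dev/null) || status=$?
rm -f "$script"

echo "output: $output"
echo "exit status: $status"
```

Without the "set -o errexit" line, the script would carry on to "step 2"
and exit with status 0, so the failure would go unnoticed.  Note that
errexit is not triggered by commands whose status is tested (in an "if",
or before "&&" or "||"), which is one reason it catches most, but not
necessarily all, such failures.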

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo