[slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?
b.h.mevik at usit.uio.no
Tue Oct 8 06:34:39 UTC 2019
Jean-mathieu CHANTREIN <jean-mathieu.chantrein at univ-angers.fr> writes:
> I tried using, in slurm.conf
> TaskPlugin=task/affinity, task/cgroup
> and in cgroup.conf:
We have a very similar setup, the biggest difference being that we have
MemLimitEnforce=no and leave the killing to the kernel's cgroup. For
us, jobs are killed as they should be. Here are a couple of things you
could check:
- Does it work if you remove the space in "TaskPlugin=task/affinity,
  task/cgroup"? (Slurm can be quite picky when parsing slurm.conf.)
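  For example, the line would then read (a minimal fragment, not a
  complete slurm.conf):

  ```
  TaskPlugin=task/affinity,task/cgroup
  ```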
- Check slurmd.log on the job's node(s) to see whether the cgroup
  actually gets activated and starts limiting memory for the job, and
  whether there are any cgroup-related errors.
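  A sketch of that check (the log path and the log lines below are
  illustrative assumptions, not real Slurm output; the sample file just
  makes the commands self-contained):

  ```shell
  # Point at a sample log so the sketch runs anywhere; on a real node
  # this would be slurmd's actual log file (path varies per site).
  LOG=/tmp/slurmd.log.sample
  cat > "$LOG" <<'EOF'
  [2019-10-08T06:00:01] task/cgroup: job memory cgroup set up (illustrative line)
  [2019-10-08T06:00:02] error: something cgroup-related failed (illustrative line)
  EOF

  # Did the task/cgroup plugin do anything for the job?
  grep -c 'task/cgroup' "$LOG"
  # Were any errors logged?
  grep -ci 'error' "$LOG"
  ```
  
  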
- While a job is running, look at the job's cgroup memory directory on
  the compute node (typically
  /sys/fs/cgroup/memory/slurm/uid_<num>/job_<num>). Do the values
  there, for instance memory.limit_in_bytes and
  memory.max_usage_in_bytes, make sense?
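  One way to sanity-check those values: memory.limit_in_bytes should
  come out to --mem-per-cpu times the number of allocated CPUs. The
  sketch below runs against a mock of the cgroup directory (the real
  path is on the compute node; the numbers are assumed for
  illustration):

  ```shell
  # Mock of /sys/fs/cgroup/memory/slurm/uid_<num>/job_<num>
  DIR=/tmp/mock_cgroup/job_42
  mkdir -p "$DIR"

  # Assumed job parameters: --mem-per-cpu=2048 (MB) on 2 CPUs.
  MEM_PER_CPU_MB=2048
  NCPUS=2
  echo $(( MEM_PER_CPU_MB * NCPUS * 1024 * 1024 )) > "$DIR/memory.limit_in_bytes"

  # The check itself: does the cgroup limit match what the job asked for?
  LIMIT=$(cat "$DIR/memory.limit_in_bytes")
  EXPECTED=$(( MEM_PER_CPU_MB * NCPUS * 1024 * 1024 ))
  if [ "$LIMIT" -eq "$EXPECTED" ]; then
      echo "limit matches --mem-per-cpu * ncpus"
  else
      echo "limit does NOT match: got $LIMIT, expected $EXPECTED"
  fi
  ```
  
  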
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo