[slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?
b.h.mevik at usit.uio.no
Tue Oct 8 06:34:39 UTC 2019
Jean-mathieu CHANTREIN <jean-mathieu.chantrein at univ-angers.fr> writes:
> I tried using, in slurm.conf
> TaskPlugin=task/affinity, task/cgroup
> and in cgroup.conf:
We have a very similar setup, the biggest difference being that we have
MemLimitEnforce=no and leave the killing to the kernel's cgroup. For
us, jobs are killed as they should be. Here are a couple of things you
could check:
- Does it work if you remove the space in "TaskPlugin=task/affinity,
  task/cgroup"? (Slurm can be quite picky when parsing slurm.conf.)
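  For example, the line would then read (a minimal fragment, not a
  complete slurm.conf):

  ```
  TaskPlugin=task/affinity,task/cgroup
  ```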
- Check slurmd.log on the job's node(s) to see whether the cgroup
  actually gets activated and starts limiting memory for the job, and
  whether there are any cgroup-related errors.
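  A sketch of that check (the log path and the log lines below are
  illustrative assumptions, not real Slurm output; the sample file just
  makes the commands self-contained):

  ```shell
  # Point at a sample log so the sketch runs anywhere; on a real node
  # this would be slurmd's actual log file (path varies per site).
  LOG=/tmp/slurmd.log.sample
  cat > "$LOG" <<'EOF'
  [2019-10-08T06:00:01] task/cgroup: job memory cgroup set up (illustrative line)
  [2019-10-08T06:00:02] error: something cgroup-related failed (illustrative line)
  EOF

  # Did the task/cgroup plugin do anything for the job?
  grep -c 'task/cgroup' "$LOG"
  # Were any errors logged?
  grep -ci 'error' "$LOG"
  ```
  
  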
- While a job is running, look at the job's cgroup memory directory on
  the compute node (typically
  /sys/fs/cgroup/memory/slurm/uid_<num>/job_<num>). Do the values
  there, for instance memory.limit_in_bytes and
  memory.max_usage_in_bytes, make sense?
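  One way to sanity-check those values: memory.limit_in_bytes should
  come out to --mem-per-cpu times the number of allocated CPUs. The
  sketch below runs against a mock of the cgroup directory (the real
  path is on the compute node; the numbers are assumed for
  illustration):

  ```shell
  # Mock of /sys/fs/cgroup/memory/slurm/uid_<num>/job_<num>
  DIR=/tmp/mock_cgroup/job_42
  mkdir -p "$DIR"

  # Assumed job parameters: --mem-per-cpu=2048 (MB) on 2 CPUs.
  MEM_PER_CPU_MB=2048
  NCPUS=2
  echo $(( MEM_PER_CPU_MB * NCPUS * 1024 * 1024 )) > "$DIR/memory.limit_in_bytes"

  # The check itself: does the cgroup limit match what the job asked for?
  LIMIT=$(cat "$DIR/memory.limit_in_bytes")
  EXPECTED=$(( MEM_PER_CPU_MB * NCPUS * 1024 * 1024 ))
  if [ "$LIMIT" -eq "$EXPECTED" ]; then
      echo "limit matches --mem-per-cpu * ncpus"
  else
      echo "limit does NOT match: got $LIMIT, expected $EXPECTED"
  fi
  ```
  
  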
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo