[slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

Tue Oct 8 10:53:24 UTC 2019

Hello, thanks for your answers,

> - Does it work if you remove the space in "TaskPlugin=task/affinity,
>  task/cgroup"? (Slurm can be quite picky when reading slurm.conf).

That was a copy/paste mistake on my part when writing the email. There is no space there in my actual slurm.conf.
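
A quick way to double-check this is to grep for whitespace after a comma on the TaskPlugin line. This is only a minimal sketch: the helper name is hypothetical and the slurm.conf path in the example comment is an assumption that varies by site.

```shell
# Hypothetical helper: print TaskPlugin lines that contain whitespace
# after a comma in the plugin list (e.g. "task/affinity, task/cgroup").
# Exit status is 0 if such a line is found, non-zero otherwise.
taskplugin_has_space() {
    # $1: path to slurm.conf
    grep -niE '^TaskPlugin=.*,[[:space:]]+' "$1"
}

# Example (the path is an assumption):
#   taskplugin_has_space /etc/slurm/slurm.conf || echo "no stray space in TaskPlugin"
```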

> 
> - Check slurmd.log on the node(s) of the job to see whether cgroup
>  actually gets activated and starts limiting memory for the job, or
>  whether there are any errors related to cgroup.

Yes, for example:
Launching batch job 1605839 for UID <num>
[1605839.batch] task/cgroup: /slurm/uid_<num>/job_1605839: alloc=200MB mem.limit=200MB memsw.limit=200MB
[1605839.batch] task/cgroup: /slurm/uid_<num>/job_1605839/step_batch: alloc=200MB mem.limit=200MB memsw.limit=200MB

> 
> - While a job is running, look in the cgroup memory directory for the
>  job on the compute node (typically
>  /sys/fs/cgroup/memory/slurm/uid_<num>/job_<num>).  Do the values there,
>  for instance memory.limit_in_bytes and memory.max_usage_in_bytes, make
>  sense?

Yes, for the same job:
cat /sys/fs/cgroup/memory/slurm/uid_<num>/job_1605839/memory.limit_in_bytes 
209715200
root@star190:~# cat /sys/fs/cgroup/memory/slurm/uid_<num>/job_1605839/memory.max_usage_in_bytes
209715200
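
As a sanity check, the limit matches the 200MB allocation from the slurmd log if 1 MB is taken as 1024*1024 bytes. A small sketch of the arithmetic (the function name is hypothetical):

```shell
# Convert a size in MB (1 MB = 1024 * 1024 bytes) to bytes.
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

mb_to_bytes 200    # prints 209715200
```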

But:

cat /sys/fs/cgroup/memory/slurm/uid_<num>/job_1605839/memory.usage_in_bytes 
209711104

is always under memory.max_usage_in_bytes. I think this is because of ConstrainRAMSpace=yes in cgroup.conf, and the process swaps (with ConstrainRAMSpace=no)... I tried Michael Renfro's configuration from the previous email, but with ConstrainRAMSpace=no and ConstrainSwapSpace=no, cgroups are not activated for the job (nothing appears in slurm.log or /sys/fs/cgroup/memory/slurm/uid_<num>/). Setting MemLimitEnforce to no or yes seems to have no influence...
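
The checks above can be scripted. The following is a minimal sketch, assuming the cgroup v1 memory controller layout shown in the paths above. Both helper names are hypothetical. memory.failcnt counts how many times usage hit the limit, and the memory.memsw.* files only exist when the kernel has swap accounting enabled, so they may be reported as not available.

```shell
# Hypothetical helper: print the main memory counters of a job cgroup.
# $1: cgroup directory, e.g. /sys/fs/cgroup/memory/slurm/uid_<num>/job_<num>
cgroup_mem_summary() {
    dir=$1
    for f in memory.limit_in_bytes memory.usage_in_bytes \
             memory.max_usage_in_bytes memory.failcnt \
             memory.memsw.limit_in_bytes memory.memsw.usage_in_bytes; do
        if [ -r "$dir/$f" ]; then
            printf '%s=%s\n' "$f" "$(cat "$dir/$f")"
        else
            # For example, memory.memsw.* is missing without swap accounting.
            printf '%s=(not available)\n' "$f"
        fi
    done
}

# Hypothetical helper: bytes left before the job hits its RAM limit.
cgroup_mem_headroom() {
    limit=$(cat "$1/memory.limit_in_bytes")
    usage=$(cat "$1/memory.usage_in_bytes")
    echo $(( limit - usage ))
}
```

With the values quoted above (limit 209715200, usage 209711104), cgroup_mem_headroom reports 4096 bytes, so the job sits right at its limit without going over.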

Maybe I missed something else...

Regards,

Jean-Mathieu

> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo