[slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?
jean-mathieu.chantrein at univ-angers.fr
Tue Oct 8 10:53:24 UTC 2019
Hello, thanks for your answers,
> - Does it work if you remove the space in "TaskPlugin=task/affinity,
> task/cgroup"? (Slurm can be quite picky when reading slurm.conf).
That was just a copy/paste mistake on my part when writing the email; there is no space there in my actual slurm.conf.
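For reference, the line in my slurm.conf reads:

TaskPlugin=task/affinity,task/cgroup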
> - See in slurmd.log on the node(s) of the job if cgroup actually gets
> activated and starts limiting memory for the job, or if there are any
> errors related to cgroup.
Launching batch job 1605839 for UID <num>
[1605839.batch] task/cgroup: /slurm/uid_<num>/job_1605839: alloc=200MB mem.limit=200MB memsw.limit=200MB
[1605839.batch] task/cgroup: /slurm/uid_<num>/job_1605839/step_batch: alloc=200MB mem.limit=200MB memsw.limit=200MB
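As a cross-check (assuming the 200MB above is 200 MiB), memory.limit_in_bytes for this job on the node should be 200 * 1024 * 1024 = 209715200:

root@star190:~# cat /sys/fs/cgroup/memory/slurm/uid_<num>/job_1605839/memory.limit_in_bytes
209715200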
> - While a job is running, see in the cgroup memory directory (typically
> /sys/fs/cgroup/memory/slurm/uid_<num>/job_<num> for the job (on the
> compute node). Do the values there, for instance
> memory.limit_in_bytes and memory.max_usage_in_bytes, make sense?
Yes, for the same job:
root@star190:~# cat /sys/fs/cgroup/memory/slurm/uid_<num>/job_1605839/memory.max_usage_in_bytes

The value there always stays under memory.limit_in_bytes. I think this is because, with ConstrainRAMSpace=yes in cgroup.conf, the kernel keeps the job under its RAM limit by pushing it into swap instead of killing it.

I tried the configuration Michael Renfro suggested in a previous email, but with ConstrainRAMSpace=no and ConstrainSwapSpace=no no cgroup is activated for the job at all (nothing appears in slurmd.log or in /sys/fs/cgroup/memory/slurm/uid_<num>/). Setting MemLimitEnforce to no or yes seems to have no influence either...
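For what it is worth, here is the cgroup.conf combination I understand should make the OOM killer terminate the job instead of letting it swap (just a sketch on my side, not verified yet):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=0

With ConstrainSwapSpace=yes and AllowedSwapSpace=0, RAM+swap should be capped at the allocation (200MB here, which would match the memsw.limit in the slurmd.log lines above), so a job that exceeds its request should be killed rather than spill into swap.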
Maybe I missed something else...
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo