[slurm-users] OverMemoryKill Not Working?

Thu Oct 24 20:00:24 UTC 2019

Hello,

We are testing Slurm 19.05 on Linux RHEL 7.5+ with the intent to migrate to
it from Torque/Moab in the near future.

One of the things our users are used to is that when their jobs exceed the
amount of memory they requested, the job is terminated by the scheduler.
We realize that Slurm prefers to use cgroups to contain rather than kill
the jobs but initially we need to have the kill option in place to
transition our users.

So, looking at the documentation, it appears that in 19.05, the following
needs to be set to accomplish this:

JobAcctGatherParams     = OverMemoryKill

Other possibly relevant settings we made:

JobAcctGatherType       = jobacct_gather/linux

ProctrackType           = proctrack/linuxproc

We have avoided configuring any cgroup parameters for the time being.
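
A quick way to confirm that the running daemons actually picked up these values might look like the sketch below. It assumes `scontrol` is on the PATH; the helper name `show_memkill_settings` is made up for illustration, and `JobAcctGatherFrequency` is included only on the assumption that the sampling interval affects when the memory check runs.

```shell
#!/bin/bash

# Hypothetical helper: keep only the settings discussed above from
# "scontrol show config"-style output read on stdin.
show_memkill_settings() {
    grep -E '^(JobAcctGatherParams|JobAcctGatherType|JobAcctGatherFrequency|ProctrackType)[[:space:]]*='
}

# Query the live configuration only where Slurm's scontrol is available.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | show_memkill_settings
fi
```

If a value printed here differs from slurm.conf, the daemons may not have been restarted or reconfigured since the file was edited.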

Unfortunately, when we submit a job with the following:

#SBATCH --nodes=1

#SBATCH --ntasks-per-node=1

#SBATCH --mem=1GB

We see the RSS of the job steadily increase beyond the 1GB limit and it is
never killed. Interestingly enough, the proc information shows the
ulimit (hard and soft) for the process set to around 1GB.
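
For anyone wanting to reproduce this, a minimal test job along these lines should show the same steady RSS growth. This is only a sketch, not the actual job we ran: the `grow_memory` helper, the 1 MiB step, and the 2048 MiB target are all illustrative assumptions.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1GB

# grow_memory TOTAL_MB [PAUSE_SECONDS]
# Hypothetical helper: appends 1 MiB of 'x' characters to a shell variable
# TOTAL_MB times, pausing between steps so the RSS growth can be watched.
grow_memory() {
    local total_mb=$1
    local pause=${2:-1}
    local chunk data="" i
    chunk=$(head -c 1048576 /dev/zero | tr '\0' 'x')
    for ((i = 1; i <= total_mb; i++)); do
        data+=$chunk
        echo "holding ${i} MiB" >&2
        sleep "$pause"
    done
    # Print the number of characters held, as a simple sanity check.
    echo "${#data}"
}

# Only run the full 2 GiB ramp inside a Slurm job, where SLURM_JOB_ID is set.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    grow_memory 2048
fi
```

Submitted with sbatch, the script ramps to roughly twice the requested memory over about half an hour, so a working kill mechanism would be expected to end the job at some point after the 1024 MiB mark.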

We have tried various settings without any success. Can anyone point out
what we are doing wrong?

Thanks,

Mike

-- 
*J. Michael Mosley*
University Research Computing
The University of North Carolina at Charlotte
9201 University City Blvd
Charlotte, NC  28223
*704.687.7065 *    * jmmosley at uncc.edu <mmosley at uncc.edu>*