[slurm-users] OverMemoryKill Not Working?
mercan
ahmet.mercan at uhem.itu.edu.tr
Fri Oct 25 13:34:47 UTC 2019
Hi;
The Slurm documentation at these pages:
https://slurm.schedmd.com/slurm.conf.html
https://slurm.schedmd.com/cons_res_share.html
conflicts with the Slurm 19.05 release notes at this page:
https://slurm.schedmd.com/news.html
The documentation pages are probably out of date, but I don't know of any
current document that describes Slurm version 19.05.
Regards,
Ahmet M.
On 25.10.2019 16:17, Mike Mosley wrote:
> Ahmet,
>
> Thank you for taking the time to respond to my question.
>
> Yes, the --mem=1GBB is a typo. It's correct in my script, I just
> fat-fingered it in the email. :-)
>
> BTW, the exact version I am using is 19.05.2.
>
> Regarding your response, it seems that that might be more than what I
> need. I simply want to enforce the memory limits as specified by the
> user at job submission time. This seems to have been the behavior in
> previous versions of Slurm. What I want is what is described in the
> 19.05 release notes:
>
> RELEASE NOTES FOR SLURM VERSION 19.05
> 28 May 2019
>
> NOTE: slurmd and slurmctld will now fatal if two incompatible mechanisms for
> enforcing memory limits are set. This makes incompatible the use of
> task/cgroup memory limit enforcing (Constrain[RAM|Swap]Space=yes) with
> JobAcctGatherParams=OverMemoryKill, which could cause problems when a
> task is killed by one of them while the other is at the same time
> managing that task. The NoOverMemoryKill setting has been deprecated in
> favor of OverMemoryKill, since now the default is NOT to have any
> memory enforcement mechanism.
>
> NOTE: MemLimitEnforce parameter has been removed and the functionality that
> was provided with it has been merged into a JobAcctGatherParams. It
> may be enabled by setting JobAcctGatherParams=OverMemoryKill, so now
> job and steps killing by OOM is enabled from the same place.
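>
> (For illustration, a minimal slurm.conf fragment matching what the note
> describes, i.e. polling-based enforcement with no cgroup memory limits;
> this is only a sketch of my reading of the notes, not a tested configuration:)
>
>     # slurm.conf -- kill jobs/steps that exceed their requested memory,
>     # based on the values sampled by the accounting plugin
>     JobAcctGatherType=jobacct_gather/linux
>     JobAcctGatherParams=OverMemoryKill
>     # Do NOT also set ConstrainRAMSpace=yes or ConstrainSwapSpace=yes in
>     # cgroup.conf; per the note above, slurmd/slurmctld will fatal if both
>     # mechanisms are enabled.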
>
>
> So, is it really necessary to do what you suggested to get that
> functionality?
>
> If someone could post just a simple slurm.conf file that forces the
> memory limits to be honored (and kills the job if they are exceeded),
> then I could extract what I need from that.
>
> Again, thanks for the assistance.
>
> Mike
>
>
>
> On Thu, Oct 24, 2019 at 11:27 PM mercan <ahmet.mercan at uhem.itu.edu.tr> wrote:
>
> Hi;
>
> You should set
>
> SelectType=select/cons_res
>
> plus one of these:
>
> SelectTypeParameters=CR_Memory
> SelectTypeParameters=CR_Core_Memory
> SelectTypeParameters=CR_CPU_Memory
> SelectTypeParameters=CR_Socket_Memory
>
> to enable memory allocation tracking, according to the documentation:
>
> https://slurm.schedmd.com/cons_res_share.html
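>
> For example, a fragment with one of those combinations (CR_Core_Memory is
> only an example; choose the option that matches your cluster):
>
>     SelectType=select/cons_res
>     SelectTypeParameters=CR_Core_Memory
>     # With a CR_*_Memory option, memory is a consumable resource, so the
>     # scheduler tracks each job's --mem request against the node's memory.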
>
> Also, the line:
>
> #SBATCH --mem=1GBB
>
> contains "1GBB". Is it the same in the job script?
>
>
> Regards;
>
> Ahmet M.
>
>
> On 24.10.2019 23:00, Mike Mosley wrote:
> > Hello,
> >
> > We are testing Slurm 19.05 on Linux RHEL 7.5+ with the intent to
> > migrate to it from Torque/Moab in the near future.
> >
> > One of the things our users are used to is that when their jobs exceed
> > the amount of memory they requested, the job is terminated by the
> > scheduler. We realize that Slurm prefers to use cgroups to contain
> > rather than kill the jobs, but initially we need to have the kill
> > option in place to transition our users.
> >
> > So, looking at the documentation, it appears that in 19.05, the
> > following needs to be set to accomplish this:
> >
> > JobAcctGatherParams = OverMemoryKill
> >
> >
> > Other possibly relevant settings we made:
> >
> > JobAcctGatherType = jobacct_gather/linux
> >
> > ProctrackType = proctrack/linuxproc
> >
> >
> > We have avoided configuring any cgroup parameters for the time being.
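> >
> > (Taken together, a sketch of how those settings would appear in slurm.conf;
> > this just restates the parameters above as a file fragment:)
> >
> >     JobAcctGatherType=jobacct_gather/linux
> >     JobAcctGatherParams=OverMemoryKill
> >     ProctrackType=proctrack/linuxproc
> >     # No cgroup.conf constraints configured for now.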
> >
> > Unfortunately, when we submit a job with the following:
> >
> > #SBATCH --nodes=1
> >
> > #SBATCH --ntasks-per-node=1
> >
> > #SBATCH --mem=1GBB
> >
> >
> > We see the RSS of the job steadily increase beyond the 1GB limit and
> > it is never killed. Interestingly enough, the proc information shows
> > the ulimit (hard and soft) for the process set to around 1GB.
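> >
> > (One way to see what the accounting plugin is actually sampling for a
> > running job is sstat; the job ID below is a placeholder:)
> >
> >     sstat -a -j <jobid> --format=JobID,MaxRSS,AveRSS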
> >
> > We have tried various settings without any success. Can anyone point
> > out what we are doing wrong?
> >
> > Thanks,
> >
> > Mike
> >
> > --
> > J. Michael Mosley
> > University Research Computing
> > The University of North Carolina at Charlotte
> > 9201 University City Blvd
> > Charlotte, NC 28223
> > 704.687.7065    mmosley at uncc.edu
>
>
>
> --
> J. Michael Mosley
> University Research Computing
> The University of North Carolina at Charlotte
> 9201 University City Blvd
> Charlotte, NC 28223
> 704.687.7065    mmosley at uncc.edu