[slurm-users] OverMemoryKill Not Working?

Juergen Salk juergen.salk at uni-ulm.de
Fri Oct 25 15:25:56 UTC 2019


Hi Mike,

IIRC, I once did some tests with the very same configuration as
yours, i.e. `JobAcctGatherType=jobacct_gather/linux´ and
`JobAcctGatherParams=OverMemoryKill´, and got this to work as expected:
jobs were killed when they exceeded the requested amount of memory.
This was with Slurm 18.08.7. After some tests I went back to memory
enforcement with cgroups, as this also takes into account memory
that is consumed by writing data to a tmpfs filesystem such as
/dev/shm.
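
A cgroup based setup of that kind boils down to something like the
following (a minimal sketch of the relevant parameters only, not a
complete configuration):

  # slurm.conf
  ProctrackType=proctrack/cgroup
  TaskPlugin=task/cgroup

  # cgroup.conf
  ConstrainRAMSpace=yes
  ConstrainSwapSpace=yes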

I have now restored the old configuration that I think I used to
experiment with Slurm's capability to enforce memory usage without
cgroups, and tried again in my test cluster (now running Slurm
19.05.2). As far as I understand, the configuration above should
also work with 19.05.2.

But I was surprised to see that I can also reproduce the behavior
you described: a process that exceeds the requested amount of memory
keeps happily running.

Anyway, I think memory enforcement with cgroups is more reliable
and, thus, more commonly used these days. Recently there was an
interesting discussion on this list about how to get Slurm to cancel
the whole job if the memory limit is exceeded (not just oom-kill some
processes). Someone suggested setting `KillOnBadExit=1´ in
slurm.conf. Someone else suggested using `set -o errexit´ (or
`#!/bin/bash -e´ instead of plain `#!/bin/bash´) in the job scripts,
so that the failure of any command within the script causes the job
to stop immediately.
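
Just to illustrate the latter idea, a minimal job script might look
like this (a sketch only; the program name is made up, and
`KillOnBadExit=1´ is assumed to be set in slurm.conf):

  #!/bin/bash -e
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --mem=1G

  # With -e (errexit) the script aborts as soon as any command fails,
  # e.g. when a step gets killed for exceeding the requested memory.
  srun ./my_program
  echo "Only reached if the previous step succeeded."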

You may find the thread in the list archive if you search for 
"How to automatically kill a job that exceeds its memory limits".

Best regards
Jürgen

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Phone: +49 (0)731 50-22478
Fax: +49 (0)731 50-22471



* Mike Mosley <Mike.Mosley at uncc.edu> [191025 09:17]:
> Ahmet,
> 
> Thank you for taking the time to respond to my question.
> 
> Yes, the --mem=1GBB is a typo.   It's correct in my script, I just
> fat-fingered it in the email. :-)
> 
> BTW, the exact version I am using is 19.05.2.
> 
> Regarding your response, it seems that might be more than I need.   I
> simply want to enforce the memory limits as specified by the user
> at job submission time.   This seems to have been the behavior in previous
> versions of Slurm.   What I want is what is described in the 19.05 release
> notes:
> 
> 
> 
> RELEASE NOTES FOR SLURM VERSION 19.05
> 28 May 2019
> 
> NOTE: slurmd and slurmctld will now fatal if two incompatible mechanisms for
>       enforcing memory limits are set. This makes incompatible the use of
>       task/cgroup memory limit enforcing (Constrain[RAM|Swap]Space=yes) with
>       JobAcctGatherParams=OverMemoryKill, which could cause problems when a
>       task is killed by one of them while the other is at the same time
>       managing that task. The NoOverMemoryKill setting has been deprecated
>       in favor of OverMemoryKill, since now the default is *NOT* to have any
>       memory enforcement mechanism.
> 
> NOTE: MemLimitEnforce parameter has been removed and the functionality that
>       was provided with it has been merged into a JobAcctGatherParams. It
>       may be enabled by setting JobAcctGatherParams=OverMemoryKill, so now
>       job and steps killing by OOM is enabled from the same place.
> 
> 
> 
> So, is it really necessary to do what you suggested to get that
> functionality?
> 
> If someone could post just a simple slurm.conf file that forces the memory
> limits to be honored (and kills the job if they are exceeded), then I could
> extract what I need from that.
> 
> Again, thanks for the assistance.
> 
> Mike
> 
> 
> 
> On Thu, Oct 24, 2019 at 11:27 PM mercan <ahmet.mercan at uhem.itu.edu.tr>
> wrote:
> 
> > Hi;
> >
> > You should set
> >
> > SelectType=select/cons_res
> >
> > and plus one of these:
> >
> > SelectTypeParameters=CR_Memory
> > SelectTypeParameters=CR_Core_Memory
> > SelectTypeParameters=CR_CPU_Memory
> > SelectTypeParameters=CR_Socket_Memory
> >
> > to enable memory allocation tracking, according to the documentation:
> >
> > https://slurm.schedmd.com/cons_res_share.html
> >
> > Also, the line:
> >
> > #SBATCH --mem=1GBB
> >
> > contains "1GBB". Is it the same in your job script?
> >
> >
> > Regards;
> >
> > Ahmet M.
> >
> >
> > On 24.10.2019 23:00, Mike Mosley wrote:
> > > Hello,
> > >
> > > We are testing Slurm 19.05 on Linux RHEL7.5+ with the intent to migrate
> > > to it from Torque/Moab in the near future.
> > >
> > > One of the things our users are used to is that when their jobs exceed
> > > the amount of memory they requested, the job is terminated by the
> > > scheduler.   We realize that Slurm prefers to use cgroups to contain
> > > rather than kill the jobs, but initially we need to have the kill
> > > option in place to transition our users.
> > >
> > > So, looking at the documentation, it appears that in 19.05, the
> > > following needs to be set to accomplish this:
> > >
> > > JobAcctGatherParams = OverMemoryKill
> > >
> > >
> > > Other possibly relevant settings we made:
> > >
> > > JobAcctGatherType = jobacct_gather/linux
> > >
> > > ProctrackType = proctrack/linuxproc
> > >
> > >
> > > We have avoided configuring any cgroup parameters for the time being.
> > >
> > > Unfortunately, when we submit a job with the following:
> > >
> > > #SBATCH --nodes=1
> > >
> > > #SBATCH --ntasks-per-node=1
> > >
> > > #SBATCH --mem=1GBB
> > >
> > >
> > > We see the RSS of the job steadily increase beyond the 1GB limit and it
> > > is never killed.    Interestingly enough, the proc information shows the
> > > ulimit (hard and soft) for the process set to around 1GB.
> > >
> > > We have tried various settings without any success.   Can anyone point
> > > out what we are doing wrong?
> > >
> > > Thanks,
> > >
> > > Mike
> > >
> > > --
> > > J. Michael Mosley
> > > University Research Computing
> > > The University of North Carolina at Charlotte
> > > 9201 University City Blvd
> > > Charlotte, NC  28223
> > > 704.687.7065    jmmosley at uncc.edu <mailto:mmosley at uncc.edu>
> >
> 
> 
> -- 
> J. Michael Mosley
> University Research Computing
> The University of North Carolina at Charlotte
> 9201 University City Blvd
> Charlotte, NC  28223
> 704.687.7065    jmmosley at uncc.edu <mmosley at uncc.edu>



-- 
GPG A997BA7A | 87FC DA31 5F00 C885 0DC3  E28F BD0D 4B33 A997 BA7A


