<div dir="ltr">Jurgen,<div><br></div><div>Thank you for all of the information. I appreciate you taking the time to test the configuration with 19.05.</div><div>I feel a little better about my efforts now. :-)</div><div><br></div><div>I will check out some of your suggestions to mitigate the issue. Ultimately, we will probably use cgroups for containment, but as I explained in my earlier post, we wanted to have the kill option available initially, and it seemed like it should have been simple to set up.</div><div><br></div><div>It would be interesting to know if anyone else could comment on whether 19.05 acts differently from previous versions, in light of the release notes and documentation.</div><div><br></div><div>Again, thank you for your time. It was very helpful.</div><div><br></div><div>Mike</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Oct 25, 2019 at 11:26 AM Juergen Salk <<a href="mailto:juergen.salk@uni-ulm.de">juergen.salk@uni-ulm.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Mike,<br>

<br>

IIRC, I once did some tests with the very same configuration as<br>

yours, i.e. `JobAcctGatherType=jobacct_gather/linux´ and<br>

`JobAcctGatherParams=OverMemoryKill´ and got this to work as expected:<br>

Jobs were killed when they exceeded the requested amount of memory.<br>

This was with Slurm 18.08.7. After some tests I went back <br>

to memory enforcement with cgroups, as this also takes into <br>

account memory that is consumed by writing data to a tmpfs filesystem, such <br>

as /dev/shm.  <br>

<br>

I have now restored the old configuration that I think I used to<br>

experiment with Slurm's capabilities to enforce memory usage without<br>

cgroups and then tried again in my test cluster (now running Slurm<br>

19.05.2). As far as I understand, the configuration above should <br>

also work with 19.05.2.<br>

<br>

But I was surprised to see that I can also reproduce the <br>

behavior that you described: A process that exceeds the <br>

requested amount of memory keeps happily running. <br>

<br>

Anyway, I think memory enforcement with cgroups is more reliable<br>

and, thus, more commonly used these days. Recently there was an<br>

interesting discussion on this list about how to get Slurm to cancel<br>

the whole job if the memory is exceeded (not just oom-kill some<br>

processes). Someone suggested setting `KillOnBadExit=1´ in<br>

slurm.conf. Someone else suggested using `set -o errexit´ (or<br>

#!/bin/bash -e instead of plain #!/bin/bash) in the job scripts, so<br>

that the failure of any command within the script will cause the job<br>

to stop immediately.<br>

<br>

You may find the thread in the list archive if you search for <br>

"How to automatically kill a job that exceeds its memory limits".<br>
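<br>

The `set -o errexit´ suggestion above can be sketched as follows. This is only<br>

a minimal illustration, not a tested Slurm recipe: the function name job_body<br>

and the two steps are hypothetical, and `false´ merely stands in for a command<br>

that exits non-zero, such as a process terminated by the OOM killer.<br>

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=1G

# Hypothetical job body. The parentheses make the function run in a
# subshell, so errexit ends the job body without ending the calling shell.
job_body() (
    set -o errexit      # stop at the first command that exits non-zero
    echo "step 1"
    false               # stand-in for a failing step, e.g. an OOM-killed process
    echo "step 2"       # not reached, because the previous command failed
)
```

In a real job script, calling job_body as the last command would make the<br>

script exit with the status of the failed step instead of carrying on.<br>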

<br>

Best regards<br>

Jürgen<br>

<br>

-- <br>

Jürgen Salk<br>

Scientific Software & Compute Services (SSCS)<br>

Kommunikations- und Informationszentrum (kiz)<br>

Universität Ulm<br>

Telefon: +49 (0)731 50-22478<br>

Telefax: +49 (0)731 50-22471<br>

<br>

<br>

<br>

* Mike Mosley <<a href="mailto:Mike.Mosley@uncc.edu" target="_blank">Mike.Mosley@uncc.edu</a>> [191025 09:17]:<br>

> Ahmet,<br>

> <br>

> Thank you for taking the time to respond to my question.<br>

> <br>

> Yes, the --mem=1GBB is a typo.   It's correct in my script, I just<br>

> fat-fingered it in the email. :-)<br>

> <br>

> BTW, the exact version I am using is 19.05.*2.*<br>

> <br>

> Regarding your response, it seems that that might be more than what I<br>

> need.   I simply want to enforce the memory limits as specified by the user<br>

> at job submission time.   This seems to have been the behavior in previous<br>

> versions of Slurm.   What I want is what is described in the 19.05 release<br>

> notes:<br>

> <br>

> <br>

> <br>

> *RELEASE NOTES FOR SLURM VERSION 19.05, 28 May 2019*<br>

> <br>

> <br>

> <br>

> *NOTE: slurmd and slurmctld will now fatal if two incompatible mechanisms<br>

> for enforcing memory limits are set. This makes incompatible the use<br>

> of task/cgroup memory limit enforcing (Constrain[RAM|Swap]Space=yes)<br>

> with JobAcctGatherParams=OverMemoryKill, which could cause problems<br>

> when a task is killed by one of them while the other is at the same<br>

> time managing that task. The NoOverMemoryKill setting has been<br>

> deprecated in favor of OverMemoryKill, since now the default is *NOT*<br>

> to have any memory enforcement mechanism.<br>

> <br>

> NOTE: MemLimitEnforce parameter has been removed and the functionality that<br>

> was provided with it has been merged into a JobAcctGatherParams. It<br>

> may be enabled by setting JobAcctGatherParams=OverMemoryKill, so now<br>

> job and steps killing by OOM is enabled from the same place.*<br>

> <br>

> <br>

> <br>

> So, is it really necessary to do what you suggested to get that<br>

> functionality?<br>

> <br>

> If someone could post just a simple slurm.conf file that forces the memory<br>

> limits to be honored (and kills the job if they are exceeded), then I could<br>

> extract what I need from that.<br>

> <br>

> Again, thanks for the assistance.<br>

> <br>

> Mike<br>

> <br>

> <br>

> <br>

> On Thu, Oct 24, 2019 at 11:27 PM mercan <<a href="mailto:ahmet.mercan@uhem.itu.edu.tr" target="_blank">ahmet.mercan@uhem.itu.edu.tr</a>><br>

> wrote:<br>

> <br>

> > Hi;<br>

> ><br>

> > You should set<br>

> ><br>

> > SelectType=select/cons_res<br>

> ><br>

> > plus one of these:<br>

> ><br>

> > SelectTypeParameters=CR_Memory<br>

> > SelectTypeParameters=CR_Core_Memory<br>

> > SelectTypeParameters=CR_CPU_Memory<br>

> > SelectTypeParameters=CR_Socket_Memory<br>

> ><br>

> > to enable memory allocation tracking, according to the documentation:<br>

> ><br>

> > <a href="https://slurm.schedmd.com/cons_res_share.html" rel="noreferrer" target="_blank">https://slurm.schedmd.com/cons_res_share.html</a><br>

> ><br>

> > Also, the line:<br>

> ><br>

> > #SBATCH --mem=1GBB<br>

> ><br>

> > contains "1GBB". Is this the same in the job script?<br>

> ><br>

> ><br>

> > Regards;<br>

> ><br>

> > Ahmet M.<br>

> ><br>

> ><br>

> > On 24.10.2019 at 23:00, Mike Mosley wrote:<br>

> > > Hello,<br>

> > ><br>

> > > We are testing Slurm 19.05 on Linux RHEL 7.5+ with the intent to migrate<br>

> > > to it from Torque/Moab in the near future.<br>

> > ><br>

> > > One of the things our users are used to is that when their jobs exceed<br>

> > > the amount of memory they requested, the job is terminated by the<br>

> > > scheduler.   We realize that Slurm prefers to use cgroups to contain<br>

> > > rather than kill the jobs but initially we need to have the kill<br>

> > > option in place to transition our users.<br>

> > ><br>

> > > So, looking at the documentation, it appears that in 19.05, the<br>

> > > following needs to be set to accomplish this:<br>

> > ><br>

> > > JobAcctGatherParams = OverMemoryKill<br>

> > ><br>

> > ><br>

> > > Other possibly relevant settings we made:<br>

> > ><br>

> > > JobAcctGatherType = jobacct_gather/linux<br>

> > ><br>

> > > ProctrackType = proctrack/linuxproc<br>

> > ><br>

> > ><br>

> > > We have avoided configuring any cgroup parameters for the time being.<br>

> > ><br>

> > > Unfortunately, when we submit a job with the following:<br>

> > ><br>

> > > #SBATCH --nodes=1<br>

> > ><br>

> > > #SBATCH --ntasks-per-node=1<br>

> > ><br>

> > > #SBATCH --mem=1GBB<br>

> > ><br>

> > ><br>

> > > We see the RSS of the job steadily increase beyond the 1GB limit and it is<br>

> > > never killed.    Interestingly enough, the proc information shows the<br>

> > > ulimit (hard and soft) for the process set to around 1GB.<br>

> > ><br>

> > > We have tried various settings without any success.   Can anyone point<br>

> > > out what we are doing wrong?<br>

> > ><br>

> > > Thanks,<br>

> > ><br>

> > > Mike<br>

> > ><br>

> > > --<br>

> > > */J. Michael Mosley/*<br>

> > > University Research Computing<br>

> > > The University of North Carolina at Charlotte<br>

> > > 9201 University City Blvd<br>

> > > Charlotte, NC  28223<br>

> > > _704.687.7065 _ _ j/<a href="mailto:mmosley@uncc.edu" target="_blank">mmosley@uncc.edu</a> <mailto:<a href="mailto:mmosley@uncc.edu" target="_blank">mmosley@uncc.edu</a>>/_<br>

> ><br>

> <br>

> <br>

> -- <br>

> *J. Michael Mosley*<br>

> University Research Computing<br>

> The University of North Carolina at Charlotte<br>

> 9201 University City Blvd<br>

> Charlotte, NC  28223<br>

> *704.687.7065 *    * <a href="mailto:jmmosley@uncc.edu" target="_blank">jmmosley@uncc.edu</a> <<a href="mailto:mmosley@uncc.edu" target="_blank">mmosley@uncc.edu</a>>*<br>

<br>

<br>

<br>

-- <br>

GPG A997BA7A | 87FC DA31 5F00 C885 0DC3  E28F BD0D 4B33 A997 BA7A<br>

<br>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div style="font-size:12.8px"><div dir="ltr"><div><span style="font-family:"times new roman",serif"><b><i>J. Michael Mosley</i></b><br>University Research Computing<br>The University of North Carolina at Charlotte<br>9201 University City Blvd<br>Charlotte, NC  28223<br><u>704.687.7065 </u>    <u> j<i><a href="mailto:mmosley@uncc.edu" target="_blank">mmosley@uncc.edu</a></i></u></span></div></div></div></div></div>