<div dir="ltr"><div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:monospace">Hi</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">I don't know what version of Slurm you're using or how it may be different from the one I'm using (18.05), but here's my understanding of memory limits and what I'm seeing on our cluster. The parameter `JobAcctGatherParams=OverMemoryKill` controls whether a step is killed if it goes over the requested memory. Note that `NoOverMemoryKill` is deprecated (at least in Slurm 18) as the default is that job accounting won't kill steps that go over the limit. We are using the default (so no killing of job steps) and I've been able to verify this is the case. There is another parameter- `MemLimitEnforce`- default is no, and we're using the default. Again, I've run jobs with very small memory limits and have seen that Slurm isn't killing jobs or steps even when the limit is exceeded.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">I don't know what your setting for select type is- we're using `CR_CPU`, so memory doesn't get calculated in job allocation. The biggest bites for us with this configuration is that- since jobs share nodes- we have relatively frequent instances where jobs run into OOM conditions since we aren't using cgroups. I don't know about the long term sustainability of having mismatched slurm.conf files.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">So I'd check those settings and what your version of Slurm sets for default. I seem to be getting your desired behavior with those settings and Slurm 18. Hope this helps</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">Michael</div><div class="gmail_default" style="font-family:monospace"><br></div><input name="virtru-metadata" type="hidden" value="{"email-policy":{"state":"closed","expirationUnit":"days","disableCopyPaste":false,"disablePrint":false,"disableForwarding":false,"enableNoauth":false,"expires":false,"isManaged":false},"attachments":{},"compose-id":"3","compose-window":{"secure":false}}"></div></div><br><div class="gmail_quote" style=""><div dir="ltr" class="gmail_attr">On Sat, Feb 23, 2019 at 9:25 PM Aurélien Vallée <<a href="mailto:vallee.aurelien@gmail.com">vallee.aurelien@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello,<br>
On Sat, Feb 23, 2019 at 9:25 PM Aurélien Vallée <vallee.aurelien@gmail.com> wrote:

> Hello,
>
> I am in a situation where evaluating the precise memory consumption of jobs beforehand is pretty challenging. So I would like to create a "trust" system, meaning that the requested memory for jobs is taken into account for scheduling, but no action is taken if the job actually breaches the limit once it is running on the node.
> I tried to use NoOverMemoryKill, but it seems to work only for sbatch, not srun.
> So I ended up declaring memory as a non-consumable resource in the slurm.conf on the nodes, but not on the master. This seems to work, but it looks rather hackish (and Slurm complains about the discrepancy in configuration).
> Is this a supported practice? Can it bite me later on? Is there a cleaner solution?