[slurm-users] possible to set memory slack space before killing jobs?

Michael Robbert mrobbert at mines.edu
Mon Dec 10 09:33:50 MST 2018


If you want to detect lost DIMMs or anything like that, use a Node Health 
Check (NHC) script. I recommend and use this one: https://github.com/mej/nhc

It has an option to generate a configuration file that will watch far 
more than you probably need, but if you want to know whether something on 
your nodes has changed from what was there yesterday, that is the way to go.
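
As a rough sketch of what that looks like (the exact file names, check 
arguments, and sizes below are illustrative, so check the NHC docs for 
your version):

    # Generate a baseline config from the node's current hardware:
    nhc-genconf

    # Excerpt of /etc/nhc/nhc.conf: flag a node whose physical memory
    # no longer matches the healthy baseline (trailing 5% is a fudge factor):
    * || check_hw_physmem 256GB 256GB 5%

    # Hook NHC into Slurm via slurm.conf so failing nodes get drained:
    HealthCheckProgram=/usr/sbin/nhc
    HealthCheckInterval=300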

Mike

On 12/7/18 3:15 AM, Bjørn-Helge Mevik wrote:
> Eli V <eliventer at gmail.com> writes:
>
>> On Wed, Dec 5, 2018 at 5:04 PM Bjørn-Helge Mevik <b.h.mevik at usit.uio.no> wrote:
>>> I don't think Slurm has any facility for soft memory limits.
>>>
>>> But you could emulate it by simply configure the nodes in slurm.conf
>>> with, e.g., 15% higher RealMemory value than what is actually available
>>> on the node.  Then a node with 256 GiB RAM would be able to run 9 jobs,
>>> each asking for 32 GiB RAM.
>>>
>>> (You wouldn't get the effect that a job would be allowed to exceed its
>>> soft limit for a set amount of time before getting killed, though.)
>> I don't think this is possible currently. In my experience slurm
>> will auto drain a node if its actual physical memory is less than
>> what's defined for it in slurm.conf.
> True.  I forgot about that.  You could prevent slurm from draining them
> by setting FastSchedule=2 in slurm.conf, but then you wouldn't detect
> nodes losing RAM (which does happen from time to time).
>
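
For reference, a sketch of the RealMemory overcommit trick Bjørn-Helge 
describes above (node names and counts are made up; RealMemory is in MiB, 
and 301465 MiB is about 15% above the 262144 MiB a 256 GiB node really has, 
so nine 32 GiB jobs, i.e. 294912 MiB, would fit):

    # slurm.conf (illustrative): advertise ~15% more memory than exists.
    NodeName=node[001-016] CPUs=32 RealMemory=301465 State=UNKNOWN

    # FastSchedule=2 makes slurmctld trust the configured values instead
    # of what slurmd reports, so nodes are not drained for "low real
    # memory" -- at the cost of not noticing a node that loses a DIMM.
    FastSchedule=2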

