[slurm-users] Memory oversubscription and scheduling
Chris Samuel
chris at csamuel.org
Thu May 10 04:02:37 MDT 2018
On Monday, 7 May 2018 11:58:38 PM AEST Cory Holcomb wrote:
> Thank you for the reply. I was beginning to wonder if my message was seen.
It's a busy list at times. :-)
> While I understand how batch systems work, you can have a system daemon that
> develops a memory leak and consumes memory outside of any job's allocation.
Understood.
> Not checking the used memory on the box before dispatch seems like a good
> way to black hole a bunch of jobs.
This is why Slurm has support for health check scripts that can run at a regular
interval on every node, as well as prolog/epilog scripts that run before and after
each job. These can drain nodes and so take them out of scheduling. It's all
documented in the slurm.conf manual page.
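As a minimal sketch (the script path is just a placeholder, and you'd pick your
own interval), the relevant slurm.conf settings look something like:

    # Run a site-provided health check on every node every 5 minutes,
    # whether the node is idle or allocated.
    HealthCheckProgram=/usr/local/sbin/node_healthcheck.sh
    HealthCheckInterval=300
    HealthCheckNodeState=ANY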
For instance, there's the LBNL Node Health Check (NHC) system that plugs into
both Slurm and Torque.
https://slurm.schedmd.com/SUG14/node_health_check.pdf
https://github.com/mej/nhc
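An NHC configuration to catch the sort of leak you describe might look roughly
like this (a sketch only; the threshold is made up and the check names and
arguments should be verified against the NHC docs):

    # /etc/nhc/nhc.conf -- "*" means the check applies to every node.
    # Drain the node if free memory drops too low or slurmd isn't running.
    * || check_hw_mem_free 1gb
    * || check_ps_daemon slurmd root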
At ${JOB-1} we ran our in-house health check from cron and wrote its result to a
file in /dev/shm, so that all the actual Slurm health check script had to do was
relay that result to Slurm (and raise an error if the file was missing). We did
this because we used to see the checks themselves block on node problems, which
would lock up slurmd while it ran them. Decoupling the two fixed that.
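Roughly, the pattern was (a sketch of the idea rather than our actual scripts;
the file name, cron interval and drain reasons are invented for illustration):
cron runs the slow checks out of band and writes any problems to a file,

    */5 * * * * root /usr/local/sbin/real_healthcheck.sh > /dev/shm/healthcheck.result 2>&1

and the script Slurm itself runs (HealthCheckProgram) is just a thin, fast wrapper:

    #!/bin/bash
    # Relay the cached result; never do anything here that can block.
    RESULT=/dev/shm/healthcheck.result
    HOST=$(hostname -s)
    if [ ! -e "$RESULT" ]; then
        # The cron check hasn't run (or died) -- treat that as an error too.
        scontrol update NodeName="$HOST" State=DRAIN Reason="healthcheck result missing"
        exit 1
    elif [ -s "$RESULT" ]; then
        # A non-empty file means the out-of-band check found a problem.
        scontrol update NodeName="$HOST" State=DRAIN Reason="$(head -c 64 "$RESULT")"
        exit 1
    fi
    exit 0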
Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC