[slurm-users] Memory oversubscription and scheduling

Michael Jennings mej at lanl.gov
Thu May 10 09:00:24 MDT 2018


On Thursday, 10 May 2018, at 20:02:37 (+1000),
Chris Samuel wrote:

> For instance there's the LBNL Node Health Check (NHC) system that plugs into 
> both Slurm and Torque.
> 
> https://slurm.schedmd.com/SUG14/node_health_check.pdf
> 
> https://github.com/mej/nhc
> 
> At ${JOB-1} we would run our in-house health check from cron and
> write its result to a file in /dev/shm, so that all the actual Slurm
> health check script had to do was send that result to Slurm (and
> raise an error if the file was missing).  This was because we used
> to see the health checks themselves block on node problems, which
> would lock up the slurmd that ran them.  Decoupling them fixed that.

I'm surprised to hear that; this is the first time I've ever heard
of that happening with SLURM.  I'd only ever heard folks complain
about TORQUE having that issue.
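
For anyone wanting to try the decoupled approach Chris describes, a
minimal sketch is below.  It assumes a cron entry runs the script
with --cron to refresh a result file in /dev/shm, while Slurm's
HealthCheckProgram invokes it with no arguments to relay the cached
result.  The file path, the example check, and the scontrol call are
all illustrative, not anything from Chris's actual setup:

#!/usr/bin/env python3
# Sketch of the decoupled health check described above.  run_checks()
# is meant to be driven by cron and caches its verdict in /dev/shm;
# report() is what Slurm's HealthCheckProgram would invoke, and it
# only reads the cache, so slurmd never blocks on a slow or hung
# check.  Paths, the example check, and the scontrol call are
# illustrative.
import os
import subprocess
import sys
import time

RESULT_FILE = "/dev/shm/node_health"   # hypothetical cache location
MAX_AGE = 600                          # seconds before results go stale

def run_checks():
    """Run from cron: do the real (possibly slow) checks, cache result."""
    problems = []
    if os.statvfs("/tmp").f_bavail == 0:        # toy example check
        problems.append("/tmp full")
    # ... more (potentially blocking) checks here ...
    tmp = RESULT_FILE + ".tmp"
    with open(tmp, "w") as f:
        f.write("; ".join(problems) if problems else "OK")
    os.rename(tmp, RESULT_FILE)                 # atomic update

def report():
    """Run as HealthCheckProgram: just relay the cached result."""
    try:
        if time.time() - os.path.getmtime(RESULT_FILE) > MAX_AGE:
            raise RuntimeError("cached health check result is stale")
        with open(RESULT_FILE) as f:
            status = f.read().strip()
    except (OSError, RuntimeError) as err:
        status = f"health check result unavailable: {err}"
    if status != "OK":
        # Drain the node so Slurm stops scheduling onto it.
        subprocess.run(["scontrol", "update",
                        f"NodeName={os.uname().nodename}",
                        "State=DRAIN", f"Reason={status}"], check=False)
        sys.exit(1)

if __name__ == "__main__":
    run_checks() if "--cron" in sys.argv else report()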

FWIW, due to this exact situation, LBNL NHC has a built-in feature
called "Detached Mode" that will do something very similar, but it's
all self-contained within NHC.  The foreground process forks off a
child and immediately returns the results from the previous run,
while the child runs all the tests and stores its results to a file
for the next health check cycle to return.

You can read more about it here:  https://github.com/mej/nhc#detached-mode
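
In rough terms, the pattern looks like the sketch below.  To be
clear, this is NOT NHC's actual code (NHC itself is a bash script),
and the cache path is made up; it just shows the fork-and-cache idea,
where the slow work always happens off the critical path and the
caller only ever waits on a file read:

#!/usr/bin/env python3
# Sketch of the fork-and-cache pattern behind Detached Mode.  The
# foreground process reports the *previous* cycle's result
# immediately, while a forked child runs this cycle's tests and
# caches the outcome for next time.
import os
import sys

CACHE = "/dev/shm/health_result"        # hypothetical cache file

def run_all_tests() -> str:
    """Stand-in for the real battery of (possibly slow) checks."""
    return "OK"

def main():
    if os.fork() == 0:
        # Child: detach from the caller and do the slow work.
        os.setsid()
        result = run_all_tests()
        tmp = CACHE + ".tmp"
        with open(tmp, "w") as f:
            f.write(result)
        os.rename(tmp, CACHE)           # atomic replace
        os._exit(0)
    # Parent: return last cycle's result without waiting on the child.
    try:
        with open(CACHE) as f:
            previous = f.read().strip()
    except FileNotFoundError:
        previous = "OK"                 # first cycle: nothing cached yet
    print(previous)
    sys.exit(0 if previous == "OK" else 1)

if __name__ == "__main__":
    main()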

HTH,
Michael

PS:  Hi Chris!  :-D

-- 
Michael E. Jennings <mej at lanl.gov>
HPC Systems Team, Los Alamos National Laboratory
Bldg. 03-2327, Rm. 2341     W: +1 (505) 606-0605


