[slurm-users] Memory oversubscription and scheduling
Chris Samuel
chris at csamuel.org
Thu May 10 04:02:37 MDT 2018
On Monday, 7 May 2018 11:58:38 PM AEST Cory Holcomb wrote:
> Thank you for the reply. I was beginning to wonder if my message was seen.
It's a busy list at times. :-)
> While I understand how batch systems work, you can have a system daemon that
> develops a memory leak and consumes memory outside of any job's allocation.
Understood.
> Not checking the used memory on the box before dispatch seems like a good
> way to black hole a bunch of jobs.
This is why Slurm has support for health check scripts that can run at a regular
interval on every node, as well as prolog/epilog scripts that run before and after
each job. These can drain nodes and so take them out of scheduling. It's all
documented in the slurm.conf manual page.
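As a minimal sketch (the script path is just a placeholder, and you'd pick your
own interval), the relevant slurm.conf settings look something like:

    # Run a site-provided health check on every node every 5 minutes,
    # whether the node is idle or allocated.
    HealthCheckProgram=/usr/local/sbin/node_healthcheck.sh
    HealthCheckInterval=300
    HealthCheckNodeState=ANY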
For instance, there's the LBNL Node Health Check (NHC) system that plugs into
both Slurm and Torque.
https://slurm.schedmd.com/SUG14/node_health_check.pdf
https://github.com/mej/nhc
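An NHC configuration to catch the sort of leak you describe might look roughly
like this (a sketch only; the threshold is made up and the check names and
arguments should be verified against the NHC docs):

    # /etc/nhc/nhc.conf -- "*" means the check applies to every node.
    # Drain the node if free memory drops too low or slurmd isn't running.
    * || check_hw_mem_free 1gb
    * || check_ps_daemon slurmd root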
At ${JOB-1} we ran our in-house health check from cron and wrote its result to a
file in /dev/shm, so that all the actual Slurm health check script had to do was
relay that result to Slurm (and raise an error if the file was missing). We did
this because we used to see the checks themselves block on node problems, which
would lock up slurmd while it ran them. Decoupling the two fixed that.
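Roughly, the pattern was (a sketch of the idea rather than our actual scripts;
the file name, cron interval and drain reasons are invented for illustration):
cron runs the slow checks out of band and writes any problems to a file,

    */5 * * * * root /usr/local/sbin/real_healthcheck.sh > /dev/shm/healthcheck.result 2>&1

and the script Slurm itself runs (HealthCheckProgram) is just a thin, fast wrapper:

    #!/bin/bash
    # Relay the cached result; never do anything here that can block.
    RESULT=/dev/shm/healthcheck.result
    HOST=$(hostname -s)
    if [ ! -e "$RESULT" ]; then
        # The cron check hasn't run (or died) -- treat that as an error too.
        scontrol update NodeName="$HOST" State=DRAIN Reason="healthcheck result missing"
        exit 1
    elif [ -s "$RESULT" ]; then
        # A non-empty file means the out-of-band check found a problem.
        scontrol update NodeName="$HOST" State=DRAIN Reason="$(head -c 64 "$RESULT")"
        exit 1
    fi
    exit 0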
Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC