[slurm-users] Job still running after process completed
Chris Samuel
chris at csamuel.org
Mon Apr 23 08:18:16 MDT 2018
On Monday, 23 April 2018 11:58:56 PM AEST Paul Edmon wrote:
> I would recommend putting a clean up process in your epilog script.
Instead of that I'd recommend using cgroups to constrain processes to the
resources they have requested; it has the useful side effect of letting Slurm
track all children of the job on that node. The one way some things escape
is if they SSH into other nodes; to stop that, use pam_slurm_adopt to capture
those processes into the "extern" cgroup.
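For anyone setting this up, the relevant configuration is roughly the
following (a minimal sketch only; the option names are from the Slurm docs,
but exact requirements and defaults vary by Slurm version):

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/affinity,task/cgroup
    PrologFlags=contain      # creates the "extern" step/cgroup at allocation,
                             # which pam_slurm_adopt needs

    # cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes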
More on using pam_slurm_adopt here:
https://slurm.schedmd.com/pam_slurm_adopt.html
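The PAM side is roughly a one-line addition to the sshd account stack
(sketch only; see the page above for the caveats about module ordering and
pam_systemd):

    # /etc/pam.d/sshd
    account    required    pam_slurm_adopt.so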
> We have a check here that sees if the job completed and if so it then
> terminates all the user processes by kill -9 to clean up any residuals.
That can be dangerous if you permit jobs to share nodes (which is pretty
standard down here in Australia), as you could end up killing processes from
other jobs on that same node.
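If you do go the epilog route on shared nodes, you at least want to check
that the user has no other jobs left on the node before killing anything.
A hypothetical fragment, just to illustrate the idea (SLURM_JOB_USER and
SLURM_JOB_ID are provided to the epilog by slurmd; matching the node name
against the short hostname is an assumption that may not hold everywhere):

    #!/bin/bash
    # Epilog fragment: only sweep up the user's processes if this was
    # their last job on this node.
    [ "$SLURM_JOB_USER" = "root" ] && exit 0
    remaining=$(squeue -h -o %A -u "$SLURM_JOB_USER" -w "$(hostname -s)" \
                | grep -vx "$SLURM_JOB_ID" | wc -l)
    if [ "$remaining" -eq 0 ]; then
        pkill -KILL -u "$SLURM_JOB_USER"
    fi
    exit 0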
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC