[slurm-users] Job still running after process completed

Chris Samuel chris at csamuel.org
Mon Apr 23 08:18:16 MDT 2018


On Monday, 23 April 2018 11:58:56 PM AEST Paul Edmon wrote:

> I would recommend putting a clean up process in your epilog script.

Instead of that I'd recommend using cgroups to constrain processes to the 
resources they have requested; it has the useful side effect of being able to 
track all children of the job on that node. The one way some things escape 
is if they SSH into other nodes; to stop that, use pam_slurm_adopt to capture 
those processes into the job's "extern" cgroup.

More on using pam_slurm_adopt here:

https://slurm.schedmd.com/pam_slurm_adopt.html
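
The rough shape of the setup is below - a sketch only, exact values will 
depend on your site and Slurm version, so check the docs above.

In slurm.conf:

    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup
    PrologFlags=Contain    # creates the "extern" step that adopted processes land in

In cgroup.conf:

    ConstrainCores=yes
    ConstrainRAMSpace=yes

And in the PAM stack for sshd on the compute nodes (e.g. /etc/pam.d/sshd):

    account    required    pam_slurm_adopt.so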

> We have a check here that sees if the job completed and if so it then
> terminates all the user processes by kill -9 to clean up any residuals.

That can be dangerous if you permit jobs to share nodes (which is pretty 
standard down here in Australia) as you could end up killing processes from 
other jobs on that same node.
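
If you do go down the epilog route on shared nodes, the safer pattern (along 
the lines of the example in the Slurm prolog/epilog docs) is to only purge a 
user's processes once they have no other jobs left on that node. A rough, 
untested sketch, assuming the node's hostname matches its Slurm node name:

    #!/bin/bash
    # Epilog: only clean up if this user has no other jobs on this node.
    [ -z "$SLURM_JOB_UID" ] && exit 0
    # Never touch root or system accounts.
    [ "$SLURM_JOB_UID" -lt 1000 ] && exit 0
    # If the user still has another job here, leave their processes alone.
    for job_id in $(squeue --noheader --format=%i \
                           --user="$SLURM_JOB_UID" --nodelist="$(hostname -s)"); do
        [ "$job_id" != "$SLURM_JOB_ID" ] && exit 0
    done
    # This was the user's last job on the node, so purge any leftovers.
    pkill -KILL -U "$SLURM_JOB_UID"
    exit 0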

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
