[slurm-users] Job still running after process completed

John Hearns hearnsj at googlemail.com
Mon Apr 23 08:47:59 MDT 2018


*Caedite eos. Novit enim Dominus qui sunt eius*
https://en.wikipedia.org/wiki/Caedite_eos._Novit_enim_Dominus_qui_sunt_eius.

I have been wanting to use that line in the context of batch systems and
users for ages.
At least now I can make it a play on killing processes, rather than being
put on a watch list for admins likely to go postal.

ps. that URL really does have a period character on the end.

On 23 April 2018 at 16:18, Chris Samuel <chris at csamuel.org> wrote:

> On Monday, 23 April 2018 11:58:56 PM AEST Paul Edmon wrote:
>
> > I would recommend putting a clean up process in your epilog script.
>
> Instead of that I'd recommend using cgroups to constrain processes to the
> resources they have requested. It has the useful side effect of being able
> to track all children of the job on that node. The one way some things
> escape is if they SSH into other nodes; to stop that, use pam_slurm_adopt
> to capture those processes into the "extern" cgroup.
>
> More on using pam_slurm_adopt here:
>
> https://slurm.schedmd.com/pam_slurm_adopt.html
>
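
As a rough sketch of what that setup can look like: the fragment below is
an assumption pieced together from the Slurm documentation, not a tested
site configuration. File locations vary by distribution, and the options
should be checked against the man pages for the Slurm version in use.

```
# slurm.conf: track and contain job processes with cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
# creates the "extern" step that pam_slurm_adopt adopts SSH sessions into
PrologFlags=contain

# cgroup.conf: constrain jobs to the cores and memory they requested
ConstrainCores=yes
ConstrainRAMSpace=yes

# /etc/pam.d/sshd (path is an assumption): adopt incoming SSH sessions
account    required    pam_slurm_adopt.so
```
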
> > We have a check here that sees if the job completed and if so it then
> > terminates all the user processes by kill -9 to clean up any residuals.
>
> That can be dangerous if you permit jobs to share nodes (which is pretty
> standard down here in Australia) as you could end up killing processes
> from other jobs on that same node.
>
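
To make the shared-node danger concrete, here is a minimal sketch of the
kind of guard an epilog clean-up would need before reaching for kill -9.
The names (Proc, pids_safe_to_kill) are hypothetical, and how the process
list and the list of users with running jobs are gathered is left out. It
only decides which PIDs are safe to signal.

```python
from typing import Iterable, List, NamedTuple


class Proc(NamedTuple):
    """A process on the node (hypothetical record, e.g. parsed from ps)."""
    pid: int
    user: str


def pids_safe_to_kill(
    finished_user: str,
    procs: Iterable[Proc],
    users_with_running_jobs: Iterable[str],
) -> List[int]:
    """Return the PIDs owned by finished_user that can be killed.

    If that user still has another job running on this node, return
    nothing: we cannot tell which job a process belongs to by user alone.
    """
    if finished_user in set(users_with_running_jobs):
        return []
    return [p.pid for p in procs if p.user == finished_user]
```

Even with this check, residual processes survive whenever the user has a
second job on the node, which is the case for letting cgroups do the
tracking per job instead.
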
> All the best,
> Chris
> --
>  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>
>
>