[slurm-users] strigger on CG, completing state

Yair Yarom irush at cs.huji.ac.il
Wed May 29 08:01:31 UTC 2019


Hi,

Check the UnkillableStepProgram and UnkillableStepTimeout options in
slurm.conf.
We use them to drain the stuck nodes and mail us, since, as in your case,
stuck processes usually require a reboot. Because a "drained" strigger
will never fire on such a node, we also set a "finished" trigger
(strigger --fini) on the next RUNNING job. When that trigger fires, it
either sends us mail if only stuck processes remain, or sets another
strigger --fini on the next RUNNING job.
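
As an illustration only (a sketch, not the actual script described above),
a program pointed to by UnkillableStepProgram could look roughly like the
following. The follow-up script path is hypothetical, the node name is
assumed to equal the short hostname, and mailing is left out. Note that
strigger can normally only set triggers when run as SlurmUser, so the
account running this program is also an assumption.

```python
#!/usr/bin/env python3
"""Sketch of an UnkillableStepProgram: drain this node, then watch the next job."""
import shutil
import socket
import subprocess

# Hypothetical follow-up script, run by the trigger when the watched job ends.
FOLLOW_UP_PROGRAM = "/usr/local/sbin/unkillable_followup"


def drain_command(node, reason="unkillable step"):
    """Build the scontrol call that drains the given node."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


def running_jobs_command(node):
    """Build the squeue call that lists IDs of RUNNING jobs on the node."""
    return ["squeue", "--noheader", f"--nodelist={node}",
            "--states=RUNNING", "--format=%A"]


def fini_trigger_command(job_id, program=FOLLOW_UP_PROGRAM):
    """Build the strigger call that fires `program` when the job finishes."""
    return ["strigger", "--set", f"--jobid={job_id}", "--fini",
            f"--program={program}"]


def main():
    # Assumption: the Slurm NodeName matches the short hostname.
    node = socket.gethostname().split(".")[0]
    subprocess.run(drain_command(node), check=True)

    result = subprocess.run(running_jobs_command(node), check=True,
                            capture_output=True, text=True)
    job_ids = result.stdout.split()
    if job_ids:
        # Jobs are still running: check again once the first one finishes.
        subprocess.run(fini_trigger_command(job_ids[0]), check=True)
    # Otherwise only stuck processes remain; mail the admins here (omitted).


# Only act on hosts where the Slurm client tools are installed.
if __name__ == "__main__" and shutil.which("scontrol"):
    main()
```

The follow-up script would repeat the squeue check and either re-arm the
trigger or send the mail.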

    Yair.


On Tue, May 28, 2019 at 7:58 PM mercan <ahmet.mercan at uhem.itu.edu.tr> wrote:

> Hi;
>
> If you are not already using an epilog script, you can set one up to
> clean up any residue left behind by finished jobs:
>
>
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prolog-and-epilog-scripts
>
> Ahmet M.
>
>
> On 28.05.2019 at 19:03, Matthew BETTINGER wrote:
> > We use triggers for the obvious alerts, but is there a way to make a
> > trigger for nodes stuck in the CG (completing) state?  Some user jobs,
> > mostly Julia notebooks, can hang in the completing state if the user
> > kills the running job or cancels it with Ctrl-C.  When this happens we
> > can have many nodes stuck in CG.  Slurm 17.02.6.  Thanks!
> >
>
>

-- 

  /|       |
  \/       | Yair Yarom | Senior DevOps Architect
  []       | The Rachel and Selim Benin School
  [] /\    | of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //    \  | irush at cs.huji.ac.il
 //        |