[slurm-users] strigger on CG, completing state

Matthew BETTINGER matthew.bettinger at external.total.com
Wed May 29 13:42:01 UTC 2019


OK, thanks, we will look into that! We thought we were the only ones with this problem, and yes, it's like Windows 98SE: you can try all you want, but eventually we end up rebooting the nodes. Interns are starting to show up, and you know they can bend a cluster in ways you've never seen before. We will investigate this, as it looks like a more proactive approach than walking in in the morning and seeing hundreds of nodes stuck in CG because an intern didn't tear down their Jupyter sessions in a sane way.

On 5/29/19, 3:02 AM, "slurm-users on behalf of Yair Yarom" <slurm-users-bounces at lists.schedmd.com on behalf of irush at cs.huji.ac.il> wrote:

    Hi,
    
    
    Check the UnkillableStepProgram and UnkillableStepTimeout options in slurm.conf.
    We use them to drain the stuck nodes and mail us, since, as in your case, stuck processes usually require a reboot. Because the drained strigger will never get triggered, we also set a finished trigger on the next RUNNING job. That trigger will either send us mail if there are only stuck processes left, or set strigger --fini on the next RUNNING job.
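
A minimal sketch of what such an UnkillableStepProgram could look like, written in Python. This is not Yair's actual script: the reason text and the assumption that the Slurm node name equals the short hostname are illustrative only, and the mail step is left out. In slurm.conf it would be referenced with something like UnkillableStepProgram=/path/to/this/script (a hypothetical path) together with an UnkillableStepTimeout value in seconds.

```python
import socket
import subprocess


def drain_command(node, reason="unkillable step"):
    """Build the scontrol call that drains one node.

    scontrol requires a Reason when a node is set to DRAIN.
    """
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


def main():
    # Assumption: the Slurm node name is the short hostname of this machine.
    node = socket.gethostname().split(".")[0]
    # Run scontrol and pass its exit status back to the caller.
    return subprocess.run(drain_command(node)).returncode
```

To use it as the actual program, the file would end with a call such as `raise SystemExit(main())`; the user running it needs permission to update node state.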
    
    
    
    
        Yair.
    
    
    
    
    On Tue, May 28, 2019 at 7:58 PM mercan <ahmet.mercan at uhem.itu.edu.tr> wrote:
    
    
    Hi;
    
    If you are not already using an epilog script, you can set one up to
    clean up whatever finished jobs leave behind:
    
    https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prolog-and-epilog-scripts
    
    Ahmet M.
    
    
    On 28.05.2019 at 19:03, Matthew BETTINGER wrote:
    > We use triggers for the obvious alerts, but is there a way to make a trigger for nodes stuck in CG (completing) state?  Some user jobs, mostly Julia notebooks, can get hung in completing state if the user kills the running job or cancels it with ctrl.  When this happens we can have many, many nodes stuck in CG.  Slurm 17.02.6.  Thanks!
    >
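
One way to approximate an alert for nodes in CG is to poll sinfo, for example from cron. The sketch below is only an illustration under that assumption, not a strigger feature; the function names are made up for the example. It asks sinfo for nodes in the completing state and returns their names. Since nodes pass through completing briefly during normal operation, a real check would probably flag a node only after it shows up in several consecutive polls.

```python
import subprocess


def parse_completing(sinfo_output):
    """Return sorted, unique node names whose state contains 'completing'.

    Expects lines of the form '<node> <state>', as printed by
    sinfo -h -N -o '%N %T'.
    """
    nodes = set()
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and "completing" in parts[1].lower():
            nodes.add(parts[0])
    return sorted(nodes)


def completing_nodes():
    """Query sinfo for nodes currently in the completing state."""
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-t", "completing", "-o", "%N %T"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_completing(result.stdout)
```

The parsing is kept separate from the sinfo call so it can be checked against sample output without a running cluster; the set removes duplicates that appear when a node belongs to more than one partition.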
    
    
    
    
    
    
    -- 
      /|       |
      \/       | Yair Yarom | Senior DevOps Architect
      []       | The Rachel and Selim Benin School
      [] /\    | of Computer Science and Engineering
      []//\\/  | The Hebrew University of Jerusalem
      [//  \\  | T +972-2-5494522 | F +972-2-5494522
      //    \  | irush at cs.huji.ac.il
     //        |
    
    
    
    


