[slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

Matthew BETTINGER matthew.bettinger at external.total.com
Wed Oct 9 13:58:12 UTC 2019


Just curious whether this option, or the OOM setting (which we use), can leave nodes in the CG ("completing") state.  We see nodes stuck in CG quite often, and the only way out is to reboot the node.  I believe it occurs when the parent process dies, gets killed, or ends up in Z (zombie) state?  Thanks.

MB

On 10/8/19, 6:11 AM, "slurm-users on behalf of Bjørn-Helge Mevik" <slurm-users-bounces at lists.schedmd.com on behalf of b.h.mevik at usit.uio.no> wrote:

    Marcus Boden <mboden at gwdg.de> writes:
    
    > you're looking for KillOnBadExit in the slurm.conf:
    > KillOnBadExit
    
    [...]
    
    > this should terminate the job if a step or a process gets oom-killed.
    
    That is a good tip!
    
    But as I read the documentation (I haven't tested it), it will only kill
    the job step itself, it will not kill the whole job.  Also, it will only
    have effect for things started with srun, mpirun or similar.  However,
    in combination with "set -o errexit", I believe most OOM kills would get
    the job itself terminated.
    
    -- 
    Regards,
    Bjørn-Helge Mevik, dr. scient,
    Department for Research Computing, University of Oslo
    

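The "set -o errexit" suggestion in the quoted message can be tried without a cluster. The following is only a rough sketch under stated assumptions: the batch script contents, the 1G memory value and the "srun ./my_program" step mentioned in the comment are hypothetical, and a subprocess exiting with status 137 (128 + SIGKILL) stands in for a step that was OOM-killed. The exit code Slurm actually reports for such a step may differ; the point is only that it is non-zero.

```shell
#!/bin/bash
# Write a small stand-in for a batch script to a temporary file.
# The #SBATCH line is only a comment to bash; it is shown for context.
job_script=$(mktemp)
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --mem-per-cpu=1G

# Abort the whole script as soon as any command returns non-zero.
set -o errexit

echo "step 1"
# Stand-in for an OOM-killed step; in a real job this line would be
# something like: srun ./my_program   (hypothetical program name)
sh -c 'exit 137'
echo "step 2"   # not reached once the line above has failed
EOF

# Run the script as a separate process and capture its output and status.
status=0
output=$(bash "$job_script") || status=$?
rm -f "$job_script"

echo "script exit status: $status"
echo "script output: $output"
```

Without the "set -o errexit" line, the script would carry on to "step 2" and finish with the status of its last command, so the failed step would be easy to miss. With it, the script stops at the failed step and passes that step's non-zero status on as its own exit status.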

