[slurm-users] Grid engine slaughtering parallel jobs when any one of them fails

Christopher Samuel chris at csamuel.org
Tue Apr 20 05:53:09 UTC 2021


Hi Robert,

On 4/16/21 12:39 pm, Robert Peck wrote:

> Please can anyone suggest how to instruct SLURM not to massacre ALL my 
> jobs because ONE (or a few) node(s) fails?

You will also probably want this for your srun: --kill-on-bad-exit=0

What does the scontrol command below show?

scontrol show config | fgrep KillOnBadExit
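
If you want to use that value in a script, here is a minimal sketch that pulls just the setting out of the config dump. The function name is made up for illustration, and it assumes the usual "Key = value" layout of the scontrol output:

```shell
# Hypothetical helper: read `scontrol show config` output on stdin and
# print only the value of KillOnBadExit (assumes "Key = value" lines).
kill_on_bad_exit_value() {
    awk -F'=' '$1 ~ /^KillOnBadExit[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# On a cluster with Slurm installed:
#   scontrol show config | kill_on_bad_exit_value
```

A result of 0 would mean steps are not killed by default when a task exits non-zero; 1 would mean they are.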

From the manual page:

        -K, --kill-on-bad-exit[=0|1]
               Controls whether or not to terminate a step if any task
               exits with a non-zero exit code. If this option is not
               specified, the default action will be based upon the
               Slurm configuration parameter of KillOnBadExit. If this
               option is specified, it will take precedence over
               KillOnBadExit. An option argument of zero will not
               terminate the job. A non-zero argument or no argument
               will terminate the job. Note: This option takes
               precedence over the -W, --wait option to terminate the
               job immediately if a task exits with a non-zero exit
               code. Since this option's argument is optional, for
               proper parsing the single letter option must be followed
               immediately with the value and not include a space between
               them. For example "-K1" and not "-K 1".
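
To spell out the precedence rules in that excerpt, here is a rough model written as a shell function. It is only an illustration of the text above, not Slurm code; the function name and its two arguments are assumptions of mine:

```shell
# Illustrative model only (not Slurm code): print 1 if a step would be
# terminated when a task exits non-zero, 0 otherwise.
#   $1 = cluster-wide KillOnBadExit value (0 or 1)
#   $2 = the srun option as typed, or empty if it was not given
would_kill_step() {
    case "$2" in
        "")                          # option absent: fall back to the config value
            if [ "$1" = "0" ]; then echo 0; else echo 1; fi ;;
        -K|--kill-on-bad-exit)       echo 1 ;;   # no argument means terminate
        -K0|--kill-on-bad-exit=0)    echo 0 ;;   # zero means do not terminate
        -K*|--kill-on-bad-exit=*)    echo 1 ;;   # any other argument terminates
        *)  echo "unrecognised option: $2" >&2; return 1 ;;
    esac
}

would_kill_step 1 --kill-on-bad-exit=0   # option overrides the config value
would_kill_step 0 -K                     # bare -K overrides the config value
```

So even with KillOnBadExit=1 in the cluster configuration, passing --kill-on-bad-exit=0 (or -K0, with no space) to srun should leave the remaining tasks running.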


Best of luck,
Chris
-- 
   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


