[slurm-users] KillOnBadExit or srun's -K: step, job, task, process all get a mention in dispatches

Kevin Buckley Kevin.Buckley at pawsey.org.au
Tue May 19 09:10:11 UTC 2020


I was actually looking at something else (tm) when I noticed that
two of our Slurm controlled resources had different config values
for KillOnBadExit, and so I went looking for clues.


I read this:

KillOnBadExit

     If set to 1, a step will be terminated immediately if any task is
     crashed or aborted, as indicated by a non-zero exit code.

     With the default value of 0, if one of the processes is crashed
     or aborted the other processes will continue to run while the
     crashed or aborted process waits.

     The user can override this configuration parameter by using srun's
     -K, --kill-on-bad-exit.


and thought that if, in my mind, I replaced "process(es)" with "task(s)",
it made sense, but of course, I had to go and RTFsrunM, didn't I, viz:


  -K, --kill-on-bad-exit[=0|1]

     Controls whether or not to terminate a step if any task exits with
     a non-zero exit code.

     If this option is not specified, the default action will be based
     upon the Slurm configuration parameter of KillOnBadExit. If this
     option is specified, it will take precedence over KillOnBadExit.

     An option argument of zero will not terminate the job. A non-zero
     argument or no argument will terminate the job.

     Note: This option takes precedence over the -W, --wait option to
     terminate the job immediately if a task exits with a non-zero exit
     code.

     Since this option's argument is optional, for proper parsing the
     single letter option must be followed immediately with the value
     and not include a space between them. For example "-K1" and not
     "-K 1".


so now we're talking about the "job", as well as a "step" within
a job?

Then again, one could read that as the config setting only bins the
step the task was in, but then the srun flag isn't overriding the
config setting (per-step), it's escalating the bin-on-any-failure
to the job level?

Then again, srun's "bare kill-on-bad-exit" might be thought of as
"overriding the config" but only to the extent that it can turn
a config (per-step) of 0 into a config (per-step) of 1, by being
there, but not the other way around, because there isn't any
--no-kill-on-bad-exit ?
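Taking the quoted srun text purely at face value, the override could be
sketched as below. This is only my reading of the man page wording, not
Slurm's actual code: the function name and the way the flag is represented
(None when -K is absent, an empty string for a bare -K) are made up for
illustration.

```python
from typing import Optional


def effective_kill_on_bad_exit(config_value: int, srun_arg: Optional[str]) -> bool:
    """Hypothetical model of the precedence described in the quoted man pages.

    config_value: the KillOnBadExit value from the Slurm configuration (0 or 1).
    srun_arg:     None  -> -K / --kill-on-bad-exit not given at all
                  ""    -> bare -K, no argument
                  "0", "1", ... -> explicit argument, as in -K0 or -K1
    """
    if srun_arg is None:
        # Option not specified: fall back to the configuration parameter.
        return config_value == 1
    if srun_arg == "":
        # "A non-zero argument or no argument will terminate"
        return True
    # "An option argument of zero will not terminate"
    return int(srun_arg) != 0


# Print every combination of config value and srun flag.
for config in (0, 1):
    for arg in (None, "", "0", "1"):
        print(config, repr(arg), effective_kill_on_bad_exit(config, arg))
```

If that reading holds, an explicit -K0 would turn a config of 1 into 0,
without needing a separate --no-kill-on-bad-exit, though none of this says
whether it is the step or the job that gets binned.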

And both of those suggest that the config can't be used to set a
kill of a whole job, only of a step. If you do want to kill the job,
the srun man-page points out you can use -W, but suggests that -K
will override that too.

So now I think I've gone two steps forwards, one job back, but where
am I really?


Is there a possible future, with a

TaskFailureAction = Ignore|KillStep|KillJob(|KillJobArray?)

config value, along with an associated

--task-failure-action=[0|1|2(|3)]

command-line option, in it, as that would seem to offer a clearer
"this overrides that" mapping?
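To make the suggested mapping concrete, here is a minimal sketch. Nothing
in it exists in Slurm: TaskFailureAction, its members, the numeric values
and the resolve() helper are all hypothetical, taken from the names floated
above.

```python
from enum import IntEnum
from typing import Optional, Union


class TaskFailureAction(IntEnum):
    """Hypothetical actions, numbered as in the suggested 0|1|2(|3) option."""
    IGNORE = 0
    KILL_STEP = 1
    KILL_JOB = 2
    KILL_JOB_ARRAY = 3


def resolve(config: TaskFailureAction,
            cli_value: Optional[Union[int, str]] = None) -> TaskFailureAction:
    """Return the action in force: the command line, if given, overrides the config."""
    if cli_value is None:
        # No command-line option: the configured action applies.
        return config
    # Command-line option given: it wins, whichever direction it moves in.
    return TaskFailureAction(int(cli_value))


print(resolve(TaskFailureAction.KILL_STEP).name)
print(resolve(TaskFailureAction.KILL_STEP, "0").name)
print(resolve(TaskFailureAction.IGNORE, 2).name)
```

With a single value naming the scope, the command line can move the setting
in either direction, and "this overrides that" needs no further reading
between the lines.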

Then again, as this wasn't what I was originally looking for/at,
maybe I've missed something.

Kevin Buckley
-- 
Supercomputing Systems Administrator
Pawsey Supercomputing Centre


