[slurm-users] Grid engine slaughtering parallel jobs when any one of them fails
Christopher Samuel
chris at csamuel.org
Tue Apr 20 05:53:09 UTC 2021
Hi Robert,
On 4/16/21 12:39 pm, Robert Peck wrote:
> Please can anyone suggest how to instruct SLURM not to massacre ALL my
> jobs because ONE (or a few) node(s) fails?
You will also probably want this for your srun: --kill-on-bad-exit=0
What does the scontrol command below show?
scontrol show config | fgrep KillOnBadExit
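If it helps, here is a minimal sketch of scripting that check. It assumes scontrol is on your PATH and that the config line has the usual "Name = value" layout; the helper name kill_on_bad_exit_value is made up for illustration. It reads the config text from stdin, so it can be tried on pasted output without a cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): pull the KillOnBadExit value
# out of "scontrol show config" output supplied on stdin.
kill_on_bad_exit_value() {
    awk -F'=' '/^KillOnBadExit[[:space:]]*=/ {
        gsub(/[[:space:]]/, "", $2)   # strip padding around the value
        print $2
    }'
}

# On a cluster: only runs if scontrol is actually available.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | kill_on_bad_exit_value
fi
```

A value of 0 would mean steps are not killed by default when a task exits non-zero; anything else means the site default is to kill them, and the srun option is needed to override it.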
From the manual page:
-K, --kill-on-bad-exit[=0|1]
Controls whether or not to terminate a step if any task
exits with a non-zero exit code. If this option is not
specified, the default action will be based upon
the Slurm configuration parameter of KillOnBadExit.
If this option is specified, it will take precedence over
KillOnBadExit. An option argument of zero will not
terminate the job. A non-zero argument or no argument
will terminate the job. Note: This option takes
precedence over the -W, --wait option to terminate the
job immediately if a task exits with a non-zero exit
code. Since this option's argument is optional, for
proper parsing the single letter option must be followed
immediately with the value and not include a space between
them. For example "-K1" and not "-K 1".
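To illustrate the parsing note above, a rough sketch of the short form in use. The program name ./my_task is a placeholder, not something from your setup, and the command is only executed if srun is actually present.

```shell
#!/bin/sh
# Placeholder program name; substitute your own executable.
TASK=./my_task

# "-K0" (value attached, no space) is the short form of
# --kill-on-bad-exit=0. Writing "-K 0" would be parsed differently,
# because the option's argument is optional.
CMD="srun -K0 $TASK"

if command -v srun >/dev/null 2>&1; then
    $CMD
else
    echo "would run: $CMD"
fi
```

The long form, srun --kill-on-bad-exit=0 ./my_task, should behave the same and avoids the spacing pitfall.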
Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA