[slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)
rp1060 at york.ac.uk
Tue Apr 20 15:55:03 UTC 2021
Chris: thanks for that tip, I'm having a look at that now, it sounds
Run on the login node I get:
scontrol show config | fgrep KillOnBadExit
KillOnBadExit = 0
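
(Editorial sketch: a minimal way to pull just that value out in a script, assuming the "Name = value" layout shown above. The function name kill_on_bad_exit_setting is hypothetical, not part of Slurm.)

```shell
# Read `scontrol show config`-style text on stdin and print the value of
# KillOnBadExit. Assumes whitespace-separated "Name = value" lines.
kill_on_bad_exit_setting() {
    awk '$1 == "KillOnBadExit" && $2 == "=" { print $3 }'
}

# Only query the cluster when scontrol is actually available.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | kill_on_bad_exit_setting
fi
```

A value of 0 means the cluster-wide default is not to terminate a step when one task exits non-zero, per the manual page text quoted further down.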
I've tried to put -K0 into a job to see if that helps.
But doing it on the command line
sbatch -K0 job_name.job
gives an error
sbatch: invalid option -- 'K'
and putting the long form, --kill-on-bad-exit=0, at the top of the .job file gives the error
sbatch: unrecognized option '--kill-on-bad-exit=0'
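
(Editorial sketch: both errors are consistent with -K / --kill-on-bad-exit being an srun option, as in Chris's suggestion quoted below, rather than an sbatch one. Under that reading, the flag would go on the srun line inside the job script. The #SBATCH values, the run_step helper and ./my_program are hypothetical placeholders, and whether this cures the original problem is not established in this thread.)

```shell
#!/bin/bash
#SBATCH --nodes=4             # hypothetical: adjust to the real job
#SBATCH --ntasks-per-node=1   # hypothetical: the "one per node" layout

# Launch a job step with --kill-on-bad-exit=0 so that one task exiting
# non-zero does not make srun terminate the rest of the step.
run_step() {
    srun --kill-on-bad-exit=0 "$@"
}

# Guard so the script does nothing harmful when run outside Slurm.
if command -v srun >/dev/null 2>&1; then
    run_step ./my_program     # hypothetical program name
else
    echo "srun not found; submit this script with sbatch instead" >&2
fi
```

The script itself would then be submitted without any -K flag: sbatch job_name.job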
Loris, thanks also. If Chris's tip can't solve my issues I'll post more
detailed discussion of the software I'm working with, but it can get quite
confusing, and it took me a long time to get this software running on the
cluster even in the "one per node" form in which I currently use it, hence my
preference to tinker with SLURM settings rather than try to change the
On Tue, 20 Apr 2021 at 06:53, Christopher Samuel <chris at csamuel.org> wrote:
> Hi Robert,
> On 4/16/21 12:39 pm, Robert Peck wrote:
> > Please can anyone suggest how to instruct SLURM not to massacre ALL my
> > jobs because ONE (or a few) node(s) fails?
> You will also probably want this for your srun: --kill-on-bad-exit=0
> What does the scontrol command below show?
> scontrol show config | fgrep KillOnBadExit
> From the manual page:
> -K, --kill-on-bad-exit[=0|1]
> Controls whether or not to terminate a step if any task
> exits with a non-zero exit code. If this option is not
> specified, the default action will be based upon
> the Slurm configuration parameter of KillOnBadExit.
> If this option is specified, it will take precedence over
> KillOnBadExit. An option argument of zero will not
> terminate the job. A non-zero argument or no argument
> will terminate the job. Note: This option takes
> precedence over the -W, --wait option to terminate the
> job immediately if a task exits with a non-zero exit
> code. Since this option's argument is optional, for
> proper parsing the single letter option must be followed
> immediately with the value and not include a space between
> them. For example "-K1" and not "-K 1".
> Best of luck,
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
*Intelligent Systems and Nanoscience group*
*Department of Electronic Engineering*
*University of York*