[slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)
rp1060 at york.ac.uk
Tue Apr 20 16:26:11 UTC 2021
Submission did succeed once I put -K0 on the srun command inside my
job script. It will be a while before the job runs, so I won't know
for a little while whether overriding KillOnBadExit has helped.
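A minimal sketch of that arrangement, for anyone hitting the same sbatch errors: the -K0 flag goes on the srun line inside the job script, not on the sbatch command line or in an #SBATCH directive. The file name, task count and program name below are placeholders for illustration, not the real job.

```shell
#!/bin/bash
# Write a hypothetical job script; the file name, task count and
# program name are placeholders, not taken from the real job.
job_file="${TMPDIR:-/tmp}/example_k0.job"

cat > "$job_file" <<'EOF'
#!/bin/bash
#SBATCH --ntasks=4
# -K0 is passed to srun (sbatch rejected it in this thread),
# written with no space between -K and the value.
srun -K0 ./my_program
EOF

echo "wrote $job_file"
# On a cluster, submit it with: sbatch "$job_file"
```

Writing the script through a here-document only keeps the sketch self-contained; the same lines can simply be saved by hand into a .job file.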
On Tue, 20 Apr 2021 at 16:57, Robert Peck <rp1060 at york.ac.uk> wrote:
> P.S. the slurm version here is 20.02.3
> On Tue, 20 Apr 2021 at 16:55, Robert Peck <rp1060 at york.ac.uk> wrote:
>> Chris: thanks for that tip, I'm having a look at that now, it sounds
>> Run on the login node I get:
>> scontrol show config | fgrep KillOnBadExit
>> KillOnBadExit = 0
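The same check can be scripted. This is a minimal sketch, assuming the "Key = value" layout of the output quoted above; the function name kill_on_bad_exit_default is made up for illustration, and the fallback line is copied from this thread so the snippet still runs where Slurm is not installed.

```shell
# Read `scontrol show config`-style text on stdin and print the
# KillOnBadExit value (hypothetical helper, not part of Slurm).
kill_on_bad_exit_default() {
    awk -F'=' '/^[ \t]*KillOnBadExit[ \t]*=/ {
        gsub(/[ \t\r]/, "", $2)   # strip padding around the value
        print $2
    }'
}

if command -v scontrol >/dev/null 2>&1; then
    default=$(scontrol show config | kill_on_bad_exit_default)
else
    # No Slurm here: fall back to the line quoted in this thread.
    default=$(printf 'KillOnBadExit           = 0\n' | kill_on_bad_exit_default)
fi

echo "cluster default KillOnBadExit=${default}"
```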
>> I've tried to put -K0 into a job to see if that helps.
>> But doing it on the command line
>> sbatch -K0 job_name.job
>> gives an error
>> sbatch: invalid option -- 'K'
>> and putting
>> #SBATCH --kill-on-bad-exit=0
>> in the top of the .job file gives the error
>> sbatch: unrecognized option '--kill-on-bad-exit=0'
>> Loris, thanks also. If Chris's tip can't solve my issues I'll post more
>> detailed discussions of the software I'm working with, but it can get quite
>> confusing, and it took me a long time to get this software running on the
>> cluster even in the "one per node" form in which I currently use it, hence my
>> preference to tinker with SLURM settings rather than try to change the
>> On Tue, 20 Apr 2021 at 06:53, Christopher Samuel <chris at csamuel.org> wrote:
>>> Hi Robert,
>>> On 4/16/21 12:39 pm, Robert Peck wrote:
>>> > Please can anyone suggest how to instruct SLURM not to massacre ALL my
>>> > jobs because ONE (or a few) node(s) fails?
>>> You will also probably want this for your srun: --kill-on-bad-exit=0
>>> What does the scontrol command below show?
>>> scontrol show config | fgrep KillOnBadExit
>>> From the manual page:
>>> -K, --kill-on-bad-exit[=0|1]
>>> Controls whether or not to terminate a step if any task
>>> exits with a non-zero exit code. If this option is not
>>> specified, the default action will be based upon
>>> the Slurm configuration parameter of KillOnBadExit.
>>> If this option is specified, it will take precedence over
>>> KillOnBadExit. An option argument of zero will not
>>> terminate the job. A non-zero argument or no argument
>>> will terminate the job. Note: This option takes
>>> precedence over the -W, --wait option to terminate the
>>> job immediately if a task exits with a non-zero exit
>>> code. Since this option's argument is optional, for
>>> proper parsing the single letter option must be followed
>>> immediately with the value and not include a space between
>>> them. For example "-K1" and not "-K 1".
>>> Best of luck,
>>> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
>> Robert Peck
>> *Robot Lab*
>> *Intelligent Systems and Nanoscience group*
>> *Department of Electronic Engineering*
>> *University of York*