[slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

Robert Peck rp1060 at york.ac.uk
Tue Apr 20 15:55:03 UTC 2021


Chris: thanks for that tip. I'm having a look at it now; it sounds
promising.

Running this on the login node, I get:
scontrol show config | fgrep KillOnBadExit
KillOnBadExit           = 0

I've tried adding -K0 to a job to see if that helps.
But passing it on the command line
sbatch -K0 job_name.job
gives an error
sbatch: invalid option -- 'K'
and putting
#SBATCH --kill-on-bad-exit=0
in the top of the .job file gives the error
sbatch: unrecognized option '--kill-on-bad-exit=0'
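
Both errors are consistent with the manual excerpt quoted below, which lists -K, --kill-on-bad-exit as an option to srun, so the flag presumably belongs on the srun line inside the job file rather than on sbatch or an #SBATCH line. A minimal sketch, assuming the job launches its tasks with srun; the file name kill_test.job, the task count and ./my_program are placeholders, not taken from the real job:

```shell
#!/bin/bash
# Write a small job file. kill_test.job, the task count and ./my_program
# are placeholders for illustration only.
cat > kill_test.job <<'EOF'
#!/bin/bash
#SBATCH --job-name=kill-test
#SBATCH --ntasks=4

# --kill-on-bad-exit is documented as an srun option, so it is passed to
# srun here rather than to sbatch or an #SBATCH line.
srun --kill-on-bad-exit=0 ./my_program
EOF

# Syntax-check the generated script without running it.
bash -n kill_test.job

# Submit on a machine where Slurm is installed:
#   sbatch kill_test.job
```

Note that with KillOnBadExit = 0 already set in the cluster configuration, the manual text implies that srun defaults to not terminating the step, so the explicit flag may not change the behaviour seen so far.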





Loris, thanks also. If Chris's tip can't solve my issues I'll post a more
detailed discussion of the software I'm working with, but it can get quite
confusing, and it took me a long time to get this software running on the
cluster even in the "one per node" form in which I currently use it, hence my
preference to tinker with SLURM settings rather than try to change the
software.

On Tue, 20 Apr 2021 at 06:53, Christopher Samuel <chris at csamuel.org> wrote:

> Hi Robert,
>
> On 4/16/21 12:39 pm, Robert Peck wrote:
>
> > Please can anyone suggest how to instruct SLURM not to massacre ALL my
> > jobs because ONE (or a few) node(s) fails?
>
> You will also probably want this for your srun: --kill-on-bad-exit=0
>
> What does the scontrol command below show?
>
> scontrol show config | fgrep KillOnBadExit
>
> From the manual page:
>
>         -K, --kill-on-bad-exit[=0|1]
>                Controls whether or not to terminate a step if any task
>                exits with a non-zero exit code. If this option is not
>                specified, the default action will  be  based  upon
>                the  Slurm  configuration parameter of KillOnBadExit.
>                If this option is specified, it will take precedence over
>                KillOnBadExit. An option argument of zero will not
>                terminate the job. A non-zero argument or no argument
>                will terminate the job.  Note: This option takes
>                precedence over the -W, --wait option to terminate the
>                job immediately  if  a  task  exits with a non-zero exit
>                code.  Since this option's argument is optional, for
>                proper parsing the single letter option must be followed
>                immediately with the value and not include a space between
>                them. For example "-K1" and not "-K 1".
>
>
> Best of luck,
> Chris
> --
>    Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


-- 
Thanks

----------

Robert Peck


*Robot Lab*
*Intelligent Systems and Nanoscience group*
*Department of Electronic Engineering*
*University of York*