[slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

Robert Peck rp1060 at york.ac.uk
Tue Apr 20 15:57:29 UTC 2021


P.S. the slurm version here is 20.02.3

On Tue, 20 Apr 2021 at 16:55, Robert Peck <rp1060 at york.ac.uk> wrote:

> Chris: thanks for that tip, I'm having a look at that now, it sounds
> promising.
>
> Run on the login node I get:
> scontrol show config | fgrep KillOnBadExit
> KillOnBadExit           = 0
>
> I've tried putting -K0 into a job to see if that helps.
> But doing it on the command line
> sbatch -K0 job_name.job
> gives an error
> sbatch: invalid option -- 'K'
> and putting
> #SBATCH --kill-on-bad-exit=0
> in the top of the .job file gives the error
> sbatch: unrecognized option '--kill-on-bad-exit=0'
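
Both errors are consistent with -K / --kill-on-bad-exit being an srun option rather than an sbatch one, which matches Chris's wording ("for your srun"). A minimal sketch of where the flag would go, assuming the job launches its tasks through srun; the file name, task count, and ./my_program are placeholders, not taken from this thread:

```shell
#!/bin/bash
# Write a hypothetical job file. The job name, --ntasks value, and
# ./my_program are illustrative assumptions only.
cat > kill_on_bad_exit_example.job <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=4

# --kill-on-bad-exit belongs on the srun line that launches the step,
# not on the sbatch command line or in an #SBATCH directive.
# If srun is called from inside other software, the srun man page lists
# the input environment variable SLURM_KILL_BAD_EXIT as equivalent:
#   export SLURM_KILL_BAD_EXIT=0
srun --kill-on-bad-exit=0 ./my_program
EOF

# Submit the generated file with: sbatch kill_on_bad_exit_example.job
```

Whether this helps depends on how the tasks are actually started; if the software uses mpirun or its own launcher instead of srun, the flag would not apply.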
>
> Loris, thanks also. If Chris's tip can't solve my issues I'll post a more
> detailed discussion of the software I'm working with, but it can get quite
> confusing, and it took me a long time to get this software running on the
> cluster even in the "one per node" form I currently use, hence my
> preference to tinker with SLURM settings rather than try to change the
> software.
>
> On Tue, 20 Apr 2021 at 06:53, Christopher Samuel <chris at csamuel.org>
> wrote:
>
>> Hi Robert,
>>
>> On 4/16/21 12:39 pm, Robert Peck wrote:
>>
>> > Please can anyone suggest how to instruct SLURM not to massacre ALL my
>> > jobs because ONE (or a few) node(s) fails?
>>
>> You will also probably want this for your srun: --kill-on-bad-exit=0
>>
>> What does the scontrol command below show?
>>
>> scontrol show config | fgrep KillOnBadExit
>>
>> From the manual page:
>>
>>         -K, --kill-on-bad-exit[=0|1]
>>                Controls whether or not to terminate a step if any task
>>                exits with a non-zero exit code. If this option is not
>>                specified, the default action will be based upon the
>>                Slurm configuration parameter of KillOnBadExit. If this
>>                option is specified, it will take precedence over
>>                KillOnBadExit. An option argument of zero will not
>>                terminate the job. A non-zero argument or no argument
>>                will terminate the job. Note: This option takes
>>                precedence over the -W, --wait option to terminate the
>>                job immediately if a task exits with a non-zero exit
>>                code. Since this option's argument is optional, for
>>                proper parsing the single letter option must be followed
>>                immediately with the value and not include a space between
>>                them. For example "-K1" and not "-K 1".
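
The precedence rule in that man page excerpt can be sketched as a small shell function. This is a toy model of the described behaviour, not Slurm's actual option parser; the function name effective_kill_on_bad_exit and its calling convention are made up for illustration:

```shell
#!/bin/bash
# Toy model (assumption: not Slurm code) of the rule quoted above.
# usage: effective_kill_on_bad_exit CONFIG_VALUE [SRUN_ARG...]
#   CONFIG_VALUE is the KillOnBadExit value from "scontrol show config".
effective_kill_on_bad_exit() {
    kobe_effective="$1"   # start from the cluster-wide setting
    shift
    for kobe_arg in "$@"; do
        case "$kobe_arg" in
            # Option given with no argument: treated as "terminate".
            -K|--kill-on-bad-exit) kobe_effective=1 ;;
            # Short form with the value attached, e.g. -K0 or -K1.
            -K*) kobe_effective="${kobe_arg#-K}" ;;
            # Long form with =value.
            --kill-on-bad-exit=*) kobe_effective="${kobe_arg#--kill-on-bad-exit=}" ;;
        esac
    done
    if [ "$kobe_effective" != "0" ]; then
        echo terminate
    else
        echo continue
    fi
}

effective_kill_on_bad_exit 0                          # prints "continue"
effective_kill_on_bad_exit 0 -K1                      # prints "terminate"
effective_kill_on_bad_exit 1 --kill-on-bad-exit=0     # prints "continue"
effective_kill_on_bad_exit 0 -K                       # prints "terminate"
```

The last call shows why "-K 1" is discouraged: a bare -K already counts as "no argument", so the following 1 would not be read as its value.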
>>
>>
>> Best of luck,
>> Chris
>> --
>>    Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>>
>
>
> --
> Thanks
>
> ----------
>
> Robert Peck
>
>
> *Robot Lab*
> *Intelligent Systems and Nanoscience group*
> *Department of Electronic Engineering*
> *University of York*
>

