[slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

Robert Peck rp1060 at york.ac.uk
Tue Apr 20 16:26:11 UTC 2021


Submission did succeed once I put -K0 on the srun command inside my job
script, though. It will be a while before my job runs, so I won't know for a
little while whether overriding KillOnBadExit has helped.
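
A minimal sketch of that arrangement, for anyone finding this thread later.
The resource requests, the script name run_one_task.sh and the helper
kill_on_bad_exit_default are hypothetical placeholders, not taken from my real
job. The option goes on srun because sbatch rejects it, as the errors quoted
below show. The "command -v srun" guard is only there so the sketch can be
dry-run on a machine without Slurm.

```shell
#!/bin/bash
#SBATCH --nodes=4              # hypothetical resource request
#SBATCH --ntasks-per-node=1    # one task per node

# Hypothetical helper: read `scontrol show config` output on stdin and
# print the site-wide KillOnBadExit value (0 or 1).
kill_on_bad_exit_default() {
    awk -F'=' '/^KillOnBadExit/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

if command -v srun >/dev/null 2>&1; then
    # Log the cluster default for reference.
    echo "Cluster KillOnBadExit: $(scontrol show config | kill_on_bad_exit_default)"

    # Ask Slurm not to terminate the whole step when one task exits
    # non-zero. Equivalent short form: srun -K0 (no space before the 0).
    srun --kill-on-bad-exit=0 ./run_one_task.sh
fi
```

Per the manual page text quoted further down, the srun option takes
precedence over the KillOnBadExit configuration parameter.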

On Tue, 20 Apr 2021 at 16:57, Robert Peck <rp1060 at york.ac.uk> wrote:

> P.S. the slurm version here is 20.02.3
>
> On Tue, 20 Apr 2021 at 16:55, Robert Peck <rp1060 at york.ac.uk> wrote:
>
>> Chris: thanks for that tip, I'm having a look at that now, it sounds
>> promising.
>>
>> Running this on the login node, I get:
>> scontrol show config | fgrep KillOnBadExit
>> KillOnBadExit           = 0
>>
>> I've tried putting -K0 into a job to see if that helps.
>> But passing it on the command line
>> sbatch -K0 job_name.job
>> gives an error
>> sbatch: invalid option -- 'K'
>> and putting
>> #SBATCH --kill-on-bad-exit=0
>> at the top of the .job file gives the error
>> sbatch: unrecognized option '--kill-on-bad-exit=0'
>>
>> Loris, thanks also. If Chris's tip can't solve my issues I'll post a more
>> detailed discussion of the software I'm working with, but it can get quite
>> confusing, and it took me a long time to get this software running on the
>> cluster even in the "one per node" form in which I currently use it, hence
>> my preference to tinker with SLURM settings rather than try to change the
>> software.
>>
>> On Tue, 20 Apr 2021 at 06:53, Christopher Samuel <chris at csamuel.org>
>> wrote:
>>
>>> Hi Robert,
>>>
>>> On 4/16/21 12:39 pm, Robert Peck wrote:
>>>
>>> > Please can anyone suggest how to instruct SLURM not to massacre ALL my
>>> > jobs because ONE (or a few) node(s) fails?
>>>
>>> You will also probably want this for your srun: --kill-on-bad-exit=0
>>>
>>> What does the scontrol command below show?
>>>
>>> scontrol show config | fgrep KillOnBadExit
>>>
>>> From the manual page:
>>>
>>>         -K, --kill-on-bad-exit[=0|1]
>>>                Controls whether or not to terminate a step if any task
>>>                exits with a non-zero exit code. If this option is not
>>>                specified, the default action will  be  based  upon
>>>                the  Slurm  configuration parameter of KillOnBadExit.
>>>                If this option is specified, it will take precedence over
>>>                KillOnBadExit. An option argument of zero will not
>>>                terminate the job. A non-zero argument or no argument
>>>                will terminate the job.  Note: This option takes
>>>                precedence over the -W, --wait option to terminate the
>>>                job immediately  if  a  task  exits with a non-zero exit
>>>                code.  Since this option's argument is optional, for
>>>                proper parsing the single letter option must be followed
>>>                immediately with the value and not include a space between
>>>                them. For example "-K1" and not "-K 1".
>>>
>>>
>>> Best of luck,
>>> Chris
>>> --
>>>    Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>>>
>>
>>
>
>


-- 
Thanks

----------

Robert Peck


*Robot Lab*
*Intelligent Systems and Nanoscience group*
*Department of Electronic Engineering*
*University of York*