<div dir="ltr">P.S. The Slurm version here is 20.02.3.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 20 Apr 2021 at 16:55, Robert Peck &lt;<a href="mailto:rp1060@york.ac.uk">rp1060@york.ac.uk</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Chris: thanks for that tip. I'm having a look at it now, and it sounds promising.<br><br>Running this on the login node, I get:<br>scontrol show config | fgrep KillOnBadExit<br>KillOnBadExit = 0<br><br>I've tried adding -K0 to a job to see if that helps.<br>But doing it on the command line<br>sbatch -K0 job_name.job<br>gives the error<br>sbatch: invalid option -- 'K'<br>and putting<br>#SBATCH --kill-on-bad-exit=0<br>at the top of the .job file gives the error<br>sbatch: unrecognized option '--kill-on-bad-exit=0'<div><br></div><div><br><br>Loris, thanks also. If Chris's tip can't solve my issues, I'll post a more detailed discussion of the software I'm working with. It can get quite confusing, though, and it took me a long time to get this software running on the cluster even in the "one per node" form I currently use, hence my preference to tinker with SLURM settings rather than try to change the software.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 20 Apr 2021 at 06:53, Christopher Samuel &lt;<a href="mailto:chris@csamuel.org" target="_blank">chris@csamuel.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Robert,<br>
<br>
On 4/16/21 12:39 pm, Robert Peck wrote:<br>
<br>
> Please can anyone suggest how to instruct SLURM not to massacre ALL my <br>
> jobs because ONE (or a few) node(s) fails?<br>
<br>
You will also probably want this for your srun: --kill-on-bad-exit=0<br>
<br>
What does the scontrol command below show?<br>
<br>
scontrol show config | fgrep KillOnBadExit<br>
<br>
From the manual page:<br>
<br>
-K, --kill-on-bad-exit[=0|1]<br>
Controls whether or not to terminate a step if any task<br>
exits with a non-zero exit code. If this option is not<br>
specified, the default action will be based upon<br>
the Slurm configuration parameter of KillOnBadExit.<br>
If this option is specified, it will take precedence over<br>
KillOnBadExit. An option argument of zero will not<br>
terminate the job. A non-zero argument or no argument<br>
will terminate the job. Note: This option takes<br>
precedence over the -W, --wait option to terminate the<br>
job immediately if a task exits with a non-zero exit<br>
code. Since this option's argument is optional, for<br>
proper parsing the single letter option must be followed<br>
immediately with the value and not include a space between<br>
them. For example "-K1" and not "-K 1".<br>
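<br>
A minimal sketch of where the option could go, assuming the job launches its tasks with srun inside the batch script. The manual excerpt above describes an srun option, and sbatch rejects it both on the command line and as an #SBATCH directive, so the sketch keeps it on the srun line. The file name example.job, the node count, and ./my_program are placeholders, not taken from the thread. Whether this also helps with node failures, as opposed to non-zero task exits, is not established here.<br>
<pre>
```shell
# Write a hypothetical job script, example.job, with the option on the srun line.
# All names and sizes below are placeholders for illustration only.
printf '%s\n' \
  '#!/bin/bash' \
  '#SBATCH --nodes=4' \
  '#SBATCH --ntasks-per-node=1' \
  '# --kill-on-bad-exit=0 (short form -K0) belongs to srun, not sbatch:' \
  '# a task exiting non-zero should then not terminate the step.' \
  'srun --kill-on-bad-exit=0 ./my_program' \
  > example.job

# The script is then submitted with plain sbatch, without any -K option.
echo "Submit with: sbatch example.job"
```
</pre>
With the short form, the value has to follow the letter directly, as the manual excerpt says: "srun -K0 ./my_program", not "srun -K 0 ./my_program".<br>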
<br>
<br>
Best of luck,<br>
Chris<br>
-- <br>
Chris Samuel : <a href="http://www.csamuel.org/" rel="noreferrer" target="_blank">http://www.csamuel.org/</a> : Berkeley, CA, USA<br>
<br>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr"><div dir="ltr"><div><div><font size="2">Thanks</font></div><div dir="ltr"><font size="2"><br></font></div><div dir="ltr"><font size="2">----------</font></div><div dir="ltr"><font size="2"><br></font></div><div dir="ltr"><font size="2">Robert Peck</font><div><font size="4" color="#cccccc"><i><br></i></font><div><i><font color="#cccccc">Robot Lab<br></font></i><div><i><font color="#cccccc">Intelligent Systems and Nanoscience group</font></i></div></div><div><i><font color="#cccccc">Department of Electronic Engineering</font></i></div><div><i><font color="#cccccc">University of York</font></i></div></div></div></div></div></div>
</blockquote></div>