<div dir="ltr"><div>Perhaps fire from srun with -vvv to get maximum verbose messages as srun fires through job.</div><div><br></div><div>Doug<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jan 31, 2019 at 12:07 PM Andy Riebs <<a href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
Hi All,<br>
<br>
Just checking to see if this sounds familiar to anyone.<br>
<br>
Environment:<br>
- CentOS 7.5 x86_64<br>
- Slurm 17.11.10 (but this also happened with 17.11.5)<br>
<br>
We typically run about 100 tests/night, selected from a handful of
favorites. For roughly 1 in 300 test runs, we see one of two
mysterious failures:<br>
<br>
1. The 5-minute cancellation<br>
<br>
A job will be rolling along, generating its expected output, and
then this message appears:<br>
<blockquote>srun: forcing job termination<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.<br>
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***<br>
srun: error: nodename: task 250: Terminated<br>
srun: Terminating job step 3531.0<br>
</blockquote>
sacct reports<br>
<blockquote><pre>
       JobID               Start                 End ExitCode      State
------------ ------------------- ------------------- -------- ----------
3418         2019-01-29T05:54:07 2019-01-29T05:59:16      0:9     FAILED
</pre></blockquote>
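(That report comes from something like the following invocation; the exact field list is a guess at what we pass to --format:)<br>
<pre>
sacct -j 3418 --format=JobID,Start,End,ExitCode,State
</pre>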
When these failures occur, they consistently strike just about 5 minutes into the
run.<br>
<br>
2. The random cancellation<br>
<br>
As above, a job will be generating the expected output, and then we
see<br>
<blockquote>srun: forcing job termination<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.<br>
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***<br>
srun: error: nodename: task 250: Terminated<br>
srun: Terminating job step 3531.0<br>
</blockquote>
But this time, sacct reports<br>
<blockquote><pre>
       JobID               Start                 End ExitCode      State
------------ ------------------- ------------------- -------- ----------
3531         2019-01-30T07:21:25 2019-01-30T07:35:50      0:0  COMPLETED
3531.0       2019-01-30T07:21:27 2019-01-30T07:35:56     0:15  CANCELLED
</pre></blockquote>
I think we've seen these cancellations pop up anywhere from a minute or
two into the test run to perhaps 20 minutes in.<br>
<br>
The only thing slightly unusual in our job submissions is that we
use srun's "--immediate=120" so that the scripts can respond
appropriately if a node goes down.<br>
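For reference, here's a stripped-down sketch of a typical submission (the node/task counts and test binary name are placeholders):<br>
<pre>
# --immediate=120: give up if the step cannot launch within 120 seconds,
# letting the wrapper script flag a down node instead of hanging forever.
srun --immediate=120 -N 16 -n 512 ./nightly_test
</pre>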
<br>
With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a
clue in the slurmctld or slurmd logs.<br>
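(Those levels were set via the usual slurm.conf knobs, i.e.:)<br>
<pre>
# slurm.conf excerpt -- debug levels used while chasing this problem
SlurmctldDebug=debug2
SlurmdDebug=debug5
</pre>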
<br>
Any thoughts on what might be happening, or what I might try next?<br>
<br>
Andy<br>
<br>
<pre class="gmail-m_-5512260739653731564moz-signature" cols="72">--
Andy Riebs
<a class="gmail-m_-5512260739653731564moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com" target="_blank">andy.riebs@hpe.com</a>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!
</pre>
</div>
</blockquote></div>
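<div><br></div><div>Something along these lines, perhaps (a minimal sketch; the node count, test binary, and log file name are placeholders):</div><div><pre>
# Re-run one of the failing tests with maximum client-side verbosity,
# capturing srun's debug messages on stderr for later inspection.
srun -vvv -N 4 ./nightly_test 2> srun-debug.log
</pre></div><div>Each extra -v bumps the verbosity one level, so -vvv should show every step of srun's launch, I/O, and termination handling.</div>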