<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body smarttemplateinserted="true">
Hi All,<br>
<br>
Just checking to see if this sounds familiar to anyone.<br>
<br>
Environment:<br>
- CentOS 7.5 x86_64<br>
- Slurm 17.11.10 (but this also happened with 17.11.5)<br>
<br>
We typically run about 100 tests/night, selected from a handful of
favorites. For roughly 1 in 300 test runs, we see one of two
mysterious failures:<br>
<br>
1. The 5 minute cancellation<br>
<br>
A job will be rolling along, generating its expected output, and
then this message appears:<br>
<blockquote>srun: forcing job termination<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.<br>
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT
2019-01-30T07:35:50 ***<br>
srun: error: nodename: task 250: Terminated<br>
srun: Terminating job step 3531.0<br>
</blockquote>
sacct reports<br>
<blockquote><tt> JobID Start End
ExitCode State </tt><br>
<tt>------------ ------------------- ------------------- --------
---------- </tt><br>
<tt>3418 2019-01-29T05:54:07 2019-01-29T05:59:16
0:9 FAILED</tt><br>
</blockquote>
When these failures occur, they consistently strike just about 5
minutes into the run.<br>
<br>
2. The random cancellation<br>
<br>
As above, a job will be generating the expected output, and then we
see<br>
<blockquote>srun: forcing job termination<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.<br>
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT
2019-01-30T07:35:50 ***<br>
srun: error: nodename: task 250: Terminated<br>
srun: Terminating job step 3531.0<br>
</blockquote>
But this time, sacct reports<br>
<blockquote><tt> JobID Start End
ExitCode State </tt><br>
<tt>------------ ------------------- ------------------- --------
---------- </tt><br>
<tt>3531 2019-01-30T07:21:25 2019-01-30T07:35:50 0:0
COMPLETED </tt><br>
<tt>3531.0 2019-01-30T07:21:27 2019-01-30T07:35:56 0:15
CANCELLED </tt><br>
</blockquote>
I think we've seen these cancellations pop up anywhere from a minute
or two into the test run to perhaps 20 minutes in.<br>
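(For reference, a sacct invocation along these lines would reproduce the columns shown above; the job ID is from the second example, and the exact options we script may differ:)<br>

```shell
# Hypothetical query; -j and --format are standard sacct options.
sacct -j 3531 --format=JobID,Start,End,ExitCode,State
```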
<br>
The only thing slightly unusual in our job submissions is that we
use srun's "--immediate=120" option so that our scripts can respond
appropriately if a node goes down.<br>
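(For context, a minimal sketch of that submission pattern; the script
and task names here are made up, and the real wrappers are more
involved:)<br>

```shell
#!/bin/sh
# With --immediate=120, srun gives up if resources aren't allocated
# within 120 seconds, rather than waiting indefinitely, so the wrapper
# can detect a downed node from the exit status.
srun --immediate=120 ./run_test   # "run_test" is a hypothetical name
rc=$?
if [ "$rc" -ne 0 ]; then
    echo "srun exited with $rc; a node may be down" >&2
fi
exit "$rc"
```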
<br>
With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a
clue in the slurmctld or slurmd logs.<br>
<br>
Any thoughts on what might be happening, or what I might try next?<br>
<br>
Andy<br>
<br>
<pre class="moz-signature" cols="72">--
Andy Riebs
<a class="moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!
</pre>
</body>
</html>