<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
pre
{mso-style-priority:99;
mso-style-link:"HTML Preformatted Char";
margin:0in;
margin-bottom:.0001pt;
font-size:10.0pt;
font-family:"Courier New";}
tt
{mso-style-priority:99;
font-family:"Courier New";}
span.HTMLPreformattedChar
{mso-style-name:"HTML Preformatted Char";
mso-style-priority:99;
mso-style-link:"HTML Preformatted";
font-family:Consolas;
mso-fareast-language:EN-GB;}
span.EmailStyle20
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;
mso-fareast-language:EN-US;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-GB" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">Given the sheer volume of output that this will generate across potentially a couple hundred job runs, I was hoping that someone would
say “Seen it, here’s how to fix it.” I guess I’ll have to go the “high output” route.<o:p></o:p></span></p>
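<p class="MsoNormal">One way to keep that volume manageable, sketched below in Python as an untested assumption rather than anything from our actual test scripts: send srun's stderr to a per-run log and keep the log only when the run exits non-zero. The helper names (run_keep_log_on_failure, srun_verbose) and the log path are hypothetical, and the sketch assumes the -vvv diagnostics arrive on stderr.<o:p></o:p></p>
<pre>
```python
import subprocess
from pathlib import Path


def run_keep_log_on_failure(cmd, log_path):
    """Run cmd with stderr sent to log_path; delete the log if cmd exits 0.

    Returns the exit status of cmd.
    """
    log_path = Path(log_path)
    with log_path.open("wb") as log:
        # stdout is left alone so the normal test output is unaffected.
        result = subprocess.run(cmd, stderr=log)
    if result.returncode == 0:
        log_path.unlink()  # clean run: discard the verbose output
    return result.returncode


def srun_verbose(srun_args, log_path):
    # Hypothetical helper: prepends "srun -vvv" to the usual srun arguments.
    return run_keep_log_on_failure(["srun", "-vvv", *srun_args], log_path)
```
</pre>
<p class="MsoNormal">For example, srun_verbose(["--immediate=120", "./my_test"], "run-0001.log") would leave run-0001.log behind only for a failing run; ./my_test is a placeholder. A run that exits 0 despite a cancelled step would not be caught this way.<o:p></o:p></p>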
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">Thanks Doug!<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US">Andy<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D;mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif">From:</span></b><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif"> slurm-users [mailto:slurm-users-bounces@lists.schedmd.com]
<b>On Behalf Of </b>Doug Meyer<br>
<b>Sent:</b> Thursday, January 31, 2019 8:46 PM<br>
<b>To:</b> Slurm User Community List &lt;slurm-users@lists.schedmd.com&gt;<br>
<b>Subject:</b> Re: [slurm-users] Mysterious job terminations on Slurm 17.11.10<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">Perhaps launch the job from srun with -vvv to get the most verbose messages as srun works through the job.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Doug<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Thu, Jan 31, 2019 at 12:07 PM Andy Riebs &lt;<a href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>&gt; wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">
<div>
<p class="MsoNormal">Hi All,<br>
<br>
Just checking to see if this sounds familiar to anyone.<br>
<br>
Environment:<br>
- CentOS 7.5 x86_64<br>
- Slurm 17.11.10 (but this also happened with 17.11.5)<br>
<br>
We typically run about 100 tests/night, selected from a handful of favorites. For roughly 1 in 300 test runs, we see one of two mysterious failures:<br>
<br>
1. The 5-minute cancellation<br>
<br>
A job will be rolling along, generating its expected output, and then this message appears:<o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">srun: forcing job termination<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.<br>
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***<br>
srun: error: nodename: task 250: Terminated<br>
srun: Terminating job step 3531.0<o:p></o:p></p>
</blockquote>
<p class="MsoNormal">sacct reports<o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<pre>       JobID               Start                 End ExitCode      State
------------ ------------------- ------------------- -------- ----------
3418         2019-01-29T05:54:07 2019-01-29T05:59:16      0:9     FAILED</pre>
</blockquote>
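<p class="MsoNormal">As a quick check of the timing, here is a minimal Python sketch (the helper name elapsed_seconds is made up for illustration) that computes the run time from the Start and End fields reported for job 3418:<o:p></o:p></p>
<pre>
```python
from datetime import datetime

SACCT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def elapsed_seconds(start, end):
    """Return whole seconds between two sacct-style timestamps."""
    t0 = datetime.strptime(start, SACCT_TIME_FORMAT)
    t1 = datetime.strptime(end, SACCT_TIME_FORMAT)
    return int((t1 - t0).total_seconds())


# Start and End of job 3418, copied from the sacct output above.
secs = elapsed_seconds("2019-01-29T05:54:07", "2019-01-29T05:59:16")
print(divmod(secs, 60))  # (5, 9): 5 minutes 9 seconds
```
</pre>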
<p class="MsoNormal">When these failures occur, they consistently happen about 5 minutes into the run.<br>
<br>
2. The random cancellation<br>
<br>
As above, a job will be generating the expected output, and then we see<o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal">srun: forcing job termination<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.<br>
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***<br>
srun: error: nodename: task 250: Terminated<br>
srun: Terminating job step 3531.0<o:p></o:p></p>
</blockquote>
<p class="MsoNormal">But this time, sacct reports<o:p></o:p></p>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<pre>       JobID               Start                 End ExitCode      State
------------ ------------------- ------------------- -------- ----------
3531         2019-01-30T07:21:25 2019-01-30T07:35:50      0:0  COMPLETED
3531.0       2019-01-30T07:21:27 2019-01-30T07:35:56     0:15  CANCELLED</pre>
</blockquote>
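<p class="MsoNormal">For reading the ExitCode column in both sacct listings: sacct prints it as the exit status, a colon, then the number of the signal that ended the process, if any. A minimal Python sketch, assuming a Linux host where signals 9 and 15 are SIGKILL and SIGTERM (decode_exit_code is a hypothetical helper, not a Slurm API):<o:p></o:p></p>
<pre>
```python
import signal


def decode_exit_code(field):
    """Split a sacct ExitCode field ("status:signal") into readable parts."""
    status_text, signal_text = field.split(":")
    signum = int(signal_text)
    # Signal 0 means the process was not ended by a signal.
    name = signal.Signals(signum).name if signum else None
    return int(status_text), name


# ExitCode values taken from the two sacct listings in this message.
for field in ("0:9", "0:15", "0:0"):
    print(field, decode_exit_code(field))
```
</pre>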
<p class="MsoNormal">I think we've seen these cancellations pop up anywhere from a minute or two into the test run to perhaps 20 minutes in.<br>
<br>
The only thing slightly unusual in our job submissions is that we use srun's "--immediate=120" so that the scripts can respond appropriately if a node goes down.<br>
<br>
With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a clue in the slurmctld or slurmd logs.<br>
<br>
Any thoughts on what might be happening, or what I might try next?<br>
<br>
Andy<br>
<br>
<br>
<o:p></o:p></p>
<pre>-- <o:p></o:p></pre>
<pre>Andy Riebs<o:p></o:p></pre>
<pre><a href="mailto:andy.riebs@hpe.com" target="_blank">andy.riebs@hpe.com</a><o:p></o:p></pre>
<pre>Hewlett-Packard Enterprise<o:p></o:p></pre>
<pre>High Performance Computing Software Engineering<o:p></o:p></pre>
<pre>+1 404 648 9024<o:p></o:p></pre>
<pre>My opinions are not necessarily those of HPE<o:p></o:p></pre>
<pre> May the source be with you!<o:p></o:p></pre>
</div>
</blockquote>
</div>
</div>
</body>
</html>