<div dir="ltr"><div>Perhaps fire from srun with -vvv to get maximum verbose messages as srun fires through job.</div><div><br></div><div>Doug<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jan 31, 2019 at 12:07 PM Andy Riebs <<a href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
Hi All,<br>
<br>
Just checking to see if this sounds familiar to anyone.<br>
<br>
Environment:<br>
- CentOS 7.5 x86_64<br>
- Slurm 17.11.10 (but this also happened with 17.11.5)<br>
<br>
We typically run about 100 tests/night, selected from a handful of
favorites. For roughly 1 in 300 test runs, we see one of two
mysterious failures:<br>
<br>
1. The 5-minute cancellation<br>
<br>
A job will be rolling along, generating its expected output, and
then this message appears:<br>
<blockquote>srun: forcing job termination<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.<br>
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***<br>
srun: error: nodename: task 250: Terminated<br>
srun: Terminating job step 3531.0<br>
</blockquote>
sacct reports<br>
<blockquote><pre>
       JobID               Start                 End ExitCode      State
------------ ------------------- ------------------- -------- ----------
3418         2019-01-29T05:54:07 2019-01-29T05:59:16      0:9     FAILED
</pre></blockquote>
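(That report comes from something like the following invocation; the exact field list is a guess at what we pass to --format:)<br>
<pre>
sacct -j 3418 --format=JobID,Start,End,ExitCode,State
</pre>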
When these failures occur, they consistently strike just about 5 minutes into the
run.<br>
<br>
2. The random cancellation<br>
<br>
As above, a job will be generating the expected output, and then we
see<br>
<blockquote>srun: forcing job termination<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.<br>
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***<br>
srun: error: nodename: task 250: Terminated<br>
srun: Terminating job step 3531.0<br>
</blockquote>
But this time, sacct reports<br>
<blockquote><pre>
       JobID               Start                 End ExitCode      State
------------ ------------------- ------------------- -------- ----------
3531         2019-01-30T07:21:25 2019-01-30T07:35:50      0:0  COMPLETED
3531.0       2019-01-30T07:21:27 2019-01-30T07:35:56     0:15  CANCELLED
</pre></blockquote>
I think we've seen these cancellations pop up anywhere from a minute or
two into the test run to perhaps 20 minutes in.<br>
<br>
The only thing slightly unusual in our job submissions is that we
use srun's "--immediate=120" so that the scripts can respond
appropriately if a node goes down.<br>
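For reference, here's a stripped-down sketch of a typical submission (the node/task counts and test binary name are placeholders):<br>
<pre>
# --immediate=120: give up if the step cannot launch within 120 seconds,
# letting the wrapper script flag a down node instead of hanging forever.
srun --immediate=120 -N 16 -n 512 ./nightly_test
</pre>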
<br>
With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a
clue in the slurmctld or slurmd logs.<br>
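(Those levels were set via the usual slurm.conf knobs, i.e.:)<br>
<pre>
# slurm.conf excerpt -- debug levels used while chasing this problem
SlurmctldDebug=debug2
SlurmdDebug=debug5
</pre>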
<br>
Any thoughts on what might be happening, or what I might try next?<br>
<br>
Andy<br>
<br>
<pre class="gmail-m_-5512260739653731564moz-signature" cols="72">--
Andy Riebs
<a class="gmail-m_-5512260739653731564moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com" target="_blank">andy.riebs@hpe.com</a>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!
</pre>
</div>
</blockquote></div>
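<div><br></div><div>Something along these lines, perhaps (a minimal sketch; the node count, test binary, and log file name are placeholders):</div><div><pre>
# Re-run one of the failing tests with maximum client-side verbosity,
# capturing srun's debug messages on stderr for later inspection.
srun -vvv -N 4 ./nightly_test 2> srun-debug.log
</pre></div><div>Each extra -v bumps the verbosity one level, so -vvv should show every step of srun's launch, I/O, and termination handling.</div>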