<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=UTF-8">

  </head>

  <body smarttemplateinserted="true">

    Hi All,<br>

    <br>

    We have a very large x86_64 cluster with many cores per node and
    hyper-threading enabled. We're running Slurm 18.08.7 with
    Open MPI 4.x on CentOS 7.6.<br>

    <br>

    We have a job that reports<br>

    <blockquote>srun: error: timeout waiting for task launch, started 0

      of xxxxxx tasks<br>

      srun: Job step 291963.0 aborted before step completely launched.<br>

    </blockquote>

    when we try to run it at large scale. Based on our experience with
    smaller node counts, we expect the job could take as long as
    15 minutes to launch.<br>

    <br>

    Is there a timeout setting that we're missing that can be changed to

    accommodate a lengthy startup time like this?<br>

    <br>

    Andy<br>

    <br>

    --

    <pre class="moz-signature" cols="72">Andy Riebs
<a class="moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
    May the source be with you!
</pre>

  </body>

</html>