<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
Thanks, Doug -- your cluster is bigger than mine, and your answer ("a
few seconds") is much closer to what I was expecting to see here.<br>
<br>
> Do you know if all the slurmstepd's are starting quickly on the
compute nodes?<br>
<br>
We'll be looking into this.<br>
<br>
> How is the OS/Slurm/executable delivered to the node?<br>
<br>
Particularly in the case of "srun hostname", everything is on local
disks.<br>
<br>
Thanks again!<br>
Andy<br>
<br>
<div id="smartTemplate4-quoteHeader">
<hr> <b>From:</b> Douglas Jacobsen <a class="moz-txt-link-rfc2396E" href="mailto:dmjacobsen@lbl.gov"><dmjacobsen@lbl.gov></a> <br>
<b>Sent:</b> Friday, April 26, 2019 10:46 AM <br>
<b>To:</b> Slurm User Community List
<a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com"><slurm-users@lists.schedmd.com></a><br>
<b>Subject:</b> Re: [slurm-users] job startup timeouts? <br>
</div>
<div class="replaced-blockquote"
cite="mid:CAHaWJFGH+NHjTWjj2hCmZV2_Z89-mWjNPMiWe_AJaJ2yFjq=1g@mail.gmail.com"
type="cite">
<pre class="moz-quote-pre" wrap="">We have 12,000 nodes in our system, 9,600 of which are KNL. We can
start a parallel application within a few seconds in most cases (when
the machine is dedicated to this task), even at full scale. So I
don't think there is anything intrinsic to Slurm that would
necessarily be limiting you, though we have seen cases in the past
where "arbitrary" task distribution (srun --distribution=arbitrary)
has caused controller slow-down issues as the detailed distribution
scheme was parsed.
Do you know if all the slurmstepd's are starting quickly on the
compute nodes? How is the OS/Slurm/executable delivered to the node?
----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
Acting Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center
<a class="moz-txt-link-abbreviated" href="mailto:dmjacobsen@lbl.gov">dmjacobsen@lbl.gov</a>
------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________
On Fri, Apr 26, 2019 at 7:40 AM Riebs, Andy <a class="moz-txt-link-rfc2396E" href="mailto:andy.riebs@hpe.com"><andy.riebs@hpe.com></a> wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Thanks for the quick response, Doug!
Unfortunately, I can't be specific about the cluster size, other than to say it's got more than a thousand nodes.
In a separate test that I had missed, even "srun hostname" took 5 minutes to run. So there was no remote file system or MPI involvement.
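(The test was essentially just

    time srun -N 1024 hostname    # node count here is illustrative

with no file staging or MPI startup involved.)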
Andy
-----Original Message-----
From: slurm-users [<a class="moz-txt-link-freetext" href="mailto:slurm-users-bounces@lists.schedmd.com">mailto:slurm-users-bounces@lists.schedmd.com</a>] On Behalf Of Douglas Jacobsen
Sent: Friday, April 26, 2019 9:24 AM
To: Slurm User Community List <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com"><slurm-users@lists.schedmd.com></a>
Subject: Re: [slurm-users] job startup timeouts?
How large is very large? Where is the executable being started? In
the parallel filesystem/NFS? If that is the case you may be able to
trim start times by using sbcast to transfer the executable (and its
dependencies if dynamically linked) into a node-local resource, such
as /tmp or /dev/shm depending on your local configuration.
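A minimal sketch of that staging pattern in a batch script ("./myapp",
the node count, and the /tmp target are placeholders for your own
binary and configuration):

    #!/bin/bash
    #SBATCH --nodes=1024
    # Broadcast the executable onto node-local storage on every
    # allocated node, then launch the staged copy rather than the
    # one on the shared filesystem.
    sbcast ./myapp /tmp/myapp
    srun /tmp/myapp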
----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
Acting Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center
<a class="moz-txt-link-abbreviated" href="mailto:dmjacobsen@lbl.gov">dmjacobsen@lbl.gov</a>
------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________
On Fri, Apr 26, 2019 at 5:34 AM Andy Riebs <a class="moz-txt-link-rfc2396E" href="mailto:andy.riebs@hpe.com"><andy.riebs@hpe.com></a> wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Hi All,
We've got a very large x86_64 cluster with lots of cores on each node, and hyper-threading enabled. We're running Slurm 18.08.7 with Open MPI 4.x on CentOS 7.6.
We have a job that reports
srun: error: timeout waiting for task launch, started 0 of xxxxxx tasks
srun: Job step 291963.0 aborted before step completely launched.
when we try to run it at large scale. We anticipate that it could take as long as 15 minutes for the job to launch, based on our experience with smaller numbers of nodes.
Is there a timeout setting that we're missing that can be changed to accommodate a lengthy startup time like this?
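For what it's worth, the slurm.conf parameters we've found that look
related (a guess on our part; values shown are the documented defaults,
in seconds):

    MessageTimeout=10     # round-trip message timeout between daemons
    BatchStartTimeout=10  # max time allowed for a batch job launch to complete
    TCPTimeout=2          # timeout for establishing TCP connections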
Andy
--
Andy Riebs
<a class="moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>
Hewlett Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!
</pre>
</blockquote>
</blockquote>
</div>
<br>
</body>
</html>