[slurm-users] job startup timeouts?

Douglas Jacobsen dmjacobsen at lbl.gov
Fri Apr 26 13:24:25 UTC 2019


How large is very large?  And where is the executable being started
from -- a parallel filesystem or NFS?  If so, you may be able to trim
start times by using sbcast to transfer the executable (and its
dependencies, if it is dynamically linked) to a node-local resource
such as /tmp or /dev/shm, depending on your local configuration.
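
As a rough sketch only (the binary name and paths below are
placeholders you would adapt to your site), the batch script would
stage the binary before the real launch:

    # Stage the binary from shared storage to node-local /tmp on
    # every node allocated to the job
    sbcast --force ./my_app /tmp/my_app

    # Launch the node-local copy instead of the one on the shared
    # filesystem
    srun /tmp/my_app

If the binary is dynamically linked, you would sbcast its shared
library dependencies (as reported by ldd) the same way and point
LD_LIBRARY_PATH at the node-local directory.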
----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
Acting Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center
dmjacobsen at lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


On Fri, Apr 26, 2019 at 5:34 AM Andy Riebs <andy.riebs at hpe.com> wrote:
>
> Hi All,
>
> We've got a very large x86_64 cluster with lots of cores on each node, and hyper-threading enabled. We're running Slurm 18.08.7 with Open MPI 4.x on CentOS 7.6.
>
> We have a job that reports
>
> srun: error: timeout waiting for task launch, started 0 of xxxxxx tasks
> srun: Job step 291963.0 aborted before step completely launched.
>
> when we try to run it at large scale. Extrapolating from our experience with smaller node counts, we anticipate that the job could take as long as 15 minutes to launch.
>
> Is there a timeout setting that we're missing that can be changed to accommodate a lengthy startup time like this?
>
> Andy
>
> --
>
> Andy Riebs
> andy.riebs at hpe.com
> Hewlett-Packard Enterprise
> High Performance Computing Software Engineering
> +1 404 648 9024
> My opinions are not necessarily those of HPE
>     May the source be with you!


