[slurm-users] job startup timeouts?

Andy Riebs andy.riebs at hpe.com
Fri Apr 26 12:26:31 UTC 2019


Hi All,

We've got a very large x86_64 cluster with lots of cores on each node, 
and hyper-threading enabled. We're running Slurm 18.08.7 with Open MPI 
4.x on CentOS 7.6.

We have a job that reports

    srun: error: timeout waiting for task launch, started 0 of xxxxxx tasks
    srun: Job step 291963.0 aborted before step completely launched.

when we try to run it at large scale. Extrapolating from our experience 
with smaller numbers of nodes, we expect that the job could take as long 
as 15 minutes to launch.

Is there a timeout setting that we're missing that can be changed to 
accommodate a lengthy startup time like this?
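
For reference, the timeout-related values currently in effect can be 
pulled out of the `scontrol show config` output. The following is only a 
minimal sketch: `list_timeouts` is a hypothetical helper name, and it 
assumes the usual "Name = value" layout of that output. It does not tell 
us which of the listed settings, if any, governs the srun launch timeout.

```shell
# list_timeouts: hypothetical helper, not part of Slurm.
# Reads "Name = value" lines on stdin (the layout assumed for
# `scontrol show config` output) and prints only the lines whose
# parameter name contains "timeout", case-insensitively.
list_timeouts() {
    awk -F'=' 'tolower($1) ~ /timeout/'
}

# Typical use on a node with the Slurm client tools installed:
#   scontrol show config | list_timeouts
```

Matching on the parameter name rather than the whole line keeps out 
settings that merely mention a timeout in their value.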

Andy

-- 

Andy Riebs
andy.riebs at hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
     May the source be with you!
