[slurm-users] job startup timeouts?

John Hearns hearnsj at googlemail.com
Fri Apr 26 14:56:25 UTC 2019


It's a DNS problem, isn't it?   Seriously though - how long does srun
hostname take for a single system?
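A quick sanity check, assuming a default partition is configured (node counts and flags here are illustrative, not from the thread):

```shell
# Time a trivial single-node, single-task step launch.
# If even this takes minutes, the bottleneck is the launch
# path itself (DNS, munge, slurmd startup) rather than MPI
# or the application.
time srun -N1 -n1 hostname
```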


On Fri, 26 Apr 2019 at 15:49, Douglas Jacobsen <dmjacobsen at lbl.gov> wrote:

> We have 12,000 nodes in our system, 9,600 of which are KNL.  We can
> start a parallel application within a few seconds in most cases (when
> the machine is dedicated to this task), even at full scale.  So I
> don't think there is anything intrinsic to Slurm that would
> necessarily be limiting you, though we have seen cases in the past
> where arbitrary task distribution has caused controller slow-down
> issues as the detailed scheme was parsed.
>
> Do you know if all the slurmstepd's are starting quickly on the
> compute nodes?  How is the OS/Slurm/executable delivered to the node?
> ----
> Doug Jacobsen, Ph.D.
> NERSC Computer Systems Engineer
> Acting Group Lead, Computational Systems Group
> National Energy Research Scientific Computing Center
> dmjacobsen at lbl.gov
>
> ------------- __o
> ---------- _ '\<,_
> ----------(_)/  (_)__________________________
>
>
> On Fri, Apr 26, 2019 at 7:40 AM Riebs, Andy <andy.riebs at hpe.com> wrote:
> >
> > Thanks for the quick response Doug!
> >
> > Unfortunately, I can't be specific about the cluster size, other than to
> say it's got more than a thousand nodes.
> >
> > In a separate test that I had missed, even "srun hostname" took 5
> minutes to run. So there was no remote file system or MPI involvement.
> >
> > Andy
> >
> > -----Original Message-----
> > From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On
> Behalf Of Douglas Jacobsen
> > Sent: Friday, April 26, 2019 9:24 AM
> > To: Slurm User Community List <slurm-users at lists.schedmd.com>
> > Subject: Re: [slurm-users] job startup timeouts?
> >
> > How large is very large?  Where is the executable being started?  In
> > the parallel filesystem/NFS?  If that is the case you may be able to
> > trim start times by using sbcast to transfer the executable (and its
> > dependencies if dynamically linked) into a node-local resource, such
> > as /tmp or /dev/shm depending on your local configuration.
> > ----
> > Doug Jacobsen, Ph.D.
> > NERSC Computer Systems Engineer
> > Acting Group Lead, Computational Systems Group
> > National Energy Research Scientific Computing Center
> > dmjacobsen at lbl.gov
> >
> > ------------- __o
> > ---------- _ '\<,_
> > ----------(_)/  (_)__________________________
> >
> >
> > On Fri, Apr 26, 2019 at 5:34 AM Andy Riebs <andy.riebs at hpe.com> wrote:
> > >
> > > Hi All,
> > >
> > > We've got a very large x86_64 cluster with lots of cores on each node,
> and hyper-threading enabled. We're running Slurm 18.08.7 with Open MPI 4.x
> on CentOS 7.6.
> > >
> > > We have a job that reports
> > >
> > > srun: error: timeout waiting for task launch, started 0 of xxxxxx tasks
> > > srun: Job step 291963.0 aborted before step completely launched.
> > >
> > > when we try to run it at large scale. We anticipate that it could take
> as long as 15 minutes for the job to launch, based on our experience with
> smaller numbers of nodes.
> > >
> > > Is there a timeout setting that we're missing that can be changed to
> accommodate a lengthy startup time like this?
> > >
> > > Andy
> > >
> > > --
> > >
> > > Andy Riebs
> > > andy.riebs at hpe.com
> > > Hewlett-Packard Enterprise
> > > High Performance Computing Software Engineering
> > > +1 404 648 9024
> > > My opinions are not necessarily those of HPE
> > >     May the source be with you!
> >
>
>
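For reference, the sbcast approach Doug describes might look roughly like this in a batch script (the application name, node count, and paths are placeholders, not from the thread):

```shell
#!/bin/bash
#SBATCH -N 1024          # illustrative node count
#SBATCH -t 00:30:00

# Copy the executable from the parallel filesystem into
# node-local /tmp on every allocated node, then launch the
# local copy, so step startup does not hammer the shared
# filesystem. "my_app" is a placeholder name.
sbcast ./my_app /tmp/my_app
srun /tmp/my_app
```

Dynamically linked dependencies would need to be staged the same way, or the binary built statically.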
