[slurm-users] ticking time bomb? launching too many jobs in parallel

Tue Aug 27 16:01:21 UTC 2019

Thanks, Ole, for giving so much thought to my question. I'll pass along
these suggestions. Unfortunately, as a user, there's not a whole lot I can do
about the choice of hardware.

Thanks for the link to the guide, I'll have a look at it. Even as a user
it's helpful to be well informed on the admin side :)

Regards,
Guillaume.

On Tue, Aug 27, 2019 at 4:26 AM Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
wrote:

> Hi Guillaume,
>
> The performance of the slurmctld server depends strongly on the server
> hardware on which it is running!  This should be taken into account when
> considering your question.
>
> SchedMD recommends that the slurmctld server should have only a few, but
> very fast CPU cores, in order to ensure the best responsiveness.
>
> The file system for /var/spool/slurmctld/ should be mounted on the
> fastest possible disks (SSD or NVMe if possible).
>
> You should also read the Large Cluster Administration Guide at
> https://slurm.schedmd.com/big_sys.html
>
> Furthermore, it may be a good idea to install the MySQL database
> server on a separate machine so that it doesn't slow down
> slurmctld.
>
> Best regards,
> Ole
>
> On 8/27/19 9:45 AM, Guillaume Perrault Archambault wrote:
> > Hi Paul,
> >
> > Thanks a lot for your suggestion.
> >
> > The cluster I'm using has thousands of users, so I'm doubtful the admins
> > will change this setting just for me. But I'll mention it to the support
> > team I'm working with.
> >
> > I was hoping more for something that can be done on the user end.
> >
> > Is there some way for the user to measure whether the scheduler is in
> > RPC saturation? And then if it is, I could make sure my script doesn't
> > launch too many jobs in parallel.
> >
> > Sorry if my question is too vague, I don't understand the backend of the
> > SLURM scheduler too well, so my questions are using the limited
> > terminology of a user.
> >
> > My concern is just to make sure that my scripts don't send out more
> > commands (simultaneously) than the scheduler can handle.
> >
> > For example, as an extreme scenario, suppose a user forks off 1000
> > sbatch commands in parallel, is that more than the scheduler can handle?
> > As a user, how can I know whether it is?
>
>
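On the user-side question quoted above (how to avoid forking off 1000
sbatch commands at once), one option is to cap how many submissions are in
flight at any moment. The following is only a minimal sketch, not a
recommendation from SchedMD or from anyone in this thread: the helper names
submit_one and submit_all, the default of 4 parallel submissions, and the
retry/backoff values are all assumptions chosen for illustration. What a
given slurmctld can tolerate depends on the site, so the support team is the
right place to ask for a sensible limit.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def submit_one(script, command=("sbatch",), retries=3, backoff=2.0):
    """Run `command script`; on failure, wait a little longer and try again.

    Returns (script, return code, stdout on success or stderr on failure).
    All defaults here are illustrative assumptions, not tuned values.
    """
    result = None
    for attempt in range(max(1, retries)):
        if attempt:
            # Back off before retrying so a busy controller gets some room.
            time.sleep(backoff * attempt)
        result = subprocess.run(
            [*command, script], capture_output=True, text=True
        )
        if result.returncode == 0:
            break
    output = result.stdout if result.returncode == 0 else result.stderr
    return script, result.returncode, output.strip()


def submit_all(scripts, max_parallel=4, **kwargs):
    """Submit every script, with at most `max_parallel` submissions at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map returns results in the same order as `scripts`.
        return list(pool.map(lambda s: submit_one(s, **kwargs), scripts))


if __name__ == "__main__":
    # Hypothetical script names, for illustration only.
    for name, code, message in submit_all(["job_a.sh", "job_b.sh"]):
        print(name, code, message)
```

The thread pool bounds concurrency on the client side only; it does not
measure the load on the scheduler. Where the jobs differ only by an index, a
single job array submission (sbatch --array) avoids issuing many separate
sbatch calls in the first place.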