[slurm-users] ticking time bomb? launching too many jobs in parallel

Guillaume Perrault Archambault gperr050 at uottawa.ca
Mon Aug 26 06:12:42 UTC 2019


Hello,

I wrote a regression-testing toolkit to manage large numbers of SLURM jobs
and their output (the toolkit is available at
https://github.com/gobbedy/slurm_simulation_toolkit/ if anyone is
interested).

To make job launching faster, sbatch commands are forked, so that numerous
jobs may be submitted in parallel.
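For concreteness, the idea is roughly equivalent to the Python sketch below
(the real toolkit is shell-based; the worker cap, script names and job count
here are made-up placeholders). A thread pool fires off sbatch calls
concurrently, and capping the pool size (max_workers) would presumably be the
obvious way to bound how many submissions hit the controller at once:

import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 8                                     # placeholder cap on concurrent sbatch calls
BATCH_SCRIPTS = [f"job_{i}.sh" for i in range(200)]  # placeholder batch scripts

def submit(script):
    # sbatch prints "Submitted batch job <jobid>" on success
    result = subprocess.run(["sbatch", script],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    for line in pool.map(submit, BATCH_SCRIPTS):
        print(line)

Setting the cap to 1 would degrade this to purely sequential submission,
which is the last-resort fallback I mention below.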

We (the cluster admin and I) are concerned that submitting so many jobs at
once may make the scheduler unresponsive for other users.

I can't say for sure, since I don't have visibility into all users of the
cluster, but unresponsiveness doesn't seem to have occurred so far. That
said, the fact that it hasn't happened yet doesn't mean it won't in the
future, so I'm treating this as a ticking time bomb to be defused ASAP.

My questions are the following:
1) Does anyone have experience with large numbers of jobs submitted in
parallel? What limits can be hit? For example, is there some hard limit on
how many submissions a SLURM controller can handle before it slows down or
stops responding?
2) Is there a way for me to find/measure/ping this resource limit? (A rough
sketch of the only checks I currently know how to make follows this list.)
3) How can I make sure I don't hit this resource limit?
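Regarding question 2, the only checks I currently know how to make are via
the standard client tools, roughly as in the sketch below. The config keys I
print are just the ones I suspect are relevant (MaxJobCount, MaxArraySize,
SchedulerParameters), so corrections are welcome; I understand per-user
submit limits (MaxSubmitJobs) can also be set in the accounting database via
sacctmgr, which the admin would know better than I do.

import subprocess

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Cluster-wide queue size limit and scheduler tuning knobs from the controller config.
for line in run(["scontrol", "show", "config"]).splitlines():
    if line.startswith(("MaxJobCount", "MaxArraySize", "SchedulerParameters")):
        print(line)

# Scheduler diagnostics: RPC counts and processing times give a rough feel
# for how loaded slurmctld currently is.
print(run(["sdiag"]))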

From what I've observed, parallel submission can improve submission time by
a factor of at least 10, which can make a big difference in users' workflows.

For that reason, I would like to keep parallel submission and fall back to
sequential launching only as a last resort.

Thanks in advance.

Regards,
Guillaume.


More information about the slurm-users mailing list