[slurm-users] ticking time bomb? launching too many jobs in parallel
Brian Andrus
toomuchit at gmail.com
Tue Aug 27 16:47:20 UTC 2019
Just a couple of comments from general experience:
1) If you can, use either xargs or parallel to do the forking, so you can
limit the number of simultaneous submissions.
2) I have yet to see a case where many separate jobs are a good idea
when a job array would work.
If you can prepare a proper input file for a script, a single
submission is all it takes. Then you can control how many tasks are
running at once (the "%" limit on --array, i.e. ArrayTaskThrottle) and
change that later to scale up or down.
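To make both points concrete, here is a minimal sketch in Python, not a tested production tool. The helper names (array_command, submit, submit_throttled) and the script name job.sh are made up for illustration; the only Slurm feature relied on is the "%" throttle in sbatch's --array option.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def array_command(script, n_tasks, max_running):
    """Build one sbatch call covering n_tasks array tasks (hypothetical helper)."""
    # The "%" suffix in --array caps how many array tasks run at the same time.
    return ["sbatch", f"--array=0-{n_tasks - 1}%{max_running}", script]


def submit(cmd):
    """Run a single submission command and return what it printed."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def submit_throttled(commands, max_parallel=4, run=submit):
    """Submit many commands with at most max_parallel in flight at once.

    This plays the same role as `xargs -P` or GNU parallel: the worker pool
    size bounds the number of simultaneous sbatch calls hitting the controller.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map returns results in the same order as the input commands.
        return list(pool.map(run, commands))
```

With an array, array_command("job.sh", 100, 10) gives a single submission for 100 tasks with at most 10 running at once; inside job.sh each task can use SLURM_ARRAY_TASK_ID to pick its line from the input file. The throttle can be changed afterwards with "scontrol update JobId=<jobid> ArrayTaskThrottle=<count>". If separate jobs really are needed, submit_throttled keeps the fork count bounded instead of launching everything at once.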
Brian Andrus
On 8/25/2019 11:12 PM, Guillaume Perrault Archambault wrote:
> Hello,
>
> I wrote a regression-testing toolkit to manage large numbers of SLURM
> jobs and their output (the toolkit can be found here
> <https://github.com/gobbedy/slurm_simulation_toolkit/> if anyone is
> interested).
>
> To make job launching faster, sbatch commands are forked, so that
> numerous jobs may be submitted in parallel.
>
> We (the cluster admin and myself) are concerned that this may cause
> unresponsiveness for other users.
>
> I cannot say for sure since I don't have visibility over all users of
> the cluster, but unresponsiveness doesn't seem to have occurred so
> far. That being said, the fact that it hasn't occurred yet doesn't
> mean it won't in the future. So I'm treating this as a ticking time
> bomb to be fixed asap.
>
> My questions are the following:
> 1) Does anyone have experience with large numbers of jobs submitted in
> parallel? What are the limits that can be hit? For example is there
> some hard limit on how many jobs a SLURM scheduler can handle before
> blacking out / slowing down?
> 2) Is there a way for me to find/measure/ping this resource limit?
> 3) How can I make sure I don't hit this resource limit?
>
> From what I've observed, parallel submission can improve submission
> time by a factor of at least 10. This can make a big difference in
> users' workflows.
>
> For that reason I would like to keep the option of launching jobs
> sequentially as a last resort.
>
> Thanks in advance.
>
> Regards,
> Guillaume.