[slurm-users] ticking time bomb? launching too many jobs in parallel

Paul Edmon pedmon at cfa.harvard.edu
Mon Aug 26 14:13:05 UTC 2019

We've hit this before due to RPC saturation on slurmctld.  I highly 
recommend setting max_rpc_cnt and/or defer in SchedulerParameters 
(slurm.conf).  That should help alleviate this problem.
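
On measuring how close the controller is to saturation: max_rpc_cnt is 
compared against the number of active slurmctld threads, and sdiag reports 
a thread count.  Below is a rough sketch of reading it from Python.  It 
assumes sdiag prints a line of the form "Server thread count: N"; the 
function names and the threshold are illustrative and not part of Slurm.

```python
import re
import subprocess


def server_thread_count(sdiag_text):
    """Return the slurmctld thread count found in sdiag output, or None."""
    # Assumption: sdiag prints a line such as "Server thread count:  3".
    match = re.search(r"^\s*Server thread count:\s*(\d+)", sdiag_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def controller_is_busy(threshold, sdiag_text=None):
    """True when the reported thread count is at or above threshold."""
    if sdiag_text is None:
        # Run the real sdiag; this needs the Slurm client tools on PATH.
        sdiag_text = subprocess.run(
            ["sdiag"], check=True, capture_output=True, text=True
        ).stdout
    count = server_thread_count(sdiag_text)
    return count is not None and count >= threshold
```

A submission tool could call controller_is_busy() with a threshold somewhat 
below the configured max_rpc_cnt and pause while it returns True.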

-Paul Edmon-

On 8/26/19 2:12 AM, Guillaume Perrault Archambault wrote:
> Hello,
> I wrote a regression-testing toolkit to manage large numbers of SLURM 
> jobs and their output (the toolkit can be found here 
> <https://github.com/gobbedy/slurm_simulation_toolkit/> if anyone is 
> interested).
> To make job launching faster, sbatch commands are forked, so that 
> numerous jobs may be submitted in parallel.
> We (the cluster admin and myself) are concerned that this may cause 
> unresponsiveness for other users.
> I cannot say for sure since I don't have visibility over all users of 
> the cluster, but unresponsiveness doesn't seem to have occurred so 
> far. That being said, the fact that it hasn't occurred yet doesn't 
> mean it won't in the future. So I'm treating this as a ticking time 
> bomb to be fixed asap.
> My questions are the following:
> 1) Does anyone have experience with large numbers of jobs submitted in 
> parallel? What are the limits that can be hit? For example is there 
> some hard limit on how many jobs a SLURM scheduler can handle before 
> blacking out / slowing down?
> 2) Is there a way for me to find/measure/ping this resource limit?
> 3) How can I make sure I don't hit this resource limit?
> From what I've observed, parallel submission can speed up submission 
> by a factor of at least 10. This can make a big difference in 
> users' workflows.
> For that reason I would like to keep parallel submission and fall back 
> to launching jobs sequentially only as a last resort.
> Thanks in advance.
> Regards,
> Guillaume.
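
On the submission side, one way to keep most of the speedup from forked 
sbatch calls while limiting the load on slurmctld is to cap how many 
submissions are in flight at once.  This is a minimal sketch, not taken 
from the toolkit linked above: submit_all, run_sbatch and the default of 4 
are hypothetical names and values, and a sensible cap depends on the site.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_sbatch(script):
    """Submit one batch script and return the job ID printed by sbatch."""
    # --parsable makes sbatch print "jobid" or "jobid;cluster" only.
    result = subprocess.run(
        ["sbatch", "--parsable", script],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip().split(";")[0]


def submit_all(scripts, max_parallel=4, submit=run_sbatch):
    """Submit every script with at most max_parallel submissions in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves input order, so job IDs line up with scripts.
        return list(pool.map(submit, scripts))
```

For example, submit_all(["a.sh", "b.sh", "c.sh"], max_parallel=2) never runs 
more than two sbatch processes at a time.  The submit argument exists so 
the throttling can be exercised without a cluster.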