[slurm-users] ticking time bomb? launching too many jobs in parallel
Paul Edmon
pedmon at cfa.harvard.edu
Tue Aug 27 14:05:15 UTC 2019
At least for our cluster, we generally recommend that if you are
submitting large numbers of jobs you either use a job array or simply
loop over the jobs you want to submit. A fork bomb is definitely not
recommended. For the highest-throughput submission, a job array is your
best bet: a single submission generates thousands of jobs, which the
scheduler can then handle sensibly. So I highly recommend using job
arrays.
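For illustration, a minimal array-script sketch (the script name, array
size, and the program it runs are just placeholders):

    #!/bin/bash
    #SBATCH --array=1-1000             # one submission -> 1000 array tasks
    #SBATCH --output=slurm_%A_%a.out   # %A = array job ID, %a = task index

    # Each task picks its own input via the array index.
    # A suffix like --array=1-1000%50 would also cap concurrent tasks at 50.
    srun ./my_program --input input_${SLURM_ARRAY_TASK_ID}.dat

Submitted once with "sbatch my_array_job.sh", this gives the scheduler a
single submission to digest instead of a thousand forked sbatch calls.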
-Paul Edmon-
On 8/27/19 3:45 AM, Guillaume Perrault Archambault wrote:
> Hi Paul,
>
> Thanks a lot for your suggestion.
>
> The cluster I'm using has thousands of users, so I'm doubtful the
> admins will change this setting just for me. But I'll mention it to
> the support team I'm working with.
>
> I was hoping more for something that can be done on the user end.
>
> Is there some way for the user to measure whether the scheduler is in
> RPC saturation? If it is, I could make sure my script doesn't launch
> too many jobs in parallel.
>
> Sorry if my question is too vague; I don't understand the backend of
> the SLURM scheduler very well, so my questions use the limited
> terminology of a user.
>
> My concern is just to make sure that my scripts don't send out more
> commands (simultaneously) than the scheduler can handle.
>
> For example, as an extreme scenario, suppose a user forks off 1000
> sbatch commands in parallel. Is that more than the scheduler can
> handle? As a user, how can I know whether it is?
>
> Regards,
> Guillaume.
>
>
>
> On Mon, Aug 26, 2019 at 10:15 AM Paul Edmon <pedmon at cfa.harvard.edu> wrote:
>
> We've hit this before due to RPC saturation. I highly recommend
> using max_rpc_cnt and/or defer for scheduling. That should help
> alleviate this problem.
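>
> For reference, those knobs live under SchedulerParameters in
> slurm.conf; the value below is only illustrative, not a recommendation:
>
>     SchedulerParameters=defer,max_rpc_cnt=150
>     # defer: skip the attempt to schedule each job at submit time
>     # max_rpc_cnt: pause per-job scheduling while this many RPCs/threads
>     #              are already active in slurmctld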
>
> -Paul Edmon-
>
> On 8/26/19 2:12 AM, Guillaume Perrault Archambault wrote:
>> Hello,
>>
>> I wrote a regression-testing toolkit to manage large numbers of
>> SLURM jobs and their output (the toolkit can be found here
>> <https://github.com/gobbedy/slurm_simulation_toolkit/> if anyone
>> is interested).
>>
>> To make job launching faster, sbatch commands are forked, so that
>> numerous jobs may be submitted in parallel.
>>
>> We (the cluster admin and myself) are concerned that this may
>> cause unresponsiveness for other users.
>>
>> I cannot say for sure since I don't have visibility over all
>> users of the cluster, but unresponsiveness doesn't seem to have
>> occurred so far. That being said, the fact that it hasn't
>> occurred yet doesn't mean it won't in the future. So I'm treating
>> this as a ticking time bomb to be fixed asap.
>>
>> My questions are the following:
>> 1) Does anyone have experience with large numbers of jobs
>> submitted in parallel? What are the limits that can be hit? For
>> example, is there some hard limit on how many jobs a SLURM
>> scheduler can handle before blacking out or slowing down?
>> 2) Is there a way for me to find/measure/ping this resource limit?
>> 3) How can I make sure I don't hit this resource limit?
>>
>> From what I've observed, parallel submission can improve
>> submission time by at least 10x. This can make a big
>> difference in users' workflows.
>>
>> For that reason I would like to treat launching jobs sequentially
>> only as a last resort.
>>
>> Thanks in advance.
>>
>> Regards,
>> Guillaume.
>