<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>At least on our cluster, we generally recommend that if you are
submitting large numbers of jobs you either use a job array or
simply loop over the jobs and submit them one at a time. A fork
bomb is definitely not recommended. For the highest submission
throughput a job array is your best bet: a single submission
generates thousands of jobs, which the scheduler can then handle
sensibly. So I highly recommend using job arrays.</p>
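<p>As a minimal sketch of both options (not our site's tooling): the
helper names below are hypothetical, job.sh stands in for your own
batch script, and the only Slurm features assumed are sbatch and its
--array option with the "%" limit on simultaneously running tasks.</p>
<pre>
```python
import subprocess


def array_command(script, n_tasks, max_running=None):
    """Build one sbatch call that submits n_tasks tasks as a job array."""
    spec = f"0-{n_tasks - 1}"
    if max_running is not None:
        # "%N" caps how many array tasks may run at the same time
        spec += f"%{max_running}"
    return ["sbatch", f"--array={spec}", script]


def submit(cmd, run=subprocess.run):
    """Run one sbatch command and return whatever it printed."""
    result = run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def submit_in_loop(scripts, run=subprocess.run):
    """Fallback: submit scripts one after another, never in parallel."""
    return [submit(["sbatch", s], run=run) for s in scripts]
```
</pre>
<p>For example, array_command("job.sh", 1000, max_running=50) builds
sbatch --array=0-999%50 job.sh, which is a single request to the
scheduler. Inside job.sh each task can read SLURM_ARRAY_TASK_ID to
pick its own input. The largest usable array is limited by the
site's MaxArraySize setting, so check that with your admins.</p>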
<p>-Paul Edmon-<br>
</p>
<div class="moz-cite-prefix">On 8/27/19 3:45 AM, Guillaume Perrault
Archambault wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAG1OYp03kcr159miG2NqG24SJx651sjtez1UH_s_hTtZecBSWg@mail.gmail.com">
<div dir="ltr">Hi Paul,
<div><br>
</div>
<div>Thanks a lot for your suggestion.</div>
<div><br>
</div>
<div>The cluster I'm using has thousands of users, so I'm
doubtful the admins will change this setting just for me. But
I'll mention it to the support team I'm working with.</div>
<div><br>
</div>
<div>I was hoping more for something that can be done on the
user end.</div>
<div><br>
</div>
<div>Is there some way for the user to measure whether the
scheduler is in RPC saturation? And then if it is, I could
make sure my script doesn't launch too many jobs in parallel.</div>
<div><br>
</div>
<div>Sorry if my question is too vague, I don't understand the
backend of the SLURM scheduler too well, so my questions are
using the limited terminology of a user.</div>
<div><br>
</div>
<div>My concern is just to make sure that my scripts don't send
out more commands (simultaneously) than the scheduler can
handle.</div>
<div><br>
</div>
<div>For example, as an extreme scenario, suppose a user forks
off 1000 sbatch commands in parallel. Is that more than the
scheduler can handle? As a user, how can I know whether it is?</div>
<div><br>
</div>
<div>Regards,</div>
<div>Guillaume.</div>
<div><br>
</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Aug 26, 2019 at 10:15
AM Paul Edmon &lt;<a href="mailto:pedmon@cfa.harvard.edu"
moz-do-not-send="true">pedmon@cfa.harvard.edu</a>&gt; wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>We've hit this before due to RPC saturation. I highly
recommend setting max_rpc_cnt and/or defer in
SchedulerParameters. That should help alleviate this problem.</p>
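<p>As a rough illustration of what that looks like on the admin
side (a sketch, not our exact configuration; the value 150 is an
assumed placeholder that would need tuning per site), both
options go into SchedulerParameters in slurm.conf:</p>
<pre>
```
# slurm.conf fragment; 150 is an illustrative value, not a recommendation
SchedulerParameters=defer,max_rpc_cnt=150
```
</pre>
<p>With defer, the scheduler no longer tries to schedule each job
individually at submit time. With max_rpc_cnt, slurmctld defers
scheduling while its count of active threads is at or above
that number.</p>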
<p>-Paul Edmon-<br>
</p>
<div class="gmail-m_7693702140876103168moz-cite-prefix">On
8/26/19 2:12 AM, Guillaume Perrault Archambault wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hello,
<div><br>
</div>
<div>I wrote a regression-testing toolkit to manage
large numbers of SLURM jobs and their output (the
toolkit can be found <a
href="https://github.com/gobbedy/slurm_simulation_toolkit/"
target="_blank" moz-do-not-send="true">here</a> if
anyone is interested).</div>
<div><br>
</div>
<div>To make job launching faster, sbatch commands are
forked, so that numerous jobs may be submitted in
parallel.</div>
<div><br>
</div>
<div>We (the cluster admin and myself) are concerned
that this may cause unresponsiveness for other users.</div>
<div><br>
</div>
<div>I cannot say for sure, since I don't have visibility
into all users of the cluster, but unresponsiveness
doesn't seem to have occurred so far. That being said,
the fact that it hasn't occurred yet doesn't mean it
won't in the future. So I'm treating this as a ticking
time bomb to be fixed ASAP.</div>
<div><br>
</div>
<div>My questions are the following:</div>
<div>1) Does anyone have experience with large numbers
of jobs submitted in parallel? What are the limits
that can be hit? For example, is there some hard limit
on how many jobs a SLURM scheduler can handle before
blacking out or slowing down?</div>
<div>2) Is there a way for me to find/measure/ping this
resource limit?</div>
<div>3) How can I make sure I don't hit this resource
limit?</div>
<div><br>
</div>
<div>From what I've observed, parallel submission can
improve submission time by a factor of at least 10. This
can make a big difference in users' workflows.</div>
<div><br>
</div>
<div>For that reason I would like to fall back to
launching jobs sequentially only as a last resort.</div>
<div><br>
</div>
<div>Thanks in advance.</div>
<div><br>
</div>
<div>Regards,</div>
<div>Guillaume.</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</body>
</html>