<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Just a couple of comments from general experience:</p>
<p>1) If you can, use either xargs or parallel to do the forking,
so that you can limit the number of simultaneous submissions.</p>
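<p>A minimal sketch of that, assuming a jobs.txt file with one
batch-script path per line (the file name and the cap of 4 are just
placeholders):</p>
<pre># Submit one sbatch per line of jobs.txt, at most 4 at a time.
# -P 4 caps the number of concurrent sbatch processes;
# -n 1 passes one script path per invocation.
xargs -P 4 -n 1 sbatch &lt; jobs.txt

# GNU parallel equivalent, also capped at 4 concurrent submissions:
parallel -j 4 sbatch {} :::: jobs.txt</pre>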
<p>2) I have yet to see a case where it is a good idea to submit
many separate jobs when a job array can do the work.<br>
</p>
<p>If you can prepare a proper input file for a script, a single
submission is all it takes. Then you can control how many array
tasks run at once (the array task throttle, i.e. the % suffix on
--array) and change that to scale up/down, as in the sketch
below.</p>
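<p>A rough sketch of what that looks like (the script name and the
line-per-task input file are illustrative, not anything specific):</p>
<pre>#!/bin/bash
#SBATCH --array=1-1000%50   # 1000 tasks, at most 50 running at once

# Each array task pulls its own line from the prepared input file.
ARGS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)
./run_test.sh $ARGS</pre>
<p>The %50 throttle can be changed later on a live array, e.g.
"scontrol update JobId=&lt;array_job_id&gt; ArrayTaskThrottle=100",
so you can scale up or down without resubmitting.</p>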
<p><br>
</p>
<p>Brian Andrus<br>
</p>
<p><br>
</p>
<div class="moz-cite-prefix">On 8/25/2019 11:12 PM, Guillaume
Perrault Archambault wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAG1OYp0CTPOpLdfdOUzUh-y=EuWx+xb+U2VGNwBrXJ_9-HTyMQ@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">Hello,
<div><br>
</div>
<div>I wrote a regression-testing toolkit to manage large
numbers of SLURM jobs and their output (the toolkit can be
found <a
href="https://github.com/gobbedy/slurm_simulation_toolkit/"
moz-do-not-send="true">here</a> if anyone is interested).</div>
<div><br>
</div>
<div>To make job launching faster, sbatch commands are forked so
that numerous jobs can be submitted in parallel.</div>
<div><br>
</div>
<div>We (the cluster admin and I) are concerned that this may
cause unresponsiveness for other users.</div>
<div><br>
</div>
<div>I cannot say for sure, since I don't have visibility into all
users of the cluster, but unresponsiveness doesn't seem to have
occurred so far. That said, the fact that it hasn't happened yet
doesn't mean it won't in the future, so I'm treating this as a
ticking time bomb to be fixed ASAP.</div>
<div><br>
</div>
<div>My questions are the following:</div>
<div>1) Does anyone have experience with large numbers of jobs
submitted in parallel? What are the limits that can be hit? For
example, is there some hard limit on how many jobs a SLURM
scheduler can handle before blacking out or slowing down?</div>
<div>2) Is there a way for me to find/measure/ping this resource
limit?</div>
<div>3) How can I make sure I don't hit this resource limit?</div>
<div><br>
</div>
<div>From what I've observed, parallel submission can improve
submission time by a factor of at least 10. This can make a big
difference in users' workflows.</div>
<div><br>
</div>
<div>For that reason, I would prefer to fall back to launching
jobs sequentially only as a last resort.</div>
<div><br>
</div>
<div>Thanks in advance.</div>
<div><br>
</div>
<div>Regards,</div>
<div>Guillaume.</div>
</div>
</blockquote>
</body>
</html>