<div dir="ltr">Hi Paul,<div><br></div><div>Thanks a lot for your suggestion.</div><div><br></div><div>The cluster I'm using has thousands of users, so I doubt the admins will change scheduler settings just for me. I'll still mention it to the support team I'm working with.</div><div><br></div><div>I was hoping for something that can be done on the user end.</div><div><br></div><div>Is there a way for a user to tell whether the scheduler is in RPC saturation? Then, whenever it is, my script could hold back and avoid launching too many jobs in parallel.</div><div><br></div><div>Sorry if my question is too vague. I don't understand the backend of the SLURM scheduler very well, so I'm limited to a user's terminology.</div><div><br></div><div>My concern is simply that my scripts should not send more commands simultaneously than the scheduler can handle.</div><div><br></div><div>As an extreme example, suppose a user forks off 1000 sbatch commands in parallel. Is that more than the scheduler can handle? As a user, how can I know whether it is?</div><div><br></div><div>Regards,</div><div>Guillaume.</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Aug 26, 2019 at 10:15 AM Paul Edmon &lt;<a href="mailto:pedmon@cfa.harvard.edu">pedmon@cfa.harvard.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  <div bgcolor="#FFFFFF">

    <p>We've hit this before due to RPC saturation.  I highly recommend

      setting max_rpc_cnt and/or defer in SchedulerParameters.  That should

      help alleviate this problem.</p>

    <p>-Paul Edmon-<br>

    </p>

    <div class="gmail-m_7693702140876103168moz-cite-prefix">On 8/26/19 2:12 AM, Guillaume Perrault

      Archambault wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">Hello,

        <div><br>

        </div>

        <div>I wrote a regression-testing toolkit to manage large

          numbers of SLURM jobs and their output (the toolkit can be

          found <a href="https://github.com/gobbedy/slurm_simulation_toolkit/" target="_blank">here</a> if anyone is interested).</div>

        <div><br>

        </div>

        <div>To make job launching faster, sbatch commands are forked,

          so that numerous jobs may be submitted in parallel.</div>

        <div><br>

        </div>

        <div>We (the cluster admin and myself) are concerned that this

          may cause unresponsiveness for other users.</div>

        <div><br>

        </div>

        <div>I cannot say for sure since I don't have visibility over

          all users of the cluster, but unresponsiveness doesn't seem to

          have occurred so far. That being said, the fact that it hasn't

          occurred yet doesn't mean it won't in the future. So I'm

          treating this as a ticking time bomb to be fixed asap.</div>

        <div><br>

        </div>

        <div>My questions are the following:</div>

        <div>1) Does anyone have experience with large numbers of jobs

          submitted in parallel? What are the limits that can be hit?

          For example, is there a hard limit on how many job submissions

          a SLURM scheduler can handle before blacking out or slowing down?</div>

        <div>2) Is there a way for me to find/measure/ping this resource

          limit?</div>

        <div>3) How can I make sure I don't hit this resource limit?</div>
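        <div>One user-side way to approach question 3 is to cap how many sbatch calls are in flight at once rather than forking all of them. The following is only a minimal sketch under stated assumptions: the helper names (sbatch_submit, submit_all) and the default cap of 8 are hypothetical, and a safe cap for a given cluster has to come from its admins or from measurement.</div>

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def sbatch_submit(script_path):
    """Submit one batch script with sbatch and return what sbatch printed."""
    result = subprocess.run(
        ["sbatch", script_path],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,
        check=True,  # raise CalledProcessError if sbatch exits non-zero
    )
    return result.stdout.strip()


def submit_all(script_paths, max_parallel=8, submit=sbatch_submit):
    """Submit every script with at most max_parallel submissions in flight.

    max_parallel=8 is an illustrative default, not a recommended value.
    The submit argument is injectable so the throttling can be exercised
    without a SLURM installation.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map returns results in the same order as script_paths
        return list(pool.map(submit, script_paths))
```

        <div>With this shape, submit_all(paths, max_parallel=1) degenerates to sequential submission, so the same code path covers both the fast and the conservative mode.</div>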

        <div><br>

        </div>

        <div>From what I've observed, parallel submission can improve

          submission time by a factor of at least 10. This can make a big

          difference in users' workflows. </div>

        <div><br>

        </div>

        <div>For that reason, I would like to fall back to launching

          jobs sequentially only as a last resort.</div>

        <div><br>

        </div>

        <div>Thanks in advance.</div>

        <div><br>

        </div>

        <div>Regards,</div>

        <div>Guillaume.</div>

      </div>

    </blockquote>

  </div>

</blockquote></div>
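<div>On the question of whether a user can measure scheduler load: SLURM's sdiag command prints slurmctld statistics, and a script could poll it before submitting. The sketch below assumes the output contains lines labelled "Server thread count" and "Agent queue size", and that ordinary users are allowed to run sdiag on the cluster; both should be checked against the installed version. The function names and the thread_limit value are hypothetical, chosen only for illustration.</div>

```python
import re
import subprocess

# Labels assumed to appear in sdiag output; verify on your SLURM version.
_COUNTER = re.compile(
    r"^\s*(Server thread count|Agent queue size):\s*(\d+)",
    re.MULTILINE,
)


def parse_sdiag(text):
    """Extract the two load-related counters from sdiag output text."""
    return {label: int(value) for label, value in _COUNTER.findall(text)}


def read_sdiag():
    """Run sdiag and return its parsed counters."""
    result = subprocess.run(
        ["sdiag"], capture_output=True, text=True, check=True
    )
    return parse_sdiag(result.stdout)


def looks_busy(counters, thread_limit=100):
    """Return True when the server thread count reaches thread_limit.

    thread_limit=100 is an arbitrary illustrative threshold, not a value
    taken from SLURM; a missing counter is treated as zero.
    """
    return counters.get("Server thread count", 0) >= thread_limit
```

<div>A submission loop could call looks_busy(read_sdiag()) between batches and sleep while it returns True. Keep in mind that every sdiag call is itself a request to the controller, so polling should be infrequent.</div>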