<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    Nicholas,<br>

    <br>

    <blockquote type="cite">

      <div class="gmail_default"

        style="font-family:monospace,monospace;font-size:small">Why do

        you have?</div>

      <div class="gmail_default"

        style="font-family:monospace,monospace;font-size:small">

        <pre class="gmail-aLF-aPX-K0-aPE">SchedulerParameters     = (null)</pre>

      </div>

    </blockquote>

    I did not set these parameters, so I assume "(null)" means all the

    default values are used.<br>

    <br>

    John,<br>

    <br>

    thanks, I'll try that, and look into these SchedulerParameters more.<br>

    <br>

    Cheers,<br>

    Colas<br>

    <br>

    <div class="moz-cite-prefix">On 2018-01-12 09:08, John DeSantis

      wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:20180112090801.0abb9915@fusionb0x">

      <pre wrap="">Colas,

We had a similar experience a long time ago, and we solved it by adding

the following SchedulerParameters:

max_rpc_cnt=150,defer
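
(For reference, a minimal sketch of how this might look in slurm.conf,
assuming SchedulerParameters is not already set there; if it is, the
options would be appended to the existing comma-separated list. The
value 150 is simply the one that worked for us, not a recommendation.)

```conf
# slurm.conf on the head node (illustrative fragment)
# defer: do not try to schedule each job individually at submit time
# max_rpc_cnt: hold off scheduling while slurmctld has this many
#              active RPC threads or more
SchedulerParameters=max_rpc_cnt=150,defer
```

After editing the file, "scontrol reconfigure" should make slurmctld
re-read it, and "scontrol show config" should then report the new
value instead of "(null)".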

HTH,

John DeSantis

On Thu, 11 Jan 2018 16:39:43 -0500

Colas Rivière <a class="moz-txt-link-rfc2396E" href="mailto:riviere@umdgrb.umd.edu">&lt;riviere@umdgrb.umd.edu&gt;</a> wrote:

</pre>

      <blockquote type="cite">

        <pre wrap="">Hello,

I'm managing a small cluster (one head node, 24 workers, 1160 total 

worker threads). The head node has two E5-2680 v3 CPUs

(hyper-threaded), ~100 GB of memory and spinning disks.

The head node occasionally becomes less responsive when there are

more than 10k jobs in the queue, and becomes really unmanageable when

the queue reaches 100k jobs, with error messages such as:

</pre>

        <blockquote type="cite">

          <pre wrap="">sbatch: error: Slurm temporarily unable to accept job, sleeping and 

retrying.

</pre>

        </blockquote>

        <pre wrap="">or

</pre>

        <blockquote type="cite">

          <pre wrap="">Running: slurm_load_jobs error: Socket timed out on send/recv

operation

</pre>

        </blockquote>

        <pre wrap="">Is it normal to experience slowdowns when the queue reaches a few

tens of thousands of jobs? What limit should I expect? Would adding an SSD

for SlurmdSpoolDir help? What can be done to raise this limit?

The cluster runs Slurm 17.02.4 on CentOS 6 and the config is attached 

(from `scontrol show config`).

Thanks,

Colas

</pre>

      </blockquote>

      <pre wrap="">

</pre>

    </blockquote>

    <br>

  </body>

</html>