[slurm-users] Queue size, slow/unresponsive head node

Colas Rivière riviere at umdgrb.umd.edu
Fri Jan 12 12:00:58 MST 2018


Nicholas,

> Why do you have?
> SchedulerParameters     = (null)
I did not set these parameters, so I assume "(null)" means all the 
default values are used.
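
In case it's useful, this is roughly how I check what the scheduler is
actually running with (a quick sketch; the exact output differs between
Slurm versions):

    scontrol show config | grep -i SchedulerParameters   # effective scheduler options
    sdiag                                                 # scheduler and RPC statistics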

John,

Thanks, I'll try that and look into these SchedulerParameters more.
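
If I understood correctly, that would go into slurm.conf on the head
node roughly as below (untested here, and 150 is simply the value you
suggested; as far as I know a reconfigure is enough to pick it up):

    SchedulerParameters=max_rpc_cnt=150,defer

    scontrol reconfigure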

Cheers,
Colas

On 2018-01-12 09:08, John DeSantis wrote:
> Colas,
>
> We had a similar experience a long time ago, and we solved it by adding
> the following SchedulerParameters:
>
> max_rpc_cnt=150,defer
>
> HTH,
> John DeSantis
>
> On Thu, 11 Jan 2018 16:39:43 -0500
> Colas Rivière <riviere at umdgrb.umd.edu> wrote:
>
>> Hello,
>>
>> I'm managing a small cluster (one head node, 24 workers, 1160 total
>> worker threads). The head node has two E5-2680 v3 CPUs
>> (hyper-threaded), ~100 GB of memory and spinning disks.
>> The head node occasionally becomes less responsive when there are
>> more than 10k jobs in the queue, and becomes really unmanageable
>> when reaching 100k jobs in the queue, with error messages such as:
>>> sbatch: error: Slurm temporarily unable to accept job, sleeping and
>>> retrying.
>> or
>>> Running: slurm_load_jobs error: Socket timed out on send/recv
>>> operation
>> Is it normal to experience slowdowns when the queue reaches a few
>> tens of thousands of jobs? What limit should I expect? Would adding
>> an SSD drive for SlurmdSpoolDir help? What can be done to push this
>> limit?
>>
>> The cluster runs Slurm 17.02.4 on CentOS 6 and the config is attached
>> (from `scontrol show config`).
>>
>> Thanks,
>> Colas
