[slurm-users] Queue size, slow/unresponsive head node
Nicholas C Santucci
santucci at uci.edu
Thu Jan 11 22:25:22 MST 2018
Why do you have
SchedulerParameters = (null)
in your config? Is that even allowed?
https://slurm.schedmd.com/sched_config.html
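With a queue that deep, SchedulerParameters is usually where the tuning
happens. As a rough sketch only (the option names are documented on the
page above, but the values here are placeholders you would have to adjust
for your workload), something along these lines in slurm.conf, followed by
"scontrol reconfigure":

# Throttle client RPCs and cap how much of the queue each scheduling
# pass walks, so slurmctld stays responsive. Values are placeholders,
# not recommendations.
SchedulerParameters=defer,max_rpc_cnt=150,default_queue_depth=1000,bf_max_job_test=1000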
On Thu, Jan 11, 2018 at 1:39 PM, Colas Rivière <riviere at umdgrb.umd.edu>
wrote:
> Hello,
>
> I'm managing a small cluster (one head node, 24 workers, 1160 total worker
> threads). The head node has two E5-2680 v3 CPUs (hyper-threaded), ~100 GB
> of memory and spinning disks.
> The head node occasionally becomes less responsive when there are more
> than 10k jobs in the queue, and becomes really unmanageable when the
> queue reaches 100k jobs, with error messages such as:
>
>> sbatch: error: Slurm temporarily unable to accept job, sleeping and
>> retrying.
>>
> or
>
>> Running: slurm_load_jobs error: Socket timed out on send/recv operation
>>
> Is it normal to experience slowdowns when the queue reaches a few tens of
> thousands of jobs? What limit should I expect? Would adding an SSD drive
> for SlurmdSpoolDir help? What can be done to push this limit?
>
> The cluster runs Slurm 17.02.4 on CentOS 6 and the config is attached
> (from `scontrol show config`).
>
> Thanks,
> Colas
>
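On the questions above about queue size and disks: before changing
hardware, it may be worth checking where slurmctld spends its time while
the queue is large. sdiag ships with Slurm and reports scheduler cycle
times, backfill statistics and RPC counts by message type; the commands
below are just the obvious way to sample it:

# Snapshot scheduler/backfill cycle times and RPC counts
sdiag
# Reset the counters, let the queue build up again, then re-check
sdiag --reset
sdiag

That should at least show whether the slowdown is in the main scheduling
loop, in backfill, or in a flood of client RPCs.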
--
Nick Santucci
santucci at uci.edu