[slurm-users] Queue size, slow/unresponsive head node

Nicholas C Santucci santucci at uci.edu
Thu Jan 11 22:25:22 MST 2018

Why do you have:

SchedulerParameters     = (null)

Is that even allowed?
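(For reference: `scontrol show config` prints `(null)` when a parameter is simply unset in slurm.conf, so slurmctld falls back to its built-in defaults. An explicit setting would look something like the sketch below; the parameter names are documented slurm.conf options, but the values here are illustrative, not a recommendation.)

```
# slurm.conf on the head node -- illustrative values only
SchedulerType=sched/backfill
SchedulerParameters=defer,max_rpc_cnt=150,batch_sched_delay=10
```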


On Thu, Jan 11, 2018 at 1:39 PM, Colas Rivière <riviere at umdgrb.umd.edu> wrote:

> Hello,
> I'm managing a small cluster (one head node, 24 workers, 1160 total worker
> threads). The head node has two E5-2680 v3 CPUs (hyper-threaded), ~100 GB
> of memory and spinning disks.
> The head node occasionally becomes less responsive when there are more
> than 10k jobs in the queue, and becomes really unmanageable when reaching
> 100k jobs in the queue, with error messages such as:
>> sbatch: error: Slurm temporarily unable to accept job, sleeping and
>> retrying.
> or
>> Running: slurm_load_jobs error: Socket timed out on send/recv operation
> Is it normal to experience slowdowns when the queue reaches a few tens of
> thousands of jobs? What limit should I expect? Would adding an SSD drive
> for SlurmdSpoolDir help? What can be done to raise this limit?
> The cluster runs Slurm 17.02.4 on CentOS 6 and the config is attached
> (from `scontrol show config`).
> Thanks,
> Colas
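(One note on the SSD question: SlurmdSpoolDir is used by slurmd on the compute nodes, so moving it to an SSD would not help the head node. The slurmctld daemon on the head node does its state I/O under StateSaveLocation, and the default MaxJobCount is 10000, which lines up with the queue depth where slowdowns start. A sketch of settings commonly tuned for large queues follows; the path and values are hypothetical examples, not tested recommendations.)

```
# slurm.conf excerpt -- hypothetical tuning sketch for large queues
StateSaveLocation=/ssd/slurm/state      # keep slurmctld state on fast storage
SchedulerParameters=defer,max_rpc_cnt=150
MaxJobCount=200000                      # must exceed the expected queue depth
```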

Nick Santucci
santucci at uci.edu
