[slurm-users] Queue size, slow/unresponsive head node

John DeSantis desantis at usf.edu
Fri Jan 12 07:08:01 MST 2018


Colas,

We had a similar experience a long time ago, and we solved it by adding
the following SchedulerParameters:

max_rpc_cnt=150,defer

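For context, here is a minimal sketch of how that looks in slurm.conf (the SchedulerType line is just a typical backfill setup shown for completeness; the SchedulerParameters line is the only change we actually made, and 150 is the value that worked for us, not a universal recommendation):

    # slurm.conf (excerpt)
    # defer: skip the per-job scheduling attempt at submit time
    # max_rpc_cnt: back off from scheduling passes while this many
    #              slurmctld RPC threads are already active
    SchedulerType=sched/backfill
    SchedulerParameters=max_rpc_cnt=150,defer

    # push the change to a running slurmctld:
    #   scontrol reconfigure

The idea is to keep slurmctld responsive to client RPCs (sbatch, squeue, etc.) when submissions come in bursts, at the cost of slightly less aggressive scheduling.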
HTH,
John DeSantis

On Thu, 11 Jan 2018 16:39:43 -0500
Colas Rivière <riviere at umdgrb.umd.edu> wrote:

> Hello,
> 
> I'm managing a small cluster (one head node, 24 workers, 1160 total 
> worker threads). The head node has two E5-2680 v3 CPUs
> (hyper-threaded), ~100 GB of memory and spinning disks.
> The head node becomes occasionally less responsive when there are
> more than 10k jobs in queue, and becomes really unmanageable when
> reaching 100k jobs in queue, with error messages such as:
> > sbatch: error: Slurm temporarily unable to accept job, sleeping and 
> > retrying.
> or
> > Running: slurm_load_jobs error: Socket timed out on send/recv
> > operation
> Is it normal to experience slowdowns when the queue reaches a few
> tens of thousands of jobs? What limit should I expect? Would adding
> an SSD drive for SlurmdSpoolDir help? What can be done to push this
> limit?
> 
> The cluster runs Slurm 17.02.4 on CentOS 6 and the config is attached 
> (from `scontrol show config`).
> 
> Thanks,
> Colas



