[slurm-users] Queue size, slow/unresponsive head node
John DeSantis
desantis at usf.edu
Fri Jan 12 07:08:01 MST 2018
Colas,
We had a similar experience a long time ago, and we solved it by adding
the following SchedulerParameters:
max_rpc_cnt=150,defer
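
For reference, SchedulerParameters is a single comma-separated list in
slurm.conf, so the line ends up looking roughly like the one below (keep
any other options you already set on the same line; 150 is just the
value that worked for our head node):

    SchedulerParameters=max_rpc_cnt=150,defer

defer stops slurmctld from trying to schedule each job individually at
submit time, and max_rpc_cnt makes it back off scheduling while that
many RPC/server threads are already active. A `scontrol reconfigure`
picks up the change, and comparing `sdiag` output before and after
should tell you whether the threshold suits your load.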
HTH,
John DeSantis
On Thu, 11 Jan 2018 16:39:43 -0500
Colas Rivière <riviere at umdgrb.umd.edu> wrote:
> Hello,
>
> I'm managing a small cluster (one head node, 24 workers, 1160 total
> worker threads). The head node has two E5-2680 v3 CPUs
> (hyper-threaded), ~100 GB of memory and spinning disks.
> The head node occasionally becomes less responsive when there are
> more than 10k jobs in the queue, and becomes really unmanageable
> when the queue reaches 100k jobs, with error messages such as:
> > sbatch: error: Slurm temporarily unable to accept job, sleeping and
> > retrying.
> or
> > Running: slurm_load_jobs error: Socket timed out on send/recv
> > operation
> Is it normal to experience slowdowns when the queue reaches just a
> few tens of thousands of jobs? What limit should I expect? Would
> adding an SSD drive for SlurmdSpoolDir help? What can be done to
> push this limit?
>
> The cluster runs Slurm 17.02.4 on CentOS 6 and the config is attached
> (from `scontrol show config`).
>
> Thanks,
> Colas