[slurm-users] Queue size, slow/unresponsive head node
riviere at umdgrb.umd.edu
Fri Jan 12 12:00:58 MST 2018
> Why do you have?
> SchedulerParameters = (null)
I did not set these parameters, so I assume "(null)" means all the
default values are used.
Thanks, I'll try that, and look into these SchedulerParameters more.
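(For anyone reading the archive: the exact parameters from John's reply were lost with the scrubbed attachment. A throughput-oriented slurm.conf line along the following lines is often suggested for this symptom; the specific options and values here are illustrative assumptions, not his settings.)

```
# slurm.conf -- illustrative throughput-tuning sketch, NOT the settings from the reply
# defer:             don't try to start jobs at submit time; batch them up
# max_rpc_cnt:       throttle the scheduler when many RPCs are pending
# batch_sched_delay: how long (seconds) batch job scheduling may be deferred
SchedulerParameters=defer,max_rpc_cnt=150,batch_sched_delay=10
```

After changing SchedulerParameters, `scontrol reconfigure` applies them without restarting slurmctld.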
On 2018-01-12 09:08, John DeSantis wrote:
> We had a similar experience a long time ago, and we solved it by adding
> the following SchedulerParameters:
> John DeSantis
> On Thu, 11 Jan 2018 16:39:43 -0500
> Colas Rivière <riviere at umdgrb.umd.edu> wrote:
>> I'm managing a small cluster (one head node, 24 workers, 1160 total
>> worker threads). The head node has two E5-2680 v3 CPUs
>> (hyper-threaded), ~100 GB of memory and spinning disks.
>> The head node becomes occasionally less responsive when there are
>> more than 10k jobs in queue, and becomes really unmanageable when
>> reaching 100k jobs in queue, with error messages such as:
>>> sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
>>> slurm_load_jobs error: Socket timed out on send/recv operation
>> Is it normal to experience slowdowns when the queue reaches a few
>> tens of thousands of jobs? What limit should I expect? Would adding
>> an SSD drive for SlurmdSpoolDir help? What can be done to push this
>> limit?
>> The cluster runs Slurm 17.02.4 on CentOS 6 and the config is attached
>> (from `scontrol show config`).
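A side note for anyone debugging this kind of slowdown: Slurm ships a diagnostic tool, `sdiag`, that reports scheduler and backfill cycle times plus per-RPC counts, which helps show whether the controller is saturated before and after tuning. A sketch of a measurement session on the head node (command names are standard Slurm; the workflow is only a suggestion):

```
sdiag -r       # reset the counters to start a clean measurement window
# ... let the queue run for a while ...
sdiag          # look at main/backfill scheduler cycle times and RPC counts
scontrol show config | grep -i SchedulerParameters   # confirm active settings
```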