[slurm-users] SLURM 17.02.9 slurmctld unresponsive with server_thread_count over limit, waiting in syslog

Wed Nov 8 13:11:32 MST 2017

Hi all,

I see SLURM 17.02.9 slurmctld hang or become unresponsive every few days
with this message in syslog:

server_thread_count over limit (256), waiting

I believe that from the users' perspective this shows up as "Socket timed
out on send/recv operation". Slurmctld never seems to recover once it's in
this state, and it will not respond to /etc/init.d/slurm restart. It only
snaps back after an admin does a kill -9 and restarts slurmctld.

I don't see anything else in the logs that looks like an error message that
would help diagnose what is going on, even with log level debug3 on the
SLURM controller daemon.

I monitor CPU and memory utilization with "htop" on the machine running the
controller daemon and it doesn't seem like it's overwhelmed by slurmctld
load or anything like that.

The machine running the controller daemon seems adequate for the size of
our cluster. It's a repurposed Dell PowerEdge R410 with 24 threads and
32 GB of physical memory. Unless I'm way off?

I tried all kinds of SchedulerParameters tweaks on sched/backfill and even
set the scheduler back to sched/builtin, and it's still happening. None of
that seemed to affect the frequency much, either.

Any thoughts on what could be causing SLURM to spawn so many threads and
hang?

Our cluster is medium-sized; we probably have a few thousand jobs in the
queue on average at any given time.

Monitoring with sdiag, I see that the max cycle time of the main scheduler
never exceeds 2 seconds. This seems reasonable?
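
In case it helps anyone reproduce the monitoring, here is a minimal Python
sketch of how the sdiag numbers could be pulled out for logging. It assumes
sdiag prints a "Server thread count:" line and a "Max cycle:" line (in
microseconds) under a "Main schedule statistics" heading; the SAMPLE text
and the parse_sdiag name are made up for illustration, and the exact layout
may differ between SLURM versions.

```python
import re

# Hypothetical, abbreviated sample; real sdiag output has many more fields.
SAMPLE = """\
Server thread count: 3
Agent queue size:    0

Main schedule statistics (microseconds):
	Last cycle:   1520
	Max cycle:    1876543
	Total cycles: 4021
"""


def parse_sdiag(text):
    """Return (server_thread_count, main_max_cycle_seconds) from sdiag text.

    Either value is None if the corresponding line is not found.
    """
    threads = re.search(r"Server thread count:\s*(\d+)", text)
    # Match the first "Max cycle" after the main scheduler heading so that
    # a later backfill "Max cycle" line is not picked up by mistake.
    main = re.search(
        r"Main schedule statistics.*?Max cycle:\s*(\d+)", text, re.DOTALL
    )
    thread_count = int(threads.group(1)) if threads else None
    # Assumed unit: microseconds, converted here to seconds.
    max_cycle_s = int(main.group(1)) / 1_000_000 if main else None
    return thread_count, max_cycle_s


if __name__ == "__main__":
    count, max_cycle = parse_sdiag(SAMPLE)
    print(f"server threads: {count}, main max cycle: {max_cycle} s")
```

On the controller, the text could come from
subprocess.run(["sdiag"], capture_output=True, text=True).stdout instead of
SAMPLE, polled every minute or so, to see how the thread count behaves as it
approaches the 256 limit from the syslog message.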

Thanks,

Sean