<div dir="ltr"><div>Hi all,</div><div><br></div><div>Our slurmctld (SLURM 17.02.9) hangs or becomes unresponsive every few days, with this message in syslog:</div><div><br></div><div>server_thread_count over limit (256), waiting</div><div><br></div><div>From the user's perspective, I believe this shows up as "Socket timed out on send/recv operation". Slurmctld never seems to recover once it's in this state, and it does not respond to /etc/init.d/slurm restart. It only comes back after an admin does a kill -9 and restarts slurmctld.</div><div><br></div><div>I don't see anything else in the logs that looks like an error message or would help diagnose what is going on, even with the controller daemon's log level set to debug3.</div><div><br></div><div>I monitor CPU and memory utilization with "htop" on the machine running the controller daemon, and it doesn't appear to be overwhelmed by slurmctld load or anything like that.</div><div><br></div><div>The machine running the controller daemon seems adequate for the size of our cluster: it's a repurposed Dell PowerEdge R410 with 24 threads and 32 GB of physical memory. Unless I'm way off?</div><div><br></div><div>I have tried all kinds of SchedulerParameters tweaks with sched/backfill, and even set the scheduler back to sched/builtin, and it's still happening. None of this seemed to affect the frequency much, either.</div><div><br></div><div>Any thoughts on what could be causing SLURM to spawn so many threads and hang?</div><div><br></div><div>Our cluster is medium-sized; we probably have a few thousand jobs in the queue on average at any given time.</div><div><br></div><div>According to sdiag, the max cycle time of the main scheduler never exceeds 2 seconds. Does that seem reasonable?</div><div><br></div><div>Thanks,</div><div><br></div><div>Sean</div><div><br></div></div>