[slurm-users] SLURM 17.02.9 slurmctld unresponsive with server_thread_count over limit, waiting in syslog

Sean Caron scaron at umich.edu
Wed Nov 8 15:06:58 MST 2017


Thanks, Paul. I've toggled SchedulerParameters=defer,... in and out of the
configuration per suggestions in various SLURM bug tracker threads, but that
was probably while we were still focused on getting sched/backfill to play
ball. I will try it again now that we're back on sched/builtin and see if
that helps. If we still see issues, I'll look at a potential slurmdbd
performance bottleneck and try some of the MySQL tuning suggestions I've
seen in the docs.
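For reference, the slurm.conf knobs I'm planning to toggle look roughly like
the sketch below. This isn't our exact config; max_rpc_cnt is another
scheduler parameter that appears related to the thread count limit, and the
value shown is only a placeholder, not a recommendation:

    # slurm.conf (scheduler-related settings only)
    SchedulerType=sched/builtin
    # defer: don't try to start jobs immediately at submit time
    # max_rpc_cnt: back off scheduling while this many RPCs/threads are
    #   outstanding (placeholder value)
    SchedulerParameters=defer,max_rpc_cnt=150

followed by an "scontrol reconfigure" so slurmctld picks up the change
without a full restart.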
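On the MySQL side, the tuning I had in mind is roughly the innodb settings
the accounting docs suggest; the numbers below are placeholders that would
need to be sized to our DB host:

    # my.cnf on the slurmdbd/MySQL host, [mysqld] section
    [mysqld]
    innodb_buffer_pool_size=1024M
    innodb_log_file_size=64M
    innodb_lock_wait_timeout=900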

Best,

Sean


On Wed, Nov 8, 2017 at 3:57 PM, Paul Edmon <pedmon at cfa.harvard.edu> wrote:

> So hangups like this can occur due to the slurmdbd being busy with
> requests.  I've seen that happen when an ill-timed massive sacct request
> hits while slurmdbd is doing its rollup.  In that case the slurmctld hangs
> while slurmdbd is busy, and typically restarting mysql/slurmdbd
> seems to fix the issue.
>
> Otherwise this can happen due to massive traffic to the slurmctld.  You
> can try using the defer option in SchedulerParameters.  That slows down
> the scheduler so it can handle the additional load.
>
> -Paul Edmon-
>
>
>
> On 11/8/2017 3:11 PM, Sean Caron wrote:
>
>> Hi all,
>>
>> I see SLURM 17.02.9 slurmctld hang or become unresponsive every few days
>> with the message in syslog:
>>
>> server_thread_count over limit (256), waiting
>>
>> I believe that from the user's perspective this shows up as "Socket timed
>> out on send/recv operation". Slurmctld never seems to recover once it's in
>> this state and will not respond to /etc/init.d/slurm restart. Only after an
>> admin does a kill -9 and restarts slurmctld does it snap back.
>>
>> I don't see anything else in the logs that looks like an error message
>> that would help diagnose what is going on, even with log level debug3 on
>> the SLURM controller daemon.
>>
>> I monitor CPU and memory utilization with "htop" on the machine running
>> the controller daemon and it doesn't seem like it's overwhelmed by
>> slurmctld load or anything like that.
>>
>> The machine running the controller daemon seems reasonably sized for the
>> task, given the size of our cluster. It's a repurposed Dell PowerEdge R410
>> with 24 threads and 32 GB of physical memory. Unless I'm way off?
>>
>> I tried all kinds of SchedulerParameters tweaks with sched/backfill and
>> even set the scheduler back to sched/builtin, and it's still happening.
>> None of that seemed to affect the frequency much, either.
>>
>> Any thoughts on what could be causing SLURM to spawn so many threads and
>> hang?
>>
>> Our cluster is medium-sized; we probably have a few thousand jobs in the
>> queue at any given time, on average.
>>
>> Monitoring with sdiag, the max cycle time of the main scheduler never
>> cracks 2 seconds. That seems reasonable?
>>
>> Thanks,
>>
>> Sean
>>
>>
>
>