[slurm-users] Slurmd Stops responding with MAX_THREADS message logged

Grant Campbell grant.campbell at mythic-ai.com
Thu Sep 24 19:45:19 UTC 2020


About once a day, one or more Slurmd daemons running in our cluster
stop accepting new jobs, and they recover only when Slurmd is
restarted.  The nodes are marked as "down", with the reason given as
"not responding".  We are running version 20.02.0. Right when this
issue occurs, the Slurmd process logs the following message:
[2020-09-21T10:03:35.480] active_threads == MAX_THREADS(256)
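
To see how often this happens and line the timestamps up with when nodes were marked down, a small log scan may help. This is only a sketch: it assumes the log lines have exactly the bracketed-timestamp format shown above, and the function name find_max_threads_events is made up for illustration.

```python
import re
from datetime import datetime

# Matches the message quoted above. The limit is captured rather than
# hard-coded, in case a build reports a value other than 256.
MAX_THREADS_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\]\s+"
    r"active_threads == MAX_THREADS\((?P<limit>\d+)\)"
)


def find_max_threads_events(lines):
    """Return a (timestamp, limit) pair for every MAX_THREADS line found."""
    events = []
    for line in lines:
        match = MAX_THREADS_RE.match(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
            events.append((ts, int(match.group("limit"))))
    return events
```

The function accepts any iterable of strings, so an open slurmd log file can be passed to it directly.
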
If you strace the Slurmd process, it appears to be waiting on a futex system call:
strace -p 577918 -t -vv
strace: Process 577918 attached
11:51:15 futex(0x63a98c, FUTEX_WAIT_PRIVATE, 0, NULL
Reading through the source, it seems that a mutex operation runs when
this message is logged, so I'm curious whether Slurm could be getting
stuck acquiring or releasing a lock. Has anyone encountered this before?
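
To make the suspected failure mode concrete, here is a minimal sketch of a thread limit built from a counter, a mutex, and a condition variable. It is not Slurm's code: the ThreadLimiter class, its method names, and the small limit of 4 are all hypothetical, and it only assumes that slurmd caps its worker threads in a broadly similar way. If workers holding slots never release them, every new request blocks in the wait. On Linux, a condition wait with no timeout typically shows up in strace as a futex call with FUTEX_WAIT_PRIVATE and a NULL timeout.

```python
import threading

MAX_THREADS = 4  # illustrative only; the log message above reports 256


class ThreadLimiter:
    """Counter guarded by a lock, with a condition variable for waiters."""

    def __init__(self, limit):
        self.limit = limit
        self.active = 0
        self.cond = threading.Condition()  # owns the underlying lock

    def acquire_slot(self, timeout=None):
        """Wait for a free slot; return False if the timeout expires first."""
        with self.cond:
            if not self.cond.wait_for(lambda: self.active < self.limit, timeout):
                return False
            self.active += 1
            return True

    def release_slot(self):
        """Give a slot back and wake one waiter."""
        with self.cond:
            self.active -= 1
            self.cond.notify()


limiter = ThreadLimiter(MAX_THREADS)

# Simulate workers that take every slot and then hang without releasing.
for _ in range(MAX_THREADS):
    limiter.acquire_slot()

# A new request cannot get a slot. With timeout=None it would wait forever.
stuck = limiter.acquire_slot(timeout=0.1)

# Once any worker releases its slot, the next request goes through.
limiter.release_slot()
recovered = limiter.acquire_slot(timeout=0.1)
```

In this sketch the waiter is not stuck on the lock itself. It is waiting for a slot that is never returned, which would point at the hung worker threads rather than at the thread that logged the message.
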

Any help is much appreciated!
