[slurm-users] Agent message queue growth bug: some analysis, what do you think?

Conrad Herrmann cherrmann at zoox.com
Fri Aug 16 19:30:19 UTC 2019


Folks,

We have a Slurm cluster (version 18.06-2) with many nodes, and are
frequently running into the “agent message queue gets longer” issue.  After
reviewing bug reports with similar symptoms (like
https://bugs.schedmd.com/show_bug.cgi?id=5147), and studying the code, I've
come to the conclusion that there is an actual bug in the codebase.  It
doesn't appear to be fixed in 19.05.

Since I'm personally new to Slurm, I thought I'd post and see if anyone who
knows more than me can weigh in.

Here is our scenario:
1. A lot of jobs (hundreds?) go into COMPLETING state within a short period
of time.
2. The number of messages in the agent message queue (retry_list) keeps
increasing until it reaches the thousands and never recovers (so we reboot
the controller).
3. The number of agent threads (as reported by sdiag) is never very big.
Sometimes it is just 2 or 4; at most it is around 35.  In no case does the
number of agent threads reflect the algorithm in the code, which allows up
to (256 minus the number of RPC threads minus 10 or 12) threads to be
spawned to dispatch messages when there are hundreds of messages to send.
4. The total number of messages actually delivered to the nodes (based on
the debug logs) is around 600 per minute (not surprising given only a few
threads sending the messages).
5. The scheduler wakes up every minute and enqueues TERMINATE_JOB messages
for all the jobs in COMPLETING state to remind the nodes to kill them (or
report that they are done).  There are hundreds of these, so if the number
of messages being enqueued exceeds the number of messages being delivered,
the queue just keeps growing (see the rough numbers after this list).
6.  Eventually the queue is mostly TERMINATE_JOB messages, and all other
control messages (e.g. START JOB) are a minority of the queue.  So jobs
don't start.
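
To put rough, purely illustrative numbers on points 4 and 5: if, say, 800
jobs are stuck in COMPLETING, the scheduler enqueues roughly 800
TERMINATE_JOB messages per minute while only about 600 messages of any kind
get delivered, so the backlog grows by about

  800 enqueued/min - 600 delivered/min = 200 messages/min

which gets you to the thousands we observe in well under an hour.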

Looking at the background (bug database, support requests), this seems to
be a well-understood scenario; however, the triggering events all seem to
be attributed to other bugs (e.g.
https://bugs.schedmd.com/show_bug.cgi?id=5111), so the agent queue problem
itself remains undiagnosed.


With that as introduction, here is what I see happening:

In agent.c, we have the main agent loop (with a lot of code elided, marked
with ...):

/* Start a thread to manage queued agent requests */
static void *_agent_init(void *arg)
{
  int min_wait;
  bool mail_too;
  struct timespec ts = {0, 0};

  while (true) {
    slurm_mutex_lock(&pending_mutex);
    while (!... && (pending_wait_time == NO_VAL16)) {
      ts.tv_sec  = time(NULL) + 2;
      slurm_cond_timedwait(&pending_cond, &pending_mutex,
        &ts);
    }
    ...
    min_wait = pending_wait_time;
    pending_mail = false;
    pending_wait_time = NO_VAL16;
    slurm_mutex_unlock(&pending_mutex);

    _agent_retry(min_wait, mail_too);
  }
  ...
}

You can think of the above as a "consumer" loop in a "producer-consumer"
pattern, where we wait until the condition variable (pending_cond) is
signaled, get a message from the queue and dispatch it, and go again.

The producer side of the pattern looks like this:

void agent_queue_request(agent_arg_t *agent_arg_ptr)
{
   ...
    list_append(retry_list, (void *)queued_req_ptr);
   ...
    agent_trigger(999, false);
}

and

extern void agent_trigger(int min_wait, bool mail_too)
{
  slurm_mutex_lock(&pending_mutex);
  if ((pending_wait_time == NO_VAL16) ||
       (pending_wait_time >  min_wait))
    pending_wait_time = min_wait;
  if (mail_too)
    pending_mail = mail_too;
  slurm_cond_broadcast(&pending_cond);
  slurm_mutex_unlock(&pending_mutex);
}
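
For reference, the textbook shape of this producer-consumer pattern (a toy,
self-contained pthreads example -- nothing Slurm-specific) keeps the
consumer's wait predicate tied to the queue itself, so the consumer drains
everything that is queued before going back to sleep:

/* toy.c -- build with: gcc -pthread -o toy toy.c
 * The consumer waits on "queued == 0", i.e. the wait predicate is the
 * queue itself, so it cannot go back to sleep while work is pending. */
#include <pthread.h>
#include <stdio.h>

#define NITEMS 1000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int queued = 0;                 /* number of items waiting */

static void *producer(void *arg)
{
  for (int i = 0; i < NITEMS; i++) {
    pthread_mutex_lock(&lock);
    queued++;                          /* "enqueue" one item */
    pthread_cond_signal(&cond);        /* like agent_trigger() */
    pthread_mutex_unlock(&lock);
  }
  return NULL;
}

static void *consumer(void *arg)
{
  int handled = 0;
  while (handled < NITEMS) {
    pthread_mutex_lock(&lock);
    while (queued == 0)                /* predicate == queue state */
      pthread_cond_wait(&cond, &lock);
    queued--;                          /* "dequeue" one item */
    pthread_mutex_unlock(&lock);
    handled++;                         /* "dispatch" it */
  }
  pthread_mutex_lock(&lock);
  printf("handled %d items, %d still queued\n", handled, queued);
  pthread_mutex_unlock(&lock);
  return NULL;
}

int main(void)
{
  pthread_t p, c;
  pthread_create(&c, NULL, consumer, NULL);
  pthread_create(&p, NULL, producer, NULL);
  pthread_join(p, NULL);
  pthread_join(c, NULL);
  return 0;
}

Keep the shape of that inner "while (queued == 0)" loop in mind for what
follows.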

Here is the problem:  the consumer side (the _agent_init() loop) consumes
only one message each time around the loop (that's what _agent_retry()
does), regardless of how many messages are in the queue.

However, it is possible that more than one message gets added to the queue
during that time.  For example, suppose _agent_init() is waiting on the
slurm_cond_timedwait(), and the scheduler thread enqueues a lot of
TERMINATE_JOB messages.  Of course enqueuing those messages signals the
pending condition, but that doesn't guarantee when the _agent_init() thread
wakes up.  So many messages could be added to the queue before the
slurm_cond_timedwait() returns.

Only one message will be dispatched by _agent_retry(), regardless of how
many were added by the scheduler.

In _agent_init(), the last thing we do before unlocking the mutex is to
reset pending_wait_time to NO_VAL16.  The only other place
pending_wait_time is set to another value is in agent_trigger.  If no other
thread enqueues messages while _agent_retry() is running, the value of
pending_wait_time will still be NO_VAL16 when we return to the top of the
loop, which means we will always enter the slurm_cond_timedwait() and wait
for the next signal--even if there are still messages in the queue.
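
Put differently, leaving the elided terms aside, what lets the consumer out
of that inner wait loop is a one-shot flag that we ourselves reset:

  (pending_wait_time != NO_VAL16)  /* set by agent_trigger(), reset by _agent_init() */

whereas the condition we actually care about is something more like:

  (list_count(retry_list) > 0)     /* is there still work queued? */

(list_count() is the generic list helper in src/common/list.c; I haven't
checked whether it can be called safely at this point with respect to
retry_mutex, so treat this as the idea rather than a patch.)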

But the next signal will arrive only when a new message is enqueued by some
other thread.  That might take a while, but it will probably happen
eventually, since the system is constantly passing messages around.
However, the queue will never really empty out--roughly speaking, one
message gets sent only when another message is enqueued.

TLDR: The message queue keeps growing because the consumer side of the
producer-consumer pattern is not guaranteed to consume all messages that
are enqueued before it returns to waiting.
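
For what it's worth, here is a minimal sketch of the kind of change I have
in mind for the _agent_init() loop.  It is untested, and it assumes that a
helper along the lines of the existing retry_list_size() (i.e.
list_count(retry_list) under retry_mutex) can be called here without
creating a lock-ordering problem, which I have not verified:

/* Sketch only -- untested.  The idea: also treat a non-empty retry_list
 * as "work pending", so _agent_init() keeps calling _agent_retry() until
 * the backlog drains, instead of waiting for the next agent_trigger(). */
while (true) {
  slurm_mutex_lock(&pending_mutex);
  while (!... && (pending_wait_time == NO_VAL16) &&
         (retry_list_size() == 0)) {      /* <-- the added check */
    ts.tv_sec  = time(NULL) + 2;
    slurm_cond_timedwait(&pending_cond, &pending_mutex, &ts);
  }
  ...
  /* pending_wait_time can still be NO_VAL16 here, because we may have
   * woken only because the queue is non-empty; fall back to a default
   * interval in that case. */
  min_wait = (pending_wait_time == NO_VAL16) ?
             RPC_RETRY_INTERVAL : pending_wait_time;
  pending_mail = false;
  pending_wait_time = NO_VAL16;
  slurm_mutex_unlock(&pending_mutex);

  _agent_retry(min_wait, mail_too);
}

The alternative would be for _agent_retry() itself to loop until the queue
is empty (or it hits the thread limit), but the predicate change above
looked like the smaller diff to me.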

P.S. I'm aware that _agent_retry is a rather complicated piece of code and
it is possible that there are messages in that queue that should not be
dispatched yet, but that is kind of a different issue.

P.P.S. What's rather telling is this comment in controller.c:2135:
/* Process any pending agent work */
agent_trigger(RPC_RETRY_INTERVAL, true);
That suggests that at one time, someone thought that agent_trigger() was
supposed to clear out the queue.  Was that true 12 years ago but no longer
true now?

Thanks for your attention.  If you think it is fruitful I'll submit a PR,
but I'd welcome hearing whether you think I've got something wrong.

Cheers,
Conrad Herrmann
Sr. Staff Engineer
Zoox