[slurm-users] sacct issue: jobs staying in "RUNNING" state
Brian W. Johanson
bjohanso at psc.edu
Wed Jul 17 14:44:23 UTC 2019
On 7/17/19 12:26 AM, Chris Samuel wrote:
> On 16/7/19 11:43 am, Will Dennis wrote:
>
>> [2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full (20140),
>> discarding DBD_STEP_START:1442 request
>
> So it looks like your slurmdbd cannot keep up with the rate of these incoming
> steps and is having to throw away messages.
>
>> [2019-07-16T09:40:27.515] error: slurmdbd: agent queue filling (20140),
>> RESTART SLURMDBD NOW
>
> Have you tried doing what it told you to?
>
> You may want to look at the performance of your MySQL server to see if it's
> failing to keep up with what slurmdbd is asking it to do.
>
> All the best,
> Chris
Once you have the database performance issues addressed, sacctmgr can clean up
the entries for completed jobs listed as running.

    sacctmgr show runawayjobs    ("list" works in place of "show")

From the sacctmgr man page:

    RunawayJobs
        Used only with the list or show command to report current jobs
        that have been orphaned on the local cluster and are now runaway.
        If there are jobs in this state it will also give you an option
        to "fix" them.
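
Before running the cleanup, it can help to confirm that slurmctld has stopped logging the agent queue errors quoted above. Here is a minimal Python sketch for that, assuming the messages keep the format shown in Will's log excerpt and that each message sits on a single line in the real log file (the excerpt above is wrapped by the mail client). The names AGENT_QUEUE_RE, QueueEvent and find_agent_queue_events are made up for illustration and are not part of Slurm.

```python
import re
from typing import Iterable, List, NamedTuple

# Pattern derived from the two log lines quoted in this thread, e.g.
# "[2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full (20140), ..."
# Assumption: other Slurm versions may word these messages differently.
AGENT_QUEUE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: slurmdbd: agent queue "
    r"(?P<kind>is full|filling) \((?P<size>\d+)\)"
)


class QueueEvent(NamedTuple):
    timestamp: str  # timestamp text exactly as it appears in the log
    kind: str       # "is full" or "filling"
    size: int       # queue size reported in parentheses


def find_agent_queue_events(lines: Iterable[str]) -> List[QueueEvent]:
    """Return every slurmdbd agent queue error found in the given log lines."""
    events = []
    for line in lines:
        match = AGENT_QUEUE_RE.match(line.strip())
        if match:
            events.append(
                QueueEvent(match["ts"], match["kind"], int(match["size"]))
            )
    return events


# Example use. The log path is an assumption; check SlurmctldLogFile in
# slurm.conf for the real location on your system.
#
# with open("/var/log/slurm/slurmctld.log", errors="replace") as fh:
#     events = find_agent_queue_events(fh)
# print(events[-1] if events else "no agent queue errors found")
```

If the most recent event is well in the past and no new ones appear, slurmdbd is probably keeping up again, and that is a reasonable point to look at the runaway jobs.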