[slurm-users] sacct issue: jobs staying in "RUNNING" state

Brian W. Johanson bjohanso at psc.edu
Wed Jul 17 14:44:23 UTC 2019


On 7/17/19 12:26 AM, Chris Samuel wrote:
> On 16/7/19 11:43 am, Will Dennis wrote:
>
>> [2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full (20140), 
>> discarding DBD_STEP_START:1442 request
>
> So it looks like your slurmdbd cannot keep up with the rate of these incoming 
> steps and is having to throw away messages.
>
>> [2019-07-16T09:40:27.515] error: slurmdbd: agent queue filling (20140), 
>> RESTART SLURMDBD NOW
>
> Have you tried doing what it told you to?
>
> You may want to look at the performance of your MySQL server to see if it's 
> failing to keep up with what slurmdbd is asking it to do.
>
> All the best,
> Chris

Once you have the database performance issues addressed, sacctmgr can clean up 
the entries for completed jobs that are still listed as running:
'sacctmgr show runawayjobs' ('list' is equivalent to 'show').
From the sacctmgr man page:

RunawayJobs
    Used only with the list or show command to report current jobs that
    have been orphaned on the local cluster and are now runaway.  If there
    are jobs in this state it will also give you an option to "fix" them.
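
As a rough sketch of that sequence (not a tested procedure), the two helpers below first count the "agent queue" errors quoted earlier in a log file, then list the runaway jobs. The function names are made up for illustration, and I am assuming the errors are written to the slurmctld log; the path to that log depends on your slurm.conf.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of Slurm.

# Count log lines that mention the slurmdbd agent queue (the "queue is full"
# and "queue filling" errors). $1 is the log file to scan.
count_dbd_queue_errors() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are no matches, so "|| true" keeps callers using "set -e" alive.
    grep -c 'slurmdbd: agent queue' "$1" || true
}

# List runaway jobs. Per the man page text above, sacctmgr itself offers
# to "fix" them if any are found.
list_runaway_jobs() {
    if command -v sacctmgr >/dev/null 2>&1; then
        sacctmgr show runawayjobs
    else
        echo "sacctmgr not found in PATH" >&2
        return 127
    fi
}
```

You would call count_dbd_queue_errors with the path to your slurmctld log and, once new errors have stopped appearing, run list_runaway_jobs on a host where sacctmgr is available.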



