[slurm-users] Runaway jobs issue, slurm 17.11.3

Christopher Benjamin Coffey Chris.Coffey at nau.edu
Mon Apr 23 17:24:10 MDT 2018


Hi, we currently have a large number of runaway jobs that we cannot clear:

sacctmgr show runaway|wc -l
sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable

58588

Has anyone run into this? We've tried restarting slurmdbd, slurmctld, mysql, etc., but it does not help.

This all started last week when Slurm crashed after being seriously hammered by a user submitting 500K 2-minute jobs. slurmdbd appeared unable to handle all the transactions that slurmctld was sending it:

...
[2018-04-15T20:16:35.021] slurmdbd: agent queue size 100
[2018-04-16T11:54:29.312] slurmdbd: agent queue size 200
[2018-04-18T17:53:22.339] slurmdbd: agent queue size 19100
[2018-04-18T17:59:58.413] slurmdbd: agent queue size 64100
[2018-04-18T18:06:10.143] slurmdbd: agent queue size 104300
...
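
To put a number on how quickly the queue backed up, a minimal Python sketch along these lines could pull the queue sizes out of the excerpt above and compute the growth between consecutive entries. The function names and the regular expression are my own illustrative assumptions, matched only against the log lines quoted here, and are not part of Slurm.

```python
import re
from datetime import datetime

# Assumed pattern, based only on the lines quoted above, e.g.
# "[2018-04-18T17:53:22.339] slurmdbd: agent queue size 19100"
QUEUE_RE = re.compile(r"\[(\S+)\] slurmdbd: agent queue size (\d+)")


def parse_queue_sizes(lines):
    """Return (timestamp, queue_size) pairs for each matching log line."""
    samples = []
    for line in lines:
        match = QUEUE_RE.search(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            samples.append((ts, int(match.group(2))))
    return samples


def growth_per_minute(samples):
    """Queue growth between consecutive samples, in entries per minute."""
    rates = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        minutes = (t1 - t0).total_seconds() / 60
        if minutes > 0:  # skip entries that share a timestamp
            rates.append((t1, (s1 - s0) / minutes))
    return rates


# The five log lines quoted in this message.
LOG = """\
[2018-04-15T20:16:35.021] slurmdbd: agent queue size 100
[2018-04-16T11:54:29.312] slurmdbd: agent queue size 200
[2018-04-18T17:53:22.339] slurmdbd: agent queue size 19100
[2018-04-18T17:59:58.413] slurmdbd: agent queue size 64100
[2018-04-18T18:06:10.143] slurmdbd: agent queue size 104300
"""

samples = parse_queue_sizes(LOG.splitlines())
for ts, rate in growth_per_minute(samples):
    print(f"{ts:%Y-%m-%d %H:%M}  {rate:12.1f} entries/min")
```

Run against the full log rather than this excerpt, the same loop would show when the backlog started growing faster than slurmdbd could drain it.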

...
[2018-04-18T18:20:37.597] error: slurmdbd: agent queue filling (200214), RESTART SLURMDBD NOW
...

...
error: slurmdbd: Sending fini msg: No error
...

So at this point, many of our nodes are idle, but Slurm is not starting jobs. I suspect slurmctld is confused and thinks all of those runaway jobs are still running.

I see that there is a fix for runaway jobs in version 17.11.5:

-- sacctmgr - fix runaway jobs identification.

Thinking about upgrading to see if this will fix our issue.

Hoping someone has run into this before.

Thanks,
Chris
--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 


