[slurm-users] Runaway jobs issue: Resource temporarily unavailable, slurm 17.11.3

Christopher Benjamin Coffey Chris.Coffey at nau.edu
Tue Apr 24 13:59:38 MDT 2018


We've gotten around the issue where we could not remove the runaway jobs. We had to go the manual route of manipulating the db directly. We actually used a great script that Loris Bennett wrote a while back; I hadn't needed it in a long while - thanks again! :)
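For anyone who hits this in the future, here is roughly what that manual route boils down to - a minimal sketch, not Loris's actual script. The table name assumes our cluster ("monsoon") and the default slurm_acct_db schema; stop slurmdbd and back up the database before trying anything like this:

mysql slurm_acct_db <<'SQL'
-- A runaway job is essentially a row with no recorded end time.
-- Caution: this blunt form would also catch jobs that are genuinely
-- still running, so only run it with the cluster drained, or add a
-- filter on id_job.
UPDATE monsoon_job_table
   SET time_end = time_start,  -- close the job out at its start time
       state = 3               -- 3 = JOB_COMPLETE in Slurm's job state enum
 WHERE time_end = 0
   AND time_start > 0;
SQL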

An item of interest for the developers: there appears to be a message-size limit that we had exceeded and that the "sacctmgr show runawayjobs" command (or the underlying slurmdbd exchange) could not handle. After fixing a good portion of the jobs (~32K left), "sacctmgr show runawayjobs" no longer displayed this error:

sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error     
sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable

And instead it gave the normal dialog: "would you like to remove them ... (y/n)", etc.

The limit appears to be here:

common/slurm_persist_conn.c
...
#define MAX_MSG_SIZE     (16*1024*1024)
...

I wonder if that should be increased? It's probably not normal to have 56K runaway jobs, but it still seems worth addressing.
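For reference, the message size slurmdbd was rejecting in the log below is nearly twice that cap:

# compiled-in cap vs. the msg_size from the slurmdbd errors below
echo $(( 16 * 1024 * 1024 ))      # 16777216 (MAX_MSG_SIZE)
echo $(( 31302621 - 16777216 ))   # 14525405 bytes over the limit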

Anyhoo, it seems things are back to normal as far as we can tell. We will be looking into providing faster storage for the db, but does it seem reasonable for Slurm to crash under the circumstances I described?
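On the db side, beyond faster storage, the Slurm accounting docs also suggest tuning a few InnoDB settings for slurmdbd. A my.cnf sketch, with illustrative values that should be sized to the host:

[mysqld]
innodb_buffer_pool_size  = 1024M   # a large pool helps on a dedicated db host
innodb_log_file_size     = 64M
innodb_lock_wait_timeout = 900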


Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 4/24/18, 10:20 AM, "slurm-users on behalf of Christopher Benjamin Coffey" <slurm-users-bounces at lists.schedmd.com on behalf of Chris.Coffey at nau.edu> wrote:

    Hi, we currently have a bunch (56K) of runaway jobs that we cannot clear:
    
    sacctmgr show runaway|wc -l
    sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
    sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable
    
    58588
    
    Has anyone run into this? We've tried restarting slurmdbd, slurmctld, mysql, etc., but it does not help.
    
    The slurmdbd log shows the following when the "sacctmgr show runawayjobs" command is run:
    
    [2018-04-24T07:56:03.869] error: Invalid msg_size (31302621) from connection 12(172.16.2.1) uid(3510)
    [2018-04-24T07:56:03.872] error: Invalid msg_size (31302621) from connection 7(172.16.2.1) uid(3510)
    [2018-04-24T07:56:03.874] error: Invalid msg_size (31302621) from connection 12(172.16.2.1) uid(3510)
    [2018-04-24T07:56:03.875] error: Invalid msg_size (31302621) from connection 7(172.16.2.1) uid(3510)
    [2018-04-24T07:56:03.877] error: Invalid msg_size (31302621) from connection 12(172.16.2.1) uid(3510)
    
    This seems to indicate that there are too many runaway jobs to clear in one pass. I wonder if there is a way to select a smaller batch for removal; I don't see such an option, however.
    
    This all started last week when Slurm crashed after being seriously hammered by a user submitting 500K 2-minute jobs. Slurmdbd appeared unable to handle all the transactions that slurmctld was sending it:
    
    ...
    [2018-04-15T20:16:35.021] slurmdbd: agent queue size 100
    [2018-04-16T11:54:29.312] slurmdbd: agent queue size 200
    [2018-04-18T17:53:22.339] slurmdbd: agent queue size 19100
    [2018-04-18T17:59:58.413] slurmdbd: agent queue size 64100
    [2018-04-18T18:06:10.143] slurmdbd: agent queue size 104300
    ...
    
    ...
    [2018-04-18T18:20:37.597] error: slurmdbd: agent queue filling (200214), RESTART SLURMDBD NOW
    ...
    
    ...
    error: slurmdbd: Sending fini msg: No error
    ...
    
    So at this point, lots of our nodes are idle, but Slurm is not starting jobs.
    
    [cbc at siris ~ ]$ sreport cluster utilization
    --------------------------------------------------------------------------------
    Cluster Utilization 2018-04-23T00:00:00 - 2018-04-23T23:59:59
    Usage reported in CPU Minutes
    --------------------------------------------------------------------------------
      Cluster Allocated     Down PLND Dow     Idle Reserved  Reported 
    --------- --------- -------- -------- -------- -------- --------- 
      monsoon   4216320        0        0        0        0   4216320
    
    
    sreport shows the entire cluster fully allocated, yet this is not the case; presumably the runaway jobs, which never had an end time recorded, are still being counted as allocated CPU minutes.
    
    I see that there is a fix for runaway jobs in version 17.11.5:
    
    -- sacctmgr - fix runaway jobs identification.
    
    We upgraded to 17.11.5 this morning, but we still cannot clear the runaway jobs. I wonder if we'll need to remove them manually with some mysql foo. We are investigating this now.
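    In the meantime, a sketch of what that mysql foo might look like as a first sanity check - counting the jobs the database still considers unfinished, which is essentially what "sacctmgr show runawayjobs" reports. The table name again assumes our cluster, "monsoon", in the default slurm_acct_db schema:

    # jobs with a start time but no recorded end time
    mysql slurm_acct_db -e \
      'SELECT COUNT(*) FROM monsoon_job_table WHERE time_end = 0 AND time_start > 0;'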
    
    Hopefully someone has run into this.
    
    Thanks,
    Chris
    —
    Christopher Coffey
    High-Performance Computing
    Northern Arizona University
    928-523-1167
     
    
    


