[slurm-users] sacct issue: jobs staying in "RUNNING" state

Will Dennis wdennis at nec-labs.com
Wed Jul 17 16:55:30 UTC 2019


I don't think the server (which runs both the Slurm controller daemon and the DB) is the issue... It's a Dell PowerEdge R430 with dual Intel Xeon E5-2640v3 CPUs, 256GB of memory, and a RAID-1 array of 1TB SATA disks.

top - 09:29:26 up 101 days, 14:57,  3 users,  load average: 0.06, 0.02, 0.00
Tasks: 421 total,   1 running, 241 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.1 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 26392008+total, 25892228+free,   784884 used,  4212904 buff/cache
KiB Swap:   999420 total,   999420 free,        0 used. 26091148+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  331 mysql     20   0 2096608 351552  19752 S   0.3  0.1  63:15.09 mysqld
28476 slurm     20   0 3800496  20940   5320 S   0.7  0.0  10:58.34 slurmctld
[...]


mytop output:
MySQL on localhost (5.7.26)           load 0.00 0.02 0.00 1/523 20112 up 77+03:04:52 [09:51:44]
 Queries: 3.0M     qps:    0 Slow:     0.0         Se/In/Up/De(%):    69/02/01/00
 Sorts:      0 qps now:    2 Slow qps: 0.0  Threads:    3 (   1/   5) 25/00/00/00
 Key Efficiency: 80.4%  Bps in/out:  97.0/481.3   Now in/out: 118.6/ 2.9k

       Id      User         Host/IP         DB       Time    Cmd    State Query
       --      ----         -------         --       ----    ---    ----- ----------
     2790     slurm       localhost slurm_acct        393  Sleep
     2792     slurm       localhost slurm_acct          1  Sleep
     2786      root       localhost                     0  Query starting show full processlist


SHOW ENGINE INNODB STATUS output:
=====================================
2019-07-17 09:34:17 0x7f4018226700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 14 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 11655 srv_active, 0 srv_shutdown, 6650270 srv_idle
srv_master_thread log flush and writes: 6659931
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 24514
OS WAIT ARRAY INFO: signal count 26260
RW-shared spins 0, rounds 51554, OS waits 23964
RW-excl spins 0, rounds 41462, OS waits 260
RW-sx spins 1279, rounds 18628, OS waits 154
Spin rounds per wait: 51554.00 RW-shared, 41462.00 RW-excl, 14.56 RW-sx
------------
TRANSACTIONS
------------
Trx id counter 524318
Purge done for trx's n:o < 524193 undo n:o < 0 state: running but idle
History list length 5
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 421388533043752, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421388533041912, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421388533042832, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421388533040992, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
 ibuf aio reads:, log i/o's:, sync i/o's:
Pending flushes (fsync) log: 0; buffer pool: 0
1135 OS file reads, 416077 OS file writes, 105781 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 0, seg size 2, 31 merges
merged operations:
 insert 119, delete mark 0, delete 0
discarded operations:
 insert 0, delete mark 0, delete 0
Hash table size 34673, node heap has 4 buffer(s)
Hash table size 34673, node heap has 2 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 14 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 4 buffer(s)
0.07 hash searches/s, 0.71 non-hash searches/s
---
LOG
---
Log sequence number 126853542
Log flushed up to   126853542
Pages flushed up to 126853542
Last checkpoint at  126853533
0 pending log flushes, 0 pending chkp writes
62541 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 137428992
Dictionary memory allocated 330655
Buffer pool size   8191
Free buffers       3112
Database pages     5051
Old database pages 1844
Modified db pages  0
Pending reads      0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 29, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 1073, created 3978, written 338632
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 5051, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
0 read views open inside InnoDB
Process ID=331, Main thread ID=139913181783808, state: sleeping
Number of rows inserted 2092221, updated 153579, deleted 1501, read 2793036658
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.21 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT


I don't immediately see any performance problems with either the OS or MySQL...
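
For anyone who wants to repeat this check without reading the whole dump, here is a minimal Python sketch that pulls the queue and pending-flush counters out of saved SHOW ENGINE INNODB STATUS text. The function name innodb_backlog is hypothetical, and the patterns are modeled only on the MySQL 5.7.26 output pasted above, so other versions may word these lines differently.

```python
import re


def innodb_backlog(status_text):
    """Return the queue and pending-flush counters from InnoDB status text."""
    # Patterns copied from the 5.7.26 output in this message (assumption:
    # the wording is the same on the server being checked).
    queries = re.search(
        r"(\d+) queries inside InnoDB, (\d+) queries in queue", status_text
    )
    flushes = re.search(
        r"Pending flushes \(fsync\) log: (\d+); buffer pool: (\d+)", status_text
    )
    if queries is None or flushes is None:
        raise ValueError("expected InnoDB status lines not found")
    return {
        "queries_inside": int(queries.group(1)),
        "queries_queued": int(queries.group(2)),
        "pending_log_flushes": int(flushes.group(1)),
        "pending_pool_flushes": int(flushes.group(2)),
    }


if __name__ == "__main__":
    # Two lines taken verbatim from the status output above.
    sample = (
        "Pending flushes (fsync) log: 0; buffer pool: 0\n"
        "0 queries inside InnoDB, 0 queries in queue\n"
    )
    backlog = innodb_backlog(sample)
    print(backlog)
    print("idle" if not any(backlog.values()) else "backlog present")
```

Note that this only describes the moment the status was captured; a snapshot taken while the cluster is quiet says nothing about how MySQL behaves when many job steps start at once.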



-----Original Message----- 
From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of Brian W. Johanson
Sent: Wednesday, July 17, 2019 10:44 AM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] sacct issue: jobs staying in "RUNNING" state


On 7/17/19 12:26 AM, Chris Samuel wrote:
> On 16/7/19 11:43 am, Will Dennis wrote:
>
>> [2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full 
>> (20140), discarding DBD_STEP_START:1442 request
>
> So it looks like your slurmdbd cannot keep up with the rate of these 
> incoming steps and is having to throw away messages.
>
>> [2019-07-16T09:40:27.515] error: slurmdbd: agent queue filling 
>> (20140), RESTART SLURMDBD NOW
>
> Have you tried doing what it told you to?
>
> You may want to look at the performance of your MySQL server to see if 
> it's failing to keep up with what slurmdbd is asking it to do.
>
> All the best,
> Chris
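
To get a sense of how many accounting records were dropped, the slurmctld log can be scanned for the "agent queue is full" message quoted above. The following is a rough Python sketch, not a Slurm tool: the function name count_discards is hypothetical, and the pattern is modeled only on the log line quoted in this thread (unwrapped onto a single line), so other Slurm versions may format the message differently.

```python
import re
from collections import Counter

# Pattern modeled on the slurmctld log line quoted in this thread
# (assumption: the message format is the same in the log being scanned).
DISCARD_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: slurmdbd: agent queue is full "
    r"\((?P<size>\d+)\), discarding (?P<msg>\w+):\d+ request"
)


def count_discards(lines):
    """Count discarded slurmdbd requests per message type."""
    counts = Counter()
    for line in lines:
        match = DISCARD_RE.match(line.strip())
        if match:
            counts[match.group("msg")] += 1
    return counts


if __name__ == "__main__":
    # The two log lines from this thread, each joined onto one line.
    sample = [
        "[2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full "
        "(20140), discarding DBD_STEP_START:1442 request",
        "[2019-07-16T09:40:27.515] error: slurmdbd: agent queue filling "
        "(20140), RESTART SLURMDBD NOW",
    ]
    for msg_type, n in count_discards(sample).items():
        print(f"{msg_type}: {n} discarded")
```

To run it against a real log, pass an open file object to count_discards instead of the sample list. Discarded step and job messages are one plausible reason for sacct still showing jobs as RUNNING after they have finished.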

Once you have the database performance issues addressed, sacctmgr can clean up the entries for completed jobs listed as running.
'sacctmgr show runawayjobs' (or 'sacctmgr list runawayjobs')

RunawayJobs
               Used only with the list or show command to report current jobs that have been orphaned on the local cluster and are now runaway.  If there are jobs in this state it will also give you an option to "fix" them.




