[slurm-users] Slurm database error messages (redux)
Will Dennis
wdennis at nec-labs.com
Tue May 7 15:01:05 UTC 2019
Hi all,
We had to restart the slurmdbd service on one of our clusters running Slurm 17.11.7 yesterday, since folks were experiencing errors with job scheduling and when running 'sacct':
-----
$ sacct -X -p -o jobid,jobname,user,partition%-30,nodelist,alloccpus,reqmem,cputime,qos,state,exitcode,AllocTRES%-50 -s R --allusers
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to captain1:6819: Connection refused
sacct: error: slurmdbd: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
-----
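(For reference, the restart itself was done roughly like this; just a sketch, assuming slurmdbd runs under systemd on captain1 with the default unit name:)
-----
# restart the accounting daemon and make sure it comes back up
sudo systemctl restart slurmdbd
sudo systemctl status slurmdbd
# quick check that clients can reach the DBD again on port 6819
sacctmgr show cluster
-----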
Looking in the logs post-restart, I see a large number of messages such as these:
-----
[2019-05-07T07:35:17.000] debug2: DBD_MODIFY_RESV: called
[2019-05-07T07:35:17.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:17.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:35.000] debug2: DBD_MODIFY_RESV: called
[2019-05-07T07:35:35.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:35.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:53.000] debug2: DBD_MODIFY_RESV: called
[2019-05-07T07:35:53.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:53.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:36:11.000] debug2: DBD_MODIFY_RESV: called
[2019-05-07T07:36:11.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:36:11.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
-----
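For what it's worth, the same pair of messages repeats about every 18 seconds for the same reservation. Something like the following (the log path is an assumption; it depends on LogFile in slurmdbd.conf) shows how often the message recurs and whether other reservations are affected:
-----
# count how many times this particular error has been logged
grep -c 'There is no reservation by id 4' /var/log/slurmdbd.log

# group all "no reservation" errors by message text, ignoring timestamps
grep 'There is no reservation' /var/log/slurmdbd.log | cut -d' ' -f2- | sort | uniq -c
-----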
I read today's list message entitled "Slurm database failure messages", and although the problem there is different, it linked to a bug report concerning problems with reservations. That report suggested gathering data via three commands; their output from our cluster is shown below:
-----
root@captain1:/var/log# scontrol show reservations
ReservationName=res17-pc2 StartTime=2019-02-25T14:58:40 EndTime=2029-02-22T14:58:40 Duration=3650-00:00:00
Nodes=res17-pc2 NodeCnt=1 CoreCnt=6 Features=(null) PartitionName=desktops Flags=SPEC_NODES
TRES=cpu=12
Users=samuel Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
ReservationName=res18-pc5 StartTime=2019-04-25T11:47:05 EndTime=2020-04-24T11:47:05 Duration=365-00:00:00
Nodes=res18-pc5 NodeCnt=1 CoreCnt=6 Features=(null) PartitionName=(null) Flags=SPEC_NODES
TRES=cpu=12
Users=grv Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
root@captain1:/var/log# sacctmgr show reservations
Cluster Name TRES TimeStart TimeEnd UnusedWall
---------- --------------- ------------------------------ ------------------- ------------------- ----------
rescluster res17-pc2 cpu=12 2019-04-24T13:29:52 2029-02-22T14:58:40 0.000000
mysql> select * from rescluster_resv_table\G
*************************** 1. row ***************************
id_resv: 1
deleted: 1
assoclist: 12
flags: 65535
nodelist: res17-pc2,captain2,server13k,server15k,server25k
node_inx: 0-4
resv_name: res17-pc2
time_start: 1551135476
time_end: 1551135512
tres: 1=140
unused_wall: 36
*************************** 2. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1551135520
time_end: 1551141705
tres: 1=12
unused_wall: 6176.5
*************************** 3. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1551141705
time_end: 1551734095
tres: 1=12
unused_wall: 581590
*************************** 4. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1551734095
time_end: 1551847812
tres: 1=12
unused_wall: 117173.666667
*************************** 5. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1551847812
time_end: 1552353438
tres: 1=12
unused_wall: 480521
*************************** 6. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1552353438
time_end: 1554771615
tres: 1=12
unused_wall: 2367043
*************************** 7. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1554771615
time_end: 1556137792
tres: 1=12
unused_wall: 2006236
*************************** 8. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1556137792
time_end: 1866495520
tres: 1=12
unused_wall: 0
8 rows in set (0.00 sec)
-----
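Note that the recurring error refers to reservation id 4 with time_start 1555628209, but rescluster_resv_table only contains rows for id_resv 1 and 2, and none of them has that time_start. Also, 'sacctmgr show reservations' only knows about res17-pc2, while 'scontrol show reservations' lists both res17-pc2 and res18-pc5. A query along these lines (just a sketch) confirms whether the row the slurmdbd keeps asking about exists at all:
-----
mysql> select id_resv, deleted, resv_name, time_start, time_end
    ->   from rescluster_resv_table
    ->  where id_resv = 4 or time_start = 1555628209;
-----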
So it seems to me that the reservation records are out of sync; how should I go about fixing this?
Thanks in advance for any help provided...