[slurm-users] Slurm database error messages (redux)
Will Dennis
wdennis at nec-labs.com
Tue May 7 15:01:05 UTC 2019
Hi all,
We had to restart the slurmdbd service on one of our clusters running Slurm 17.11.7 yesterday, since folks were experiencing errors with job scheduling and when running 'sacct':
-----
$ sacct -X -p -o jobid,jobname,user,partition%-30,nodelist,alloccpus,reqmem,cputime,qos,state,exitcode,AllocTRES%-50 -s R --allusers
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to captain1:6819: Connection refused
sacct: error: slurmdbd: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
-----
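(For reference, the restart itself was done roughly like this; just a sketch, assuming slurmdbd runs under systemd on captain1 with the default unit name:)
-----
# restart the accounting daemon and make sure it comes back up
sudo systemctl restart slurmdbd
sudo systemctl status slurmdbd
# quick check that clients can reach the DBD again on port 6819
sacctmgr show cluster
-----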
Looking in the logs post-restart, I see a large number of messages such as these:
-----
[2019-05-07T07:35:17.000] debug2: DBD_MODIFY_RESV: called
[2019-05-07T07:35:17.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:17.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:35.000] debug2: DBD_MODIFY_RESV: called
[2019-05-07T07:35:35.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:35.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:53.000] debug2: DBD_MODIFY_RESV: called
[2019-05-07T07:35:53.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:35:53.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:36:11.000] debug2: DBD_MODIFY_RESV: called
[2019-05-07T07:36:11.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
[2019-05-07T07:36:11.001] error: There is no reservation by id 4, time_start 1555628209, and cluster 'rescluster'
-----
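For what it's worth, the same pair of messages repeats about every 18 seconds for the same reservation. Something like the following (the log path is an assumption; it depends on LogFile in slurmdbd.conf) shows how often the message recurs and whether other reservations are affected:
-----
# count how many times this particular error has been logged
grep -c 'There is no reservation by id 4' /var/log/slurmdbd.log

# group all "no reservation" errors by message text, ignoring timestamps
grep 'There is no reservation' /var/log/slurmdbd.log | cut -d' ' -f2- | sort | uniq -c
-----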
I read today's list message entitled "Slurm database failure messages", and although the problem there is different, it linked to a bug report concerning problems with reservations. That report suggested gathering data via three commands; their output from our cluster is shown below:
-----
root@captain1:/var/log# scontrol show reservations
ReservationName=res17-pc2 StartTime=2019-02-25T14:58:40 EndTime=2029-02-22T14:58:40 Duration=3650-00:00:00
Nodes=res17-pc2 NodeCnt=1 CoreCnt=6 Features=(null) PartitionName=desktops Flags=SPEC_NODES
TRES=cpu=12
Users=samuel Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
ReservationName=res18-pc5 StartTime=2019-04-25T11:47:05 EndTime=2020-04-24T11:47:05 Duration=365-00:00:00
Nodes=res18-pc5 NodeCnt=1 CoreCnt=6 Features=(null) PartitionName=(null) Flags=SPEC_NODES
TRES=cpu=12
Users=grv Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
root@captain1:/var/log# sacctmgr show reservations
Cluster Name TRES TimeStart TimeEnd UnusedWall
---------- --------------- ------------------------------ ------------------- ------------------- ----------
rescluster res17-pc2 cpu=12 2019-04-24T13:29:52 2029-02-22T14:58:40 0.000000
mysql> select * from rescluster_resv_table\G
*************************** 1. row ***************************
id_resv: 1
deleted: 1
assoclist: 12
flags: 65535
nodelist: res17-pc2,captain2,server13k,server15k,server25k
node_inx: 0-4
resv_name: res17-pc2
time_start: 1551135476
time_end: 1551135512
tres: 1=140
unused_wall: 36
*************************** 2. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1551135520
time_end: 1551141705
tres: 1=12
unused_wall: 6176.5
*************************** 3. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1551141705
time_end: 1551734095
tres: 1=12
unused_wall: 581590
*************************** 4. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1551734095
time_end: 1551847812
tres: 1=12
unused_wall: 117173.666667
*************************** 5. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1551847812
time_end: 1552353438
tres: 1=12
unused_wall: 480521
*************************** 6. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1552353438
time_end: 1554771615
tres: 1=12
unused_wall: 2367043
*************************** 7. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1554771615
time_end: 1556137792
tres: 1=12
unused_wall: 2006236
*************************** 8. row ***************************
id_resv: 2
deleted: 0
assoclist: 12
flags: 32768
nodelist: res17-pc2
node_inx: 0
resv_name: res17-pc2
time_start: 1556137792
time_end: 1866495520
tres: 1=12
unused_wall: 0
8 rows in set (0.00 sec)
-----
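Note that the recurring error refers to reservation id 4 with time_start 1555628209, but rescluster_resv_table only contains rows for id_resv 1 and 2, and none of them has that time_start. Also, 'sacctmgr show reservations' only knows about res17-pc2, while 'scontrol show reservations' lists both res17-pc2 and res18-pc5. A query along these lines (just a sketch) confirms whether the row the slurmdbd keeps asking about exists at all:
-----
mysql> select id_resv, deleted, resv_name, time_start, time_end
    ->   from rescluster_resv_table
    ->  where id_resv = 4 or time_start = 1555628209;
-----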
So it seems to me that the reservation records are out of sync; how should I go about fixing this?
Thanks in advance for any help provided...