[slurm-users] "We have more time than is possible" in slurmdbd.log with no runaway jobs

Wed Feb 6 15:33:07 UTC 2019

Hi All

I am seeing this after some hours of MySQL downtime yesterday (to correct
something else), but I didn't notice these errors until after I had
performed the Slurm update to 18.08, which went through fine in spite of
these errors.

Firstly, when restarting slurmdbd before I started the update:

[2019-02-06T11:28:44.398] slurmdbd version 17.02.7 started
[2019-02-06T11:28:46.194] error: We have more time than is possible
(4536000+6566400+0)(11102400) > 10886400 for cluster cluster(3024) from
2019-02-06T07:00:00 - 2019-02-06T08:00:00 tres 1
[2019-02-06T11:28:46.199] error: We have more time than is possible
(4536000+6566400+0)(11102400) > 10886400 for cluster cluster(3024) from
2019-02-06T08:00:00 - 2019-02-06T09:00:00 tres 1
[2019-02-06T11:28:46.204] error: We have more time than is possible
(4536000+6566400+0)(11102400) > 10886400 for cluster cluster(3024) from
2019-02-06T09:00:00 - 2019-02-06T10:00:00 tres 1
[2019-02-06T11:28:46.210] error: We have more time than is possible
(4031100+7070700+0)(11101800) > 10886400 for cluster cluster(3024) from
2019-02-06T10:00:00 - 2019-02-06T11:00:00 tres 1

The first place I spotted it was here:
[2019-02-06T12:23:50.276] Conversion done: success!
[2019-02-06T12:23:50.281] Accounting storage MYSQL plugin loaded
[2019-02-06T12:23:50.734] slurmdbd version 18.08.4 started
[2019-02-06T12:23:50.765] error: We have more time than is possible
(3456000+11911388+0)(15367388) > 15336000 for cluster cluster(4624) from
2019-02-06T11:00:00 - 2019-02-06T12:00:00 tres 1

And now it repeats every hour:
[2019-02-06T13:00:00.186] error: We have more time than is possible
(3456000+13219200+0)(16675200) > 16646400 for cluster cluster(4624) from
2019-02-06T12:00:00 - 2019-02-06T13:00:00 tres 1
[2019-02-06T14:00:00.283] error: We have more time than is possible
(3456000+13212800+0)(16668800) > 16646400 for cluster cluster(4624) from
2019-02-06T13:00:00 - 2019-02-06T14:00:00 tres 1
[2019-02-06T15:00:00.369] error: We have more time than is possible
(3456000+13219200+0)(16675200) > 16646400 for cluster cluster(4624) from
2019-02-06T14:00:00 - 2019-02-06T15:00:00 tres 1

15:59:45 [root ~]# sacctmgr list runawayjobs
Runaway Jobs: No runaway jobs found on cluster cluster

And, just because of the convenient timing:

16:04:31 [root ~]# tail /var/log/slurm/slurmdbd.log -n 1
[2019-02-06T16:00:00.917] error: We have more time than is possible
(3456000+13219200+0)(16675200) > 16646400 for cluster cluster(4624) from
2019-02-06T15:00:00 - 2019-02-06T16:00:00 tres 1
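
As a sanity check on the figures in these messages, here is a minimal Python
sketch that parses one message (with the wrapped lines joined into one) and
checks the arithmetic. The regular expression, the function name check_line
and the result keys are illustrative choices, not anything from Slurm. The
idea that the limit equals the number in parentheses after the cluster name
times 3600 seconds is only an observation from the numbers above, not a
statement about how slurmdbd computes it.

```python
import re

# Matches the numeric part of the message as it appears in the log excerpts:
# (a+b+c)(total) > limit for cluster NAME(count)
PATTERN = re.compile(
    r"\((\d+)\+(\d+)\+(\d+)\)\((\d+)\) > (\d+) for cluster (\S+)\((\d+)\)"
)


def check_line(line):
    """Return simple arithmetic checks for one joined log line, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    a, b, c, total, limit = (int(x) for x in m.group(1, 2, 3, 4, 5))
    count = int(m.group(7))
    return {
        # Do the three bracketed terms add up to the reported total?
        "parts_sum_to_total": a + b + c == total,
        # By how many seconds does the total exceed the limit?
        "excess_seconds": total - limit,
        # Assumption being tested: limit == count * one hour in seconds.
        "limit_is_count_times_3600": limit == count * 3600,
    }


if __name__ == "__main__":
    sample = (
        "[2019-02-06T13:00:00.186] error: We have more time than is possible "
        "(3456000+13219200+0)(16675200) > 16646400 for cluster cluster(4624) "
        "from 2019-02-06T12:00:00 - 2019-02-06T13:00:00 tres 1"
    )
    print(check_line(sample))
```

For the 12:00 - 13:00 message this reports an excess of 28800 seconds, and
the limit does equal 4624 * 3600. For the 11:00 - 12:00 message logged right
after the upgrade, the limit (15336000) is lower than 4624 * 3600, so the last
check comes back False there.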

There are 5 jobs that have been running throughout and are yet to complete.
Is it possible this will stop when they have? What else could be causing
this?

Thanks

Antony