[slurm-users] sacct end time for failed jobs
Paul Edmon
pedmon at cfa.harvard.edu
Wed Mar 6 18:32:04 UTC 2019
Odds are the new version won't help with that. You will have to do some
MySQL work to fix it, then.
-Paul Edmon-
On 3/6/2019 1:23 PM, Brian Andrus wrote:
>
> I am running the latest and did that, but it didn't change anything.
> The jobs stay in the runaway state and no changes are made to the
> database.
>
> Using 18.08.2-1.
>
> Maybe try updating to 19.05.0-0pre1?
>
> Brian
>
>
> On 3/6/2019 10:06 AM, Paul Edmon wrote:
>>
>> A lot of this is automated in the new versions of slurm. You should
>> just need to run:
>>
>> sacctmgr show runawayjobs
>>
>> It will then offer to clean them up, and Slurm will handle the
>> rest. If you add the -i option, it will clean them automatically
>> without asking for confirmation.
>>
>> -Paul Edmon-
>>
>> On 3/6/2019 11:58 AM, Cyrus Proctor wrote:
>>>
>>> Hi Brian,
>>>
>>> Others probably have better suggestions before going the route I'm
>>> about to detail. If you do go this route, be warned, you definitely
>>> have the ability to irrevocably lose data or destroy your Slurm
>>> accounting database. Do so at your own risk. I got here with
>>> Google-fu after running out of other (known to me) options. Someone
>>> please save Brian from having to do what comes below ;-)
>>>
>>> Last warning: I'd recommend turning off slurmdbd and backing up the
>>> database (mysqldump) before going forward.
>>>
>>> In my case, runaway jobs did not show up with `sacctmgr list
>>> runawayjobs`. My problem was that I could not remove a user from the
>>> Slurm database because it thought they still had active jobs. The
>>> likely cause was the slurmdbd daemon not shutting down gracefully at
>>> some point. The job was long gone but was still in a pending state:
>>>
>>> # sacct -j 899139
>>> JobID JobName Partition Account AllocCPUS State ExitCode
>>> ------------ ---------- ---------- ---------- ---------- ---------- --------
>>> 899139 equil gpu-long p-1234 20 PENDING 0:0
>>> # scontrol show job 899139
>>> slurm_load_jobs error: Invalid job id specified
>>> # mysql -u root -p
>>> ...
>>> Welcome to the MySQL monitor. Commands end with ; or \g.
>>> Your MySQL connection id is 7453
>>> Server version: 5.1.73 Source distribution
>>>
>>> Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
>>>
>>> Oracle is a registered trademark of Oracle Corporation and/or its
>>> affiliates. Other names may be trademarks of their respective
>>> owners.
>>>
>>> Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
>>>
>>> mysql> use slurm_acct_db;
>>> Reading table information for completion of table and column names
>>> You can turn off this feature to get a quicker startup with -A
>>>
>>> Database changed
>>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>>> +-------+----------+------------+-------------+----------+-----------+
>>> | state | time_end | time_start | time_submit | id_assoc | partition |
>>> +-------+----------+------------+-------------+----------+-----------+
>>> | 0 | 0 | 0 | 1546880711 | 2078 | gpu-long |
>>> +-------+----------+------------+-------------+----------+-----------+
>>> 1 row in set (0.00 sec)
>>>
>>> mysql> update banana_job_table set state=3 where id_job=899139;
>>> Query OK, 1 row affected (0.00 sec)
>>> Rows matched: 1 Changed: 1 Warnings: 0
>>>
>>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>>> +-------+----------+------------+-------------+----------+-----------+
>>> | state | time_end | time_start | time_submit | id_assoc | partition |
>>> +-------+----------+------------+-------------+----------+-----------+
>>> | 3 | 0 | 0 | 1546880711 | 2078 | gpu-long |
>>> +-------+----------+------------+-------------+----------+-----------+
>>> 1 row in set (0.00 sec)
>>>
>>> mysql> update banana_job_table set time_start=1546880712 where id_job=899139;
>>> Query OK, 1 row affected (0.00 sec)
>>> Rows matched: 1 Changed: 1 Warnings: 0
>>>
>>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>>> +-------+----------+------------+-------------+----------+-----------+
>>> | state | time_end | time_start | time_submit | id_assoc | partition |
>>> +-------+----------+------------+-------------+----------+-----------+
>>> | 3 | 0 | 1546880712 | 1546880711 | 2078 | gpu-long |
>>> +-------+----------+------------+-------------+----------+-----------+
>>> 1 row in set (0.00 sec)
>>>
>>> mysql> update banana_job_table set time_end=1546880713 where id_job=899139;
>>> Query OK, 1 row affected (0.01 sec)
>>> Rows matched: 1 Changed: 1 Warnings: 0
>>>
>>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>>> +-------+------------+------------+-------------+----------+-----------+
>>> | state | time_end | time_start | time_submit | id_assoc | partition |
>>> +-------+------------+------------+-------------+----------+-----------+
>>> | 3 | 1546880713 | 1546880712 | 1546880711 | 2078 | gpu-long |
>>> +-------+------------+------------+-------------+----------+-----------+
>>> 1 row in set (0.00 sec)
>>>
>>> In this case, for job ID 899139 on the banana cluster, neither the
>>> state nor the start and end times had been updated. I went in and
>>> manually edited the job's row so that Slurm thought the job was
>>> complete, with plausible start and end times. Again, this worked for
>>> me. I don't know if this is your problem or not. If you choose this
>>> route, be careful and good luck!
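
The manual edits above can also be scripted so that the checks and the update happen together. The following is only a minimal sketch, not a tested procedure for a production slurmdbd database: it uses Python's built-in sqlite3 with an in-memory table as a stand-in for MySQL, and the function names (`close_stuck_job`, `epoch_to_utc`) are hypothetical. The table name, job ID, state values, and timestamps are taken from the session above; a real MySQL driver may use a different placeholder syntax than `?`.

```python
import sqlite3
from datetime import datetime, timezone

JOB_PENDING = 0   # state value of the stuck row in the thread
JOB_COMPLETE = 3  # state value the thread writes to mark the job finished


def epoch_to_utc(ts):
    """Render a Unix timestamp (as stored in the job table) as ISO-8601 UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


def close_stuck_job(conn, table, job_id):
    """Mark one stuck pending job complete, or raise without changing anything.

    `table` is interpolated into the SQL text (table names cannot be bound
    as parameters), so it must come from a trusted source.
    """
    rows = conn.execute(
        f"SELECT state, time_end, time_submit FROM {table} WHERE id_job = ?",
        (job_id,),
    ).fetchall()
    if len(rows) != 1:
        raise LookupError(f"expected 1 row for job {job_id}, found {len(rows)}")
    state, time_end, time_submit = rows[0]
    if state != JOB_PENDING or time_end != 0:
        raise ValueError(f"job {job_id} does not look stuck; refusing to edit")

    # Same values as the thread: start one second after submit, end one later.
    time_start, new_end = time_submit + 1, time_submit + 2
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            f"UPDATE {table} SET state = ?, time_start = ?, time_end = ? "
            "WHERE id_job = ? AND state = ? AND time_end = 0",
            (JOB_COMPLETE, time_start, new_end, job_id, JOB_PENDING),
        )
        if cur.rowcount != 1:
            raise RuntimeError("unexpected row count; rolling back")
    return time_start, new_end


# Stand-in for the accounting database, holding only the columns used here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE banana_job_table (id_job INTEGER, state INTEGER, "
    "time_end INTEGER, time_start INTEGER, time_submit INTEGER)"
)
conn.execute("INSERT INTO banana_job_table VALUES (899139, 0, 0, 0, 1546880711)")
conn.commit()

start, end = close_stuck_job(conn, "banana_job_table", 899139)
fixed = conn.execute(
    "SELECT state, time_end, time_start FROM banana_job_table WHERE id_job = 899139"
).fetchone()
print(fixed, epoch_to_utc(start), epoch_to_utc(end))
```

Doing all three column changes in one guarded UPDATE means the row is never left half-edited, and a second run against an already-fixed job raises instead of overwriting it. The earlier advice still applies: stop slurmdbd and take a mysqldump before touching the real table.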
>>>
>>> On 3/6/19 10:15 AM, Brian Andrus wrote:
>>>>
>>>> It shows several jobs that all have "Unknown" for end_time. Some
>>>> are PENDING and some are RUNNING (none are truly in either state).
>>>>
>>>> It asked to fix them, which I did, but nothing seems to have
>>>> changed. They still show up with that command and in reports.
>>>>
>>>>
>>>> Brian
>>>>
>>>> On 3/5/2019 10:34 PM, Chris Samuel wrote:
>>>>> On Tuesday, 5 March 2019 10:07:30 AM PST Brian Andrus wrote:
>>>>>
>>>>>> Does anyone have a process they use to handle empty (aka "Unknown")
>>>>>> end times for jobs that are not running?
>>>>> What does:
>>>>>
>>>>> sacctmgr list runawayjobs
>>>>>
>>>>> say?
>>>>>
>>>>